I feel like it’s kind of sad and pathetic, all the blogs and people clamouring to be “the one” that “breaks” this story and “outs OpenAI” as having made GPT-4 out of smaller components rather than a “big” breakthrough.
Do you have any real information about the architecture? Nope.
Do you have anything other than hearsay? Nope.
Do you know there isn’t actually some important breakthrough? Nope.
Is gpt4 still the best model available despite whatever shortcuts they may have taken? Yup.
…so… until we have some concrete, repeatable information, this is just a lot of hot air.
They did something.
No one knows what.
The results are better than what anyone else has done.
A bunch of people have speculations about what, but no one has been able to pull off an equivalent yet.
Same as yesterday.
Add to that a paywall after the intro. Quite frustrating.
whoa. archive.fo skips the substack paywall?
I get an SSL_ERROR_NO_CYPHER_OVERLAP error on that one (FF)
There is some problem between archive.today and Cloudflare DNS; I see people complain about it daily on this site. Here is how to fix it if you are not using Cloudflare DNS and are still getting the error (and are using Firefox):
1. Firefox settings -> type dns in the search bar
2. Under DNS over HTTPS, either turn it off or select NextDNS as the provider in the increased-protection box.
3. archive.today should start working now
Error 1001 Ray ID: 7e2074e88c84b804 • 2023-07-05 14:56:47 UTC DNS resolution error
welp...
Both links work just fine here. You probably over-blocked something somewhere ;)
no other site has this problem
What FF version? No issues here, I'm on 102 ESR
115.0
I have been extremely impressed with Bing with its GPT integration. It's worlds better than the free-tier ChatGPT. Bing actually feels like talking to a human. My daughter uses it as a tutor for practising high school math. Whenever her textbook or teacher isn't doing enough to help her understand alien or difficult concepts, she has a long conversation with Bing where she asks it to break things down as much as possible. LLMs are truly changing education.
Obviously, the free tier is 3.5 and Bing runs on 4. That being said, Bing isn’t as good as it used to be: the 5-conversation limit really hampers being able to nail down what you’re looking for, and they’ve neutered it compared to when it first came out, when it was much better/different.
That's probably because it's GPT-4, not the free tier GPT-3.5.
I don't like the focus on search results. In so many cases it could just give you the information instead of spiking it with useless links.
> Second, let’s give credit where credit’s due. GPT-4 is exactly as impressive as users say. The details of the internal architecture can’t change that. If it works, it works. It doesn’t matter whether it’s one model or eight tied together.
If this is true, I wonder how likely it is that chain of thought is involved in passing data between the different models.
Chain of thought is an in-context technique. The ensemble of models concept that GPT-4 supposedly uses works within the model itself. My understanding of the relevant papers is poor, but my impression is that there are layers that help the model select which two of the 16 different “experts” should contribute to the generation of the next token.
The 16 models in the ensemble are all trained in the same way, but attend to the input data differently. It’s a little like how multi-headed attention works by splitting the input embeddings into (typically) 8 parts and having a set of query, key, and value vectors each train on their own 1/8th slice of the input embeddings. Although there is no “meaning” to these slices, the KQV vectors for each head will learn different relationships between the inputs simply because they were trained differently.
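For intuition, here is a minimal sketch of what top-2 routing over a bank of expert FFNs could look like (toy dimensions and names; this is not OpenAI's actual code, just the general shape described in the MoE papers):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top2MoE(nn.Module):
        """Toy mixture-of-experts FFN layer: each token is processed by
        only its 2 highest-scoring experts out of 16."""
        def __init__(self, d_model=512, d_ff=2048, n_experts=16):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)  # learned router
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff),
                              nn.ReLU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts))

        def forward(self, x):                  # x: (n_tokens, d_model)
            scores = self.gate(x)              # (n_tokens, n_experts)
            top = scores.topk(2, dim=-1)       # each token picks 2 experts
            w = F.softmax(top.values, dim=-1)  # mixing weights for the pair
            out = torch.zeros_like(x)
            for slot in range(2):
                for e, expert in enumerate(self.experts):
                    mask = top.indices[:, slot] == e
                    if mask.any():  # only routed tokens pay for expert e
                        out[mask] += w[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out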
I have often wondered about this: does padding and chopping to context size along arbitrary boundaries have a lasting effect on the overall data (due to misalignment at the ends, some words or lines getting chopped in the middle, etc.)? Does that net out because we do it consistently, or do we lose vital information that compounds across the entire training data set?
Like a binary tree of models?
My guess is that it isn't doing a lot of self-interaction. I've been using it for coding, and the replies feel much too coherent to be either:
1. expanded from an outline (a 2-stage process)
or
2. chained between models
If it was multi-stage, I'd expect some Chinese-whispers-style drift, where the plot gets lost slightly between steps. The responses I'm getting from GPT-4 are focused and specific.
Likewise, if the responses were getting chained between models, I'd expect visible seams in tone / content in between.
My guess for how they're architecting it, if it is 8 models, is either:
1. Each response handled by 1 model
or
2. Some kind of voting / confidence system that switches between models on the fly (a toy sketch of this option follows)
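For what it's worth, guess #2 could be as simple as this sketch (pure speculation; `perplexity` and `generate` are hypothetical stand-ins, not anything OpenAI has described):

    # Pick whichever model is most "confident" on the prompt (here, the
    # one with the lowest perplexity), then let it handle the whole response.
    def route(prompt, models):
        best = min(models, key=lambda m: m.perplexity(prompt))
        return best.generate(prompt)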
>the details of the internal architecture can't change that
GPT4 is definitely useful, but it points towards bad news for OpenAI and potentially the entire field. A lot of people really wanted to believe that OpenAI had some secret sauce that actually pointed towards a path at true AI.
Turns out they just poached Google's own researchers and did a better job of turning Google research into a product (all the papers for this type of architecture came from Google Brain, and the authors are now at OpenAI). OpenAI is doing impressive work on the practical side of AI but apparently nothing revolutionary in terms of research, which is why people are disappointed by this reveal.
True AI? What, like 'one model to rule them all?' Why does it matter how many models we are using? Do you have more than one computer in your computer? Is it a true computer?
“…true AI.”
What do you understand this to mean?
Is there a tool or technique called chain of thought, or are you talking about the colloquial concept as it relates to thinking?
It has a specific technical connotation but it does map exactly to what you think it is.[1]
Basically it's a trick that recognizes that language transformers only perform computation when generating words, so for complex tasks you can get better results by asking the model to explain its chain of thought and only give the answer at the end. This has the effect of giving the model "time to think". If it didn't generate those words, it wouldn't have anything to hang that computation off of, since it is fundamentally a word-generation model.
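As a concrete illustration (`ask_llm` is a hypothetical stand-in for whatever completion API you use), the whole technique is just a change to the prompt:

    def ask_llm(prompt: str) -> str:
        ...  # hypothetical: call your favourite LLM completion API here

    question = ("A bat and a ball cost $1.10 in total. The bat costs "
                "$1.00 more than the ball. How much does the ball cost?")

    # Direct: the model must commit to an answer within its first few tokens.
    direct = ask_llm(question)

    # Chain of thought: the generated reasoning tokens are extra computation
    # the model performs before committing to a final answer.
    cot = ask_llm(question + "\nLet's think step by step, "
                  "then state the final answer.")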
It's a technique, but could also be likened to how we think.
Here's a simple example: key terms are extracted from an image using text detection. Those key terms will sometimes have bad reads, where "aligned AI" might pop out as "aligned Al". A subsequent "internal thought" is then formed that asks, "What's wrong with the 'aligned Al' key term?" If an updated response is returned, we use it instead of the original output.
They’re referring to a method of prompting the model that encourages it to think through something step by step rather than spit out the answer.
I think this is the paper that really kicked off this technique: https://arxiv.org/abs/2201.11903
https://ai.googleblog.com/2022/05/language-models-perform-re...
You can google for more results / papers.
That's drawing a lot of conclusions from a very vague one-line architectural description. The way those multiple models work together probably is innovative. Who knows.
I think a more interesting article would technically sketch out some potential ways it could work in detail, for an open-source effort to try to imitate it. That could cover a broad range of things, because we don't even know at what level the multiple things working together are implemented.
Turns out GPT-4 was 8 kids in a trench coat. And hey, why not? If the brain can have regions, perhaps that's likewise a useful abstraction for LLMs. I'd love more info on model performance differences between monolithic vs manifold/subdivided architectures. Does model composition give us a "more than the sum of its p̶a̶r̶t̶s̶ weights" phenomenon, or just some implementation conveniences?
Side note: I loved the George Hotz interview on Lex Fridman's podcast. I strongly disagree with some of his opinions, but thoroughly enjoyed hearing them. I highly recommend it, if you're looking for some entertainment.
I'd be interested in listening to George Hotz without having Fridman involved at all.
Interesting. I hated it. I usually like Lex Fridman’s interviews even when (or even more when) I disagree with the interviewee. I agreed with Hotz a fair amount, but I could not build empathy.
I tried, but for me he is too childish despite his undeniable intelligence. I am sorry for him.
And it's expensive, especially the API. Everyone knows the scalability of plain ChatGPT, and it's there as a fallback, so we tend to forget that and kind of attribute it to GPT-4 as well. But it will take two orders of magnitude of performance improvements/hardware speedups to be cheap for many interesting cases, and three orders of magnitude to be interactively/conversationally cheap. At least if the rate limit and official pricing reflect the underlying reality.
Now we're seeing decent 13-40B models (around ChatGPT 3.5 level). 4-5 bit GPTQ quantizations are working pretty well, so these models actually fit on consumer GPUs. So many new models come out on Hugging Face every day that it is hard to keep up with the foundation models. We are in a great time for regular users being able to play with LLMs. People are loading onto their consumer 4090 GPUs models that would have been state of the art a couple of years ago.
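For example, something along these lines loads a 4-bit-quantized ~13B model on a single 24 GB card (the model name is just an example; this assumes transformers with bitsandbytes and accelerate installed):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "openlm-research/open_llama_13b"  # example repo; any ~13B model works

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",   # place layers on the available GPU(s)
        load_in_4bit=True,   # ~4 bits per weight: roughly 8 GB for a 13B model
    )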
We're also plateauing with LLM performance, they aren't scaling past 200B it seems even with 1T+ token training (that's why they "copped out" and did MoE for ChatGPT)
I think LLM AI performance will likely stabilize, but I also think we'll easily get that 3 orders of magnitude in the next 5 years. So, maybe AI won't be super smart but it will be everywhere.
> We're also plateauing with LLM performance, they aren't scaling past 200B it seems even with 1T+ token training (that's why they "copped out" and did MoE for ChatGPT)
Can you point to anything other than speculation by geohot here? I've heard the same thing, but all of this has been circulating in the Twitter sphere and I haven't seen any supporting research to back up the claim.
In my opinion these are the two alternatives, and we're left making educated guesses. I mean, it's clearly rumors, but look at the scientific results on your bread-and-butter language modeling tasks: do you see any clear basic-architecture wins? For instance, look at the HellaSwag results, https://rowanzellers.com/hellaswag/. GPT-4 is impressive, but RoBERTa is not far behind, and that's from 2019. It's from 2019 is the point I'm trying to drive home. T5, RoBERTa, Transformer-XL: all old-as-hell (for ML/AI) architectures, but still pretty top contenders.
At this point I think we'd see more big and basic results at top conferences if we expected AI to keep scaling in "intelligence", but damn, we're close to solving human language modeling in limited contexts.
That's still huge. Along with the advances in computer vision and generative art over the last 10 years, the rate of breakthroughs is incredible, but we're also going to be hitting brick walls now and again.
RoBERTa is tuned on HellaSwag, so the comparison means nothing. There's a big difference in the quality of responses between 3.5 and 4, never mind anything before that.
>(that's why they "copped out" and did MoE for ChatGPT)
MoE models significantly outperform their dense counterparts after instruct tuning. https://arxiv.org/abs/2305.14705
Substack just went Medium? Would love to read, but not that much.
The author chooses whether to make it free or not. It’s not like medium.
You can use https://archive.today to work around the paywall.
> "GPT-4 is, technically and scientifically speaking, hardly a breakthrough.
That’s not necessarily bad—GPT-4 is, after all, the best language model in existence—just… somewhat underwhelming."
This is an unreal take. Who cares if it meets some standard of cool internal design - it works amazingly well. All of biology is filled with examples of messy designs that work brilliantly in the real world.
hi, podcaster of this here: https://www.latent.space/p/geohot (https://news.ycombinator.com/item?id=36407269)
I've been criticized variously for not asking for more details and for indulging in GPT-4 rumor. For one, I knew that George was mostly just repeating something he had heard and didn't have direct experience of, so I felt like I couldn't ask for further detail without getting him in trouble. For the other, a friend at OpenAI has dismissed the relevance of this detail (without providing any other specifics, just acknowledging that GPT-5 will be a more substantial jump).
If GPT-4 is really 2 trillion parameters, that implies it uses at least 32 A100-class GPUs (that's with 8-bit weights). Assuming they're each consuming at half their max TDP, that's around 6 kilowatts. It seems to spit out around 15-20 tokens per second, so sampling 1000 tokens probably consumes around 100 watt-hours of power. That's an entire high-end laptop battery, or enough power to move a Tesla half a mile.
OpenAI probably have some magic to batch requests under high load but still, pretty power hungry. A good amount of the $0.06/1k tokens they charge is probably just to cover electricity.
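A quick back-of-envelope version of that math (every number here is an assumption from the rumors above, not a measurement):

    params       = 2e12                 # rumored parameter count
    gpu_mem      = 80e9                 # A100 80GB
    min_gpus     = params / gpu_mem     # 8-bit weights -> 25, hence "at least 32"
    power_w      = 32 * 400 * 0.5       # 32 GPUs at half of a 400 W TDP = 6.4 kW
    secs_per_1k  = 1000 / 17.5          # ~57 s at 15-20 tokens/s
    watt_hours   = power_w * secs_per_1k / 3600   # ~100 Wh per 1k tokens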
If this is indeed the secret sauce behind GPT-4, I would expect it to be replicated in short order, hopefully in open source. Iteration on this more successful architecture will then proceed more rapidly since the architecture will be open for a wide spectrum of researchers rather than closed off within the bounds of one company.
From a black-box perspective GPT-4 is a lot more capable than GPT-3; everything else is just as useful as “CPU specs” to most people.
That’s the secret; no need to pay to read. A mix of different models. Any more detail is just a waste of time.
This whole '8 models not 1' thing is really confusing to me.
My understanding of the MoE paper is that it's 8 FFNN modules attached to a single embedding and attention module. In that sense it is a 1tn+ parameter model, but only a subset of those FFNN parameters is trained on each token / used in inference.
OpenAI made a lot of noise about the power of scaling with raw compute, so I get why people are confused about why they are now in optimization mode, but the amount of disinformation I see about MoE right now is astounding.
I don't understand why anyone is acting like the emperor has no clothes just because the 'program' they had supposed was one thing is actually several modules of things. How is this different from literally any other software project? This isn't Highlander. There can be more than one.
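To put rough numbers on the "1tn+ parameters, but only a subset per token" point (rumored figures only, ignoring the shared embedding/attention parameters for simplicity):

    n_experts  = 8
    per_expert = 220e9                  # rumored size of each expert
    total      = n_experts * per_expert # ~1.76e12: the "1tn+ parameter model"
    active     = 2 * per_expert         # with top-2 routing, ~440B touched per token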
I wanna know what they are doing for inference. Most chatbots are doing rather "greedy" beam inference, but to do a really good job of writing you would write a first draft and then revise it. In the accounts of geohot's revelations that aren't behind a lame paywall, it is said that GPT-4 uses a 16-step process for inference, but I'd like to know what they are really doing.
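Pure speculation, but a draft-then-revise loop could be as simple as this sketch (`llm` is a hypothetical completion function; nothing is actually known about GPT-4's real inference procedure):

    def write(prompt, llm, passes=2):
        draft = llm(prompt)
        for _ in range(passes):
            # feed the draft back in and ask for a revision
            draft = llm(prompt + "\n\nDraft:\n" + draft +
                        "\n\nRevise the draft above, fixing any mistakes:")
        return draft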
This could be of interest: "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts"[1]
Looking at OpenAI's recent introduction of ChatGPT's chat history and sharing features, it's plausible to suggest that the creation of a set of related documents (including chat interactions and documents sourced from plugins) could pave the way for "domain expert" models to participate in a group alongside "frozen" or "foundation" models. These domain expert models could be models that receive document content for inference.
I've conducted experiments where I sent document fragments (sourced from a vector search) in one query and combined it with another query that didn't include the fragments. The results suggest that this approach can occasionally enhance the quality of the outcomes and seems to prevent the model from generating wildly inaccurate responses.
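In outline, that experiment looks something like this (`vector_search` and `llm` are hypothetical stand-ins for a vector store and a completion API):

    def grounded_answer(query, vector_search, llm, k=3):
        fragments = vector_search(query, top_k=k)  # most similar document chunks
        context = "\n---\n".join(fragments)
        return llm("Using only these excerpts:\n" + context +
                   "\n\nAnswer this: " + query)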
Paywalled.
Should we flag paywalled content posted on HN?
Yes, to be honest I think they should be banned. Only post mirrors of the content bypassing the paywall.
I share the same general feeling but then let's flag every paywalled link and not one of two at random. I was really surprised to see this flagged.