Artificial General Intelligence remains a distant dream despite LLM boom
by Brajeshwar
Q: What happens when output from LLMs starts feeding back into training new LLMs?
A: A self-degrading BS feedback loop leading to an eventual collapse of reliability.
It will take as much time and effort to verify output from an LLM as it does to do the research/legwork yourself.
Maybe more because you'll now have to filter out and reject all the plausible sounding BS created by LLMs.
This idea that LLMs will get progressively dumber because of a training feedback loop is silly.
If this ever did become a genuine concern, companies training LLMs would just use preexisting datasets and selectively add additional data that they can verify the quality of.
Not everything is Reddit. I doubt the NYT is going to start publishing BS. Scientific journals like Nature also aren't going to start publishing BS because of LLMs.
There will be plenty of great resources to train future iterations of LLMs, even if the days of just scraping the entire internet for a dataset have passed.
> I doubt the NYT is going to start publishing BS. Scientific journals like Nature also aren't going to start publishing BS because of LLMs.
Will they though? Leaving aside the NYT, which I contend already publishes quite a bit of bull hockey, here's how it goes.
Researcher: Provides outline of paper, GPT turn this into a scientific paper for me.
GPT: Turns it into a paper, but includes several not quite accurate statements, or hallucinates some information and facts, as well as fabricating just one or two citations.
Researcher: Gives it a once-over, not a thorough review, because that would defeat the point of using an LLM to generate the content in the first place. Paper gets submitted.
Reviewer: Receives the paper, throws it into GPT, "summarize this for me", GPT spits out a summary that looks mostly good, the reviewer skims the actual paper, there are no glaring problems, and it gets published.
End result. We now have a paper with fabricated information stated as fact published in a reliable place, that then is cited and referenced, resulting in a degradation of quality of the information. This repeats and repeats and suddenly whole conferences are built on entirely hallucinated information.
This idea that LLMs will get progressively dumber because of a training feedback loop is silly.
It's also a straw man. The concern isn't so much that these systems will move backwards, but rather that they will start saturating and gradually degrade in terms of ROI (measured both in cost/energy and in the complexity of building systems to deal with all the poisoned data).
The commenter above was stretching a bit when they said "collapse", but I think they got the basic gist of the issue right.
Which is basically: diminishing returns.
Yes, companies training LLMs would not let this degradation of quality happen. But that's not the question this point is answering. It's an answer to the thought experiment of why LLMs won't be able to just become smarter and smarter by talking to themselves. The theory is that there is some point past which that will make them dumber rather than smarter, so it won't be done. We don't know for sure yet, but I personally find this theory compelling.
If this ever did become a genuine concern, companies training LLMs would just use preexisting datasets
So they will be cut off from the internet and never consume new data from it?
Is this what they're doing now? I don't think so. Musk is threatening to sue for Twitter posts being used without permission for training purposes.
Seems like an incorrect model for what LLMs are. Do you imagine you can't learn English if you only read false statements?
Of what use is a text generator that only makes false statements?
> This idea that LLMs will get progressively dumber because of a training feedback loop is silly.
What's interesting is that it can get dumber due to consuming its own or other LLMs' output.
We humans read our own output and talk to other humans. Never heard of anyone getting dumber from being social.
This is called "model collapse". It leads to worse performance.
I remember when people used to tell me this about Wikipedia. It's just a bunch of random people on the internet editing each other without knowledge. It'll just create a mess and take longer than just looking it up in a highly curated encyclopedia.
The difference is that Wikipedia has true 'inputs' from the real world. Actual knowledge is being ingested into Wikipedia from physical reality.
Chatbots can only consume what is digitally available. There is an inherent bias in what things are discussed on the internet (you would think ours is a cat-video-obsessed world of opinionated political debaters).
Now most of the content we see on the internet will soon be chatbot-generated content. Chatbots will regurgitate and reconsume their content many times.
Wikipedia doesn't have this issue. It is not a closed-loop internet phenomenon.
Chatbots aren't really closed loop either, especially with reinforcement learning. Though it will quickly turn out to be the case that some fields get a lot more attention than others.
Which also isn't dissimilar from wikipedia.
Eh, Wikipedia is actually a bunch of highly curated fiefdoms.
>Q: What happens when output from LLMs starts feeding back into training new LLMs?
Humans do this all the time. For example, fake news spreading and most people believing in it.
What's the point of constantly comparing this technology to humans?
If I said my calculator gives false numbers you wouldn't reply "humans are sometimes bad at math".
No, but if my LLM were bad at math, I might say "humans are sometimes bad at math". One major missing element in current ML approaches is tool usage and how to decide when to use a tool vs not using one. It is an area that is currently being explored and expanded quickly.
because the whole point of these systems is to take over menial tasks that humans are slow/bad/find boring. Same reason as calculators: they do something humans could do, but faster and more accurately, allowing us to focus our intelligence on higher order problems.
But some humans are skeptical and/or critical thinkers and will get this new information verified or disproven through other methods. Also, fake news spreading seems mainly an issue of media/social media. There is no information spreading mechanism set up yet for LLMs. Furthermore, to LLMs all information looks the same. To humans more emotionally charged information is more prominent.
True but it misses the point.
LLMs will only help fake news spread faster because they're dumb --- they can't tell the difference between real and fake.
They are indiscriminate multipliers of both real and fake data.
One difference being that we have prior education, culture and critical thinking allowing us not to fall for bullshit, even when it is being spread by more sources than the correct information. I'm not sure an AI being trained on the load of garbage that comes from social media today could develop that level of resistance.
You overestimate the average human's ability to critically think for themselves.
I wish people would just stop using the term AGI, because they are not able to use it in a precise way.
It's like they are conflating all aspects of cognition and life, and discounting the reasoning on the basis that GPT-4 doesn't have those other aspects.
No, LLMs do not have many of the cognitive abilities that humans or other animals have. LLMs also do not have a stream of subconscious experience and are not alive.
But yes, a few LLMs like GPT-4 can reason and solve problems in their own way.
It's not reasoning and solving problems. It's a statistical model. It gives the appearance of reasoning and solving problems, but the second you start asking it to solve problems that aren't in its training database it gives you nonsense.
I'm surprised at how little actual testing of an LLM's ability to reason on well-established hard problems in AI (like halting) is happening. Philosophy of mind has produced a number of logic problems that are designed to test the reasoning ability of an intelligence, and instead LLMs get tested on content that is explicitly in their training dataset.
I'm a bit more open-minded about what "reason", or our own understanding of our own "reason", might be.
I gave GPT-4 a bunch of code that had two regex functions to parse a given custom url. I also gave it in the same prompt an example of a url of what I wanted it to parse, and asked it to explain the code and tell me the answer.
It explained the code perfectly in moments and also gave me the correct answer.
Here was its response -> https://pastebin.com/raw/CP9iZh7S
I would use this as an interview question, asking a junior dev to "reason" through the code, explain their thinking and without running it, tell me what they might think the answer would be. If I had to give them feedback for that answer, I would have commended their ability to "reason" about the problem. ¯\_(ツ)_/¯
(extra: The GPT-4 response knew not to bother with the second regex, because the second should only run if the first regex matched a third part. Amazing...)
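For readers without the pastebin handy, here's a hypothetical reconstruction of the shape of that two-stage parser. The patterns, function name, and example URL are my own, not the commenter's actual code or GPT-4's output; it just illustrates "second regex only runs if the first matched a third part":

    import re

    def parse_custom_url(url):
        # First regex: scheme, host, and an optional third "path" part.
        first = re.match(r"^(\w+)://([^/]+)(?:/(.*))?$", url)
        if not first:
            return None
        scheme, host, path = first.groups()
        params = None
        # The second regex only runs if the first one matched a third part,
        # which is the detail GPT-4 reportedly picked up on.
        if path:
            second = re.match(r"^([^?]*)\?(.*)$", path)
            if second:
                path, params = second.groups()
        return {"scheme": scheme, "host": host, "path": path, "params": params}

    print(parse_custom_url("myapp://example.com/items/42?sort=asc"))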
(side note: I can't wait until I can generate live documentation of every file in my code base)
I'm not surprised it can write that regex and explain it, because that's a pretty simple regex and regexes generally are very well documented. Here's something it can't do:
I was writing a python script that took a list of tuples with x elements. Each element in the tuple represented a value in a dataset, so the first dataset would be the list of elements at position 0 in each tuple, the second dataset would be the list of elements at position 1, etc. I wanted to extract each dataset into a list, calculate the derivative of each element in the list, and then reassemble the original list of tuples with the derivatives in place of the original values.
I tried to get ChatGPT to give me a one-liner to do this for _hours_. It would generally get steps one and two correct, splitting the list of tuples of x length into x lists and then calculating the derivative of each element in the list, but inevitably at the last step it would do something screwy. Most commonly it would reassemble a list of tuples, except the elements of each tuple would be from the indexes in the _second_ list. So I would get a list of all the derivatives at position one in every dataset. Even if I explained what was wrong with the output, it would apologize and then blithely give me the exact same answer! Why did it do this? Because it doesn't actually understand the prompt, and the answer to my question is not in its dataset in a way that can be statistically derived.
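For reference, a minimal sketch of the transformation being described, as I read it. This is my reconstruction, not the one-liner the commenter was after, and numpy.gradient stands in for whatever "derivative" meant in the original script:

    import numpy as np

    # List of tuples; position i across all tuples forms dataset i.
    rows = [(1.0, 10.0, 100.0), (2.0, 12.0, 90.0), (4.0, 15.0, 70.0)]

    # Transpose into datasets, differentiate each, transpose back into tuples.
    # numpy.gradient is an assumed stand-in for "derivative".
    result = [tuple(d) for d in zip(*(np.gradient(col) for col in zip(*rows)))]
    print(result)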
3.5 or 4?
And did you also try the problem with Co-Pilot?
I've tried it with both, and it makes the same error. Copilot doesn't even get close. This is just one example of many. Beyond a certain degree of complexity, for problems that aren't documented in some tutorial or answer somewhere in its training dataset, ChatGPT is not going to be able to help you. I've run up hard against its limits many times, and I feel like most of the examples people provide are actually a lot simpler than they believe them to be.
I'd be curious to take a stab at your problem. I could arguably say it is simpler, as it is just purely documented mathematics.
And/or you could use a prompt injection strategy to make sure it does have the available context to solve your problem.
e.g. a vector search for some text book snippet that gets injected in
BTW have you asked it about " davidjl"? Maybe ask it about SolidGoldMagikarp too.
> but the second you start asking it to solve problems that aren't in its training database it gives you nonsense
I would disagree.
I use it for code and it comes up with unique solutions to my situations that certainly have not been seen before in its training dataset.
When you inform it you are using a 3rd party UI component library, a 3rd party CSS library, and you need to alter the layout based on the current complex state of an Angular system, and it can provide HTML, CSS, and TypeScript code that works and is based on this unique arrangement in this system, I'd say that's reasoning on some level.
It's not copying and pasting from StackOverflow. It has to "understand" the scenario and implement it using my guidelines.
It will even take into account things like linting rules ("Never use the 'any' type") and it can do that too.
It is quite literally copying and pasting from StackOverflow, and any other code that it has been trained on. I think you're underestimating exactly how much code there is out there, and the degree to which most of it is boilerplate.
Here's an example of what it can't do. I was recently messing around with the Plaid API, building sqlalchemy models for the Plaid API objects and defining relationships between them. After about 45 minutes of playing around, I got about 80% of the way to what I created without ChatGPT. It saved me a bunch of time googling around for documentation, but it gave me an absolute garbage password encryption implementation that was literally copied out of an intro tutorial I had read elsewhere, didn't even seem to recognize that there is a Password field type in sqlalchemy_utils, and omitted a bunch of key aspects of the model relationships. It also used field arguments that were either deprecated or not best practice for what I wanted to do. I also had to ask it to model every object individually, even though I specifically asked it to model all of the fields necessary to implement transactions.
Now I could go and ask it to specifically modify those things, but it doesn't "know" the best way of building this set of models. It gives me the statistically determined output of what it has consumed from all of the answered questions on stackexchange and tutorials it has been fed, and provides a halfway decent starting point, but to get from there to a working application requires actually understanding why you might build a model one way rather than another. You need to understand exactly what behavior you want to see when you delete a user, and what needs to happen to all of the related objects. You need to know that id columns should be UUIDs and not integers or strings, which is something you commonly see in tutorials but you would (hopefully) never see in a production application. All of this knowledge is the result of years of experience building applications and knowing not only what to do, but why you do it and in what context. ChatGPT is a very useful reference library. It blows google searching for examples out of the water, but you need to be very careful with the results, and very knowledgeable to use them properly.
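To make the UUID-vs-integer point concrete, here's a minimal sketch of the id-column pattern being described. This is not the actual Plaid/transaction model, just the shape of it, and it assumes PostgreSQL plus SQLAlchemy 1.4+:

    import uuid
    from sqlalchemy import Column, String
    from sqlalchemy.dialects.postgresql import UUID
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        # UUID primary key rather than the auto-incrementing integer id
        # you see in most tutorials.
        id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
        email = Column(String, nullable=False, unique=True)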
> It is quite literally copying and pasting from StackOverflow
Not necessarily.
They've already said that emergent properties, appearing as the models have scaled beyond certain thresholds, are responsible for some of these abilities. The ability to code as well as some of them can has been included in this.
> ChatGPT is a very useful reference library. It blows google searching for examples out of the water, but you need to be very careful with the results, and very knowledgeable to use them properly.
I agree that you need to know what you're doing to make use of them. I've been a SWE for over two decades, so I certainly know what to feed them (and what not to - such as proprietary information) to get a coherent answer. Yes, they need a lot of context on a given issue; without that the model can only guess at what you are looking for (variable names, UUIDs for primary keys, schemas, interface definitions, 3rd party components, etc). And they will occasionally suggest using deprecated methods or not the latest suggested approach, due to the cutoff date.
However, I stand by the opinion that if given the right inputs, they are capable of unique solutions to complex issues. The models are seemingly capable of taking points A and B from one StackOverflow post, combining them with C from the language documentation and D from a third-party vendor's site, and assembling them into a coherent answer.
Yes, the language model is certainly capable of assembling text from different sources and getting an answer that tends to be mostly coherent, but that is exactly what LLMs are designed to do. To the LLM it doesn't actually matter that all of that content is from different sources; all that matters are the statistical relationships it has derived for the various tokens in your prompt. But those are not unique solutions, they are existing solutions that it has cobbled together. It's been demonstrated repeatedly that when prompted for information that ChatGPT has not been trained on, it fails badly. It also fails at a lot of higher-order writing analysis prompts, like those found on English literature essay exams.
I've been a software engineer for almost two decades, and I actually use ChatGPT daily. It's a very helpful reference, but fails early and often on a lot of tasks. I generally compare it to a tool like create-react-app, except generalized across all languages and frameworks. A very powerful tool, but it's a statistical machine learning algorithm, not an AI. It doesn't understand your prompts and it isn't reasoning about them.
> I use it for code and it comes up with unique solutions to my situations that certainly have not been seen before in its training dataset.
How do you know something similar isn't in its dataset?
I think we humans are a lot less original than we believe.
But what's the difference between reasoning etc. and giving the appearance of reasoning?
This is a key question. With the latest advanced LLMs we've created something that exceeds the scope of our pre-existing conceptual framework, leading us into Searle's 'Chinese Room', which philosophers of mind have been debating for decades.
What's funny is that ChatGPT is quite literally a Chinese room.
> It's not reasoning and solving problems. It's a statistical model.
One does not exclude the other.
"you are not reasoning you are just neurons firing in a big network" "its not reasoning its just flipping bits" "its cant be alive its just nand gates"
You brain is largely just a statistical model.
This general comment of "it can't handle things outside its training database" is very common and also entirely wrong IMO.
Take the following question which I just wrote and I'm fairly certain hasn't been asked before:
> How many giraffes would it take to fly a plane in an imaginary world, assuming that the plane was designed to be flown by a giraffe. Assume that giraffe brains can be attached together to create a human-like intelligence assuming enough giraffes are connected, and assume that there are 500 passengers that will need to be served food (e.g. as per airline staff). Assume that the full airport is also staffed by giraffes.
> Use logic and reasoning to estimate the number of giraffes that would be required, and detail specific practical considerations for the design of the aircraft.
That's not been asked before - I just made it up. GPT-4 gives as good an answer to this as I can come up with:
> This is an interesting thought experiment. It's important to note that it's highly speculative and relies on numerous assumptions about the cognitive capabilities of giraffes, their physical abilities, and the mechanics of designing and operating an aircraft specifically for them. However, assuming we are in this hypothetical scenario, let's break down some of the elements:
> Flying the Plane: Normally, a commercial aircraft is operated by two pilots. The main brain functions involved in flying a plane include spatial awareness, understanding complex systems, decision making, problem-solving, and communication. The human brain has approximately 86 billion neurons. In contrast, a giraffe's brain has approximately 1.2 billion neurons. Therefore, to achieve the complexity and cognitive capacity of one human pilot, approximately 72 giraffes would be needed (86 divided by 1.2). Since there are typically two pilots on a plane, this gives us a rough estimate of 144 giraffes.
> Cabin Crew: Flight attendants serve multiple purposes including ensuring the safety of passengers and providing service such as meals. The average ratio of flight attendants to passengers on long-haul flights is around 1:50. Therefore, for 500 passengers, we would require about 10 human-equivalent flight attendants. Assuming the same ratio of giraffe brains to human brains, this would mean we need about 720 giraffes (72 giraffes equivalent to 1 human multiplied by 10).
> Airport Staff: This could include a wide range of roles such as baggage handlers, check-in staff, security personnel, air traffic controllers, etc. It's difficult to estimate the exact number as it can vary widely depending on the size and busyness of the airport. However, as an approximation, let's say a small airport requires a staff of 100 human equivalents. This would mean approximately 7200 giraffes.
> Design of the Aircraft and Airport: The design of the aircraft and airport would need to be significantly altered to accommodate giraffes, considering their size, physical abilities, and requirements. The size of the cockpit and cabin would need to increase substantially to allow for multiple giraffes. The controls would also need to be adapted to be usable by giraffes, likely involving some sort of body harness or direct neural interface given giraffes lack opposable thumbs or similar gripping ability. The height of the plane and airport infrastructure would also need to be increased due to the giraffes' tall stature. These changes would certainly increase the weight and fuel requirements of the plane.
> Adding the numbers up, we get a rough total of about 8064 giraffes to run this operation (144 for pilots, 720 for cabin crew, and 7200 for airport staff). Please note, however, that these are extremely rough estimates and in reality, the number could be significantly different based on the many factors and variables involved in such a hypothetical situation.
> In this scenario, the number of giraffes needed is estimated based on an equivalence of the number of neurons in giraffe brains to those in human brains. However, it's critical to remember that this is a simplified way of comparing cognitive abilities. In reality, intelligence and cognitive function are not solely determined by the number of neurons, and they certainly don't scale linearly. The organization and connections within the brain, the type of neurons, and the role of other brain cells are also critical. This serves as an imaginative exploration rather than a scientifically or practically accurate prediction.
I mean, it's not perfect but it's clearly formulating something there.
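As an aside, the arithmetic in that answer is at least internally consistent; here's a quick check using the same assumed neuron counts (86B human, 1.2B giraffe), which says nothing about whether the neuron-count comparison itself is meaningful:

    import math

    giraffes_per_human = math.ceil(86 / 1.2)   # ~71.7, rounded up to 72
    pilots     = giraffes_per_human * 2        # two pilots            -> 144
    cabin_crew = giraffes_per_human * 10       # 500 passengers / 50   -> 720
    airport    = giraffes_per_human * 100      # ~100 human staff      -> 7200
    print(pilots, cabin_crew, airport, pilots + cabin_crew + airport)  # 144 720 7200 8064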
I'm going to bet that if you do some searching, you can turn up some text on the internet where someone was asking "how many of x animal would it take to y human task", and answered the question using a neuron comparison. It's not inventing this approach from scratch (because it can't). Every word in that answer is there because statistically, based on the text in its database, it was the most likely word to occur in that context. Hell, look at the way it chose to describe flight attendants; you can see it grabbed that directly from some snippet somewhere.
I think a lot of the misconceptions about these LLMs (which I actually think are pretty cool and useful) stem from not really understanding what they are doing and how they arrive at their answers. This is pretty well illustrated by the fact that ChatGPT only does well on tests where the content is well-represented in its training dataset, and poorly on tests otherwise. But you have to keep in mind that the training dataset is enormous, and likely includes every random forum or reddit post ever made, so you shouldn't be surprised that it can formulate an answer like this.
> I think a lot of the misconceptions about these LLMs (which I actually think are pretty cool and useful) stem from not really understanding what they are doing and how they arrive at their answers.
I think the misconception is actually the opposite: focussing too much on the method rather than the output. Bread is not just yeasty wheat; if you mix yeast and wheat together and apply some heat you get something a little surprising. I think the same is true for LLMs: in the task of training a next-word guesser there are emergent capabilities which end up going beyond what people expect when looking at the method.
Yes, it is designed to statistically guess the next word, but in order to do that the LLM has (surprisingly) gained internal representations of things like 'what is a giraffe'. In order to do the task of 'what is the next word' most effectively, it has had to understand/learn about the world.
Can you provide an example where you think GPT-4 performs badly that proves your point? (i.e. something not in the training corpus that I could probably answer but a leading LLM could not)
What exactly in that reply makes you think that GPT-4 has an internal representation of "what is a giraffe"? You basically fed it the context in which to answer the question in your prompt, by asking it to consider a world where giraffe brains could be networked. The creative bit came from the human; the LLM's contribution is a recitation of factual information that is present in the training data.
Jesting Pilate's question is one, for a start. "Quid est veritas" can only be answered by one who sees.
That’s not a specific example question I can answer… and if it is I don’t understand it?
If the question is just "what is truth?" then that's something that's pretty heavily in the training corpus, and also something an LLM can pretty easily define.
Does it define it correctly? Can it tell? Can you?
Well GPT responds by stating that the concept of truth is a complex and ambiguous philosophical concept, and that there are multiple definitions depending on the context (e.g. mathematical truth is different to scientific truth which is different to pragmatic/constructionist philosophical theories of truth etc).
It provides a much better, more balanced and more refined definition of truth than I can manage...
"Can it tell?" implies sentience - which I don't believe is required for AGI.
> I wish people would just stop using the term AGI, because they are not able to use it in a precise way.
But we can't define AGI in a precise way because we don't know exactly what it is. If we did know exactly what we meant by AGI then I suspect we'd be able to reproduce it.
AGI is a fuzzy concept and I don't think that's a problem, because being able to talk about something in a fuzzy way is better than nothing.
AGI is a term like life. We have no rigorous definition that we all agree on but we probably would know it when we see it.
You are right - AGI does not need consciousness, and consciousness might not require general intelligence (a mouse is conscious but GPT-4 will beat a mouse at most 'general' intelligence tests).
> LLMs also do not have a stream of subconscious experience
I think it is a stretch to claim this without evidence.
LLMs that are not being trained on the inputs do not have long-term memory though.
It is non-obvious to me that GPT-4 is reasoning in any way. It seems to just be a huge Markov chain.
> Enter Yale School of Management economics professor Jason Abaluck, who in May took to Twitter to proclaim: "If you don't agree that AGI is coming soon, you need to explain why your views are more informed than expert AI researchers."
So, aggressive, accusatory, appealing to authority, likely cherry-picking and (it seems) misunderstanding those same authorities, all for the purpose of brow-beating strangers. I appreciate that the author of the linked article wrote a reasonable response to this; my own would have been three words starting with "go" and ending in "yourself".
It's far more hopeful than it's been before too. We're asking ourselves what the consequences will be, which we have the time to do. But where do you go with those questions? I think the right answer is don't freak out, but plan what you can.
Are we more hopeful than people were in the 1960s?
More knowledgeable, with more to work with.
Despite very real and impressive progress with LLMs towards general conversation agents, there has been near-zero progress in understanding and recreating the consciousness and agency which even insects possess.
I want to throw something out here:
Insects, us, animals: all have two things in common, and thinking about them may prove very useful in this context:
(1) - Ongoing stream of what I will call input experience. What we see, feel, hear, etc...
(2) - Internal input experience sufficient for the entity to operate and, as part of that operation, differentiate itself from everything else.
Number one is easy.
Number two may be important to understand where consciousness comes from. And yes, I am saying consciousness may be an emergent thing that will emerge on some level given the entity is sophisticated enough to understand itself.
Think about your own body. We have nerves and can and do get input about how that body is interacting with literally everything else!
From that comes a very basic "self"
ChatGPT is something in a box! It has no real senses. It cannot know its own self. There simply is no context, no data it can use to understand.
And it has no real memory. Say the sense data was there. It would have to be prompted in every time.
Here is what I am getting at:
Say we give ChatGPT the robust and complex sensations needed for it to differentiate itself from everything else?
And then we put it on a loop so it can take input from us in that context. And it can remember.
Suddenly, asking it a question interrupts its stream of experiences!
And see where that could lead?
Consciousness (another term that is too vague) sure. LLMs do not have a high-bandwidth stream of subjective experience at all similar to animals.
But agency? You just give the LLM an open-ended goal and access to manipulate the environment and run in a loop. Many examples of this.
Of course it's not the same type of agency that animals have, it doesn't have self-preservation instincts etc., but there are enough movies and TV shows demonstrating why that may be a pretty bad idea anyway.
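A minimal sketch of that goal-plus-loop pattern; call_llm and run_tool here are hypothetical placeholders, not any particular framework's API:

    def call_llm(prompt):
        # Hypothetical placeholder for a real model call.
        raise NotImplementedError

    def run_tool(action):
        # Hypothetical placeholder for executing an action in the environment.
        raise NotImplementedError

    def agent_loop(goal, max_steps=10):
        history = ["Goal: " + goal]
        for _ in range(max_steps):
            action = call_llm("\n".join(history))      # ask the model for its next step
            if action.startswith("DONE:"):
                return action[len("DONE:"):].strip()   # the model declares the goal met
            observation = run_tool(action)             # act on the environment
            history.append("Action: " + action + "\nObservation: " + observation)
        return "Stopped at step limit."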
Consciousness has not been objectively proven. As far as we know it could just be something philosophers made up.
For the sake of discussion, what do you __mean__ by consciousness?
That's the point. Nobody even has a good enough definition of consciousness and intelligence.
I see GPT as a primitive AGI. Why not?
I guess everyone actually means ASI when they say AGI?
Yeah. Also, in my estimation, AI doing some tasks for cheaper than humans is the real margin/frontier of disruption here. That frontier is only going to expand from here unless we have a particularly bad economic collapse…
The need to appear contrarian, dismissing concerns and stroking the normalcy bias of the population, is strong. Denial is the noun that describes these articles.