I'm not sure people in these comments are reading this paper correctly.
This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-Thought and LLM-Debate.
Because this paper introduces their alternative method which simply runs the same query multiple times on the same LLM, without any context shared across queries. And then they run a similarity algorithm on the answers and pick the most common answer. (Which makes sense to me: if an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will be similar to each other, while the hallucinations will hopefully be chaotic.)
And this simple algorithm performs just as well as (and sometimes better than) all the other multi-agent algorithms.
This suggests that the other multi-agent schemes with their clever prompts aren't really doing anything special: their improved results come mostly from the fact that the LLM is run multiple times, not from the prompts asking the LLM to pick the best answer.
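The sampling-and-voting loop is simple to sketch. Here's a minimal, hypothetical version (the `toy_llm` stand-in and exact-match voting are my simplifications; the paper scores free-form answers by similarity rather than exact match):

```python
import random
from collections import Counter

def sample_and_vote(ask_llm, query, n=10):
    """Run the same query n times with no shared context between
    calls, then return the most common answer (exact-match voting)."""
    answers = [ask_llm(query) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Hypothetical stand-in for an LLM: correct 60% of the time,
# scattered hallucinations the rest of the time.
def toy_llm(prompt):
    return "42" if random.random() < 0.6 else random.choice(["17", "99", "3"])

random.seed(0)
# The correct answers cluster on one string, so the vote almost
# always recovers "42" even though 40% of individual runs are wrong.
print(sample_and_vote(toy_llm, "What is 6*7?", n=25))
```

The similarity step matters for free-form text, where two phrasings of the same correct answer should count as the same vote; exact matching is only enough for short answers like these.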
>> Because this paper introduces their alternative method which simply runs the same query multiple times on the same LLM, without any context shared across queries. And then they run a similarity algorithm on the answers and pick the most common answer.
Years ago, weather simulations started tweaking input params and running their models over and over, discarding outliers and taking averages. It works pretty well.
Because LLMs mostly sample their output with some randomness (controlled by temperature), feeding them the same input and averaging the outputs is going to get you a better guess.
Lorenz also gives some clues (if not an outright explanation) as to why the "hallucination" problem is likely unsolvable.
If you buy into this line of thinking then it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI. Simulating isn't emulating... an LLM is as likely to become intelligent as a forecast is to control the weather.
> it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI.
On the contrary, sit and listen in a college cafeteria, and it quickly becomes apparent most conversation participants are LLMs.*
> Simulating isnt emulating...
These are not synonyms, true.
> an LLM is as likely to become intelligent as a forecast is to control the weather.
I don't see uncertainty of intelligence as a property of an LLM as being equivalent to certainty of weather control as an effect of a forecast.
Among other things, whether weather was controlled would tend to be agreed by all observers, while it's often unclear if intelligence is being observed in these threads. :-)
---
* While my last line was a joke, humans in LLM mode was not. We can drive on autopilot, and get where we need to go while not being able to remember how we got there. We definitely converse on autopilot, indistinguishably from LLMs talking to each other, after an opening line every word of every sentence in the entire exchange perfectly predictable to a stranger. Are the speakers intelligent? What about the stranger who knows what they will say next? To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent.
LLMs were specifically trained to emulate human interaction patterns. Of course we sound like them at times. It's the things we can do that they can't that are relevant.
If I study Einstein and learn to do a really good impression, the statement "Einstein often sounds like karmacondon" will be true. That does not make me Einstein.
>>> I don't see uncertainty of intelligence as a property of an LLM as being equivalent to certainty of weather control as an effect of a forecast.
GTA 5 is a simulation. Do you expect to be arrested outside your front door for the car you stole in-game?
Weather forecasting is a simulation: it tells you what the weather will look like in the next few days. It gets better as we get more sensors, collect more data, and build more accurate models based on those two factors. It will never make the leap to being the weather.
Language forecasting (because this is what an LLM is) is a simulation. It tells you what the next token (word) will be based on what came before it. It gets better as we collect more data and hone and refine these models. It will never make the leap to intelligence.
>> To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent.
To say that LLMs are intelligent means that language is a requirement for intelligence. That's some fairly magical thinking... but any sufficiently advanced technology...
Intelligence breaks the pattern here. A simulated intelligence is intelligent, just as simulated math is math and simulated computers are computers. The point of contention shouldn't be whether LLMs are intelligences or simulated intelligences, but whether they're simulating something else.
I think a challenge with the simulated-is-real math/calculator argument is that the simulation operates syntactically, through derivation, without meaning.
E.g. a simulation of ZF set theory cannot tell you the truth value of the Axiom of Choice - because it’s independent of the ZF axioms (it is undecidable in the Gödel incompleteness sense).
But “Although originally controversial, the axiom of choice is now used without reservation by most mathematicians” [1] - I guess its truth is self-evident semantically.
So because of incompleteness, simulated math/calc will always be “missing” something.
Of course an LLM will happily say the Axiom of Choice is true (or not), but is it just parroting from the dataset or hallucinating?
Not sure if it counts, but there is a police chase video online somewhere with a guy on drugs who claims he thought he was playing GTA. The way he throws people out of their vehicles and crashes their cars suggests he wasn't lying.
> Language forecasting (because this is what an LLM is) is a simulation. It tells you what the next token (word) will be based on what came before it. It gets better as we collect more data and hone and refine these models. It will never make the leap to intelligence.
Due to quantum theory and chaos theory it is impossible to simulate any system to 100%. Yet this does not mean it is impossible to design intelligent systems which are indistinguishable from their 'real' counterparts. Maybe we are at the level where a fly can be simulated accurately enough to make the distinction moot; maybe we have enough compute to simulate a mouse. We will get to a point where we can simulate a human brain. It will be indistinguishable from intelligence. I don't think the methodology really matters. In the end everything is compute.
> To say that LLMs are intelligent means that language is a requirement for intelligence. That's some fairly magical thinking... but any sufficiently advanced technology...
When I was a kid, it was the definition of intelligence that separated humans from animals.
And there's a reason "dumb" means "mute" and independently "stupid".
It may well be an incorrect requirement. It may be a single form of intelligence out of many which happen to correlate in humans, but not in minds created by artifice.
Why is it so important to you that everyone recognizes this intelligence? What is at stake in your mind here?
This impulse towards reductivism/behaviorism in order to defend the LLMs is still profoundly interesting. It always ends up feeling like the person wants to be like an LLM, not the other way around. I think people feel lost in a deep way, and this line of thought becomes deeply comforting.
Like, so many people it seems want the future and themselves to become comprehensible all at once. "Why worry so much about myself? I'm just a stochastic parrot like an LLM anyway... Attention is all I need!"
I get it, life is hard. But we need to keep the dream alive. You gotta hope for better.
All this makes the future sound so dull. Like I am gonna wake up one day and all pizza will be shitty, tasteless pizza, but everyone will tell me: "Well, really look at it, it has cheese, sauce, toppings... It's pizza! You can eat it."
> We definitely converse on autopilot, indistinguishably from LLMs talking to each other, after an opening line every word of every sentence in the entire exchange perfectly predictable to a stranger
Some people report speaking like this: opening their mouths and not knowing how the sentence will end.
I don't experience that, I think.
Possibly used to? I have in the past had some autonomous verbal responses, for a bit this included echoing greetings — great when it's "hello", embarrassing when it's "happy birthday".
> To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent
Kinda; System 1 vs. System 2 — the best LLMs do better than most people's System 1, worse than most people's System 2. (Bat and ball, $1.10.)
> LLMs are more or less a dead end when it comes to AGI.
I don't think many people believe that LLMs are a way to AGI (whatever that actually means). But LLMs can still have many valid uses even if their prospects are limited in scope.
There are plenty of people - technical and non-technical - who seem to be acting like AGI is right around the corner thanks to LLMs, and who are, more broadly, vastly overstating the current capabilities of LLMs. I’m observing this in real life as much as on the internet. There are two very distinct groups of people that stand out to me: (1) High level execs with vested interests around AI and (2) Managers who haven’t even bothered to create an OpenAI account and are asking their subordinates to use ChatGPT for them, in what is an unforeseen usage of LLMs: by human proxy.
I think you are missing a step. A lot of people believe AI will advance so much that it will be indistinguishable from the best possible human reasoning. The evolution of LLMs just gives us a clue about the speed of improvement of AI. That does not mean that LLMs, which are one form of AI, will become AGI. It is just one path that AI is following. It will probably become a subset of something more advanced.
The argument boils down to the idea that language isn't simply strings of words or bits of factual information, but an actual encoding of logic. By training statistical models on vast amounts of logic, we've given them a generalizable ability to perform logic. A sufficiently advanced LLM could thus potentially fulfill some definition of AGI.
To be clear, this doesn't in any way imply that LLMs could ever fit the definition of artificial consciousness, which would be a completely different form of strong AI. They're effectively just mathematical functions (albeit extremely complicated ones), which simply take inputs and return outputs without any intervening subjective experience. Even if they can perform a complicated task, retrieve and effectively summarize complicated information, or say all the right things as a conversational partner, they have no concept of the meaning of their output.
Maybe that limitation in itself puts a ceiling on their potential. Maybe the best possible LLM can only ever be 99.99% effective, and that 0.01% of the time it will go completely off the rails and disregard its instructions or hallucinate something ridiculous. Maybe the only way to overcome that is by keeping a human or a true artificial consciousness in the loop, in which case LLMs would still be extremely useful, but a flawed AGI if "AGI" at all. Or maybe a sufficiently advanced LLM and/or a sufficiently advanced error correction architecture will actually be enough to mitigate those issues.
I don't have a strong opinion on where LLMs are ultimately headed, but I'm looking forward to seeing how it all unfolds. It's amazing how capabilities that were strictly in the realm of sci-fi so quickly became mundane.
LLMs are definitely here to stay. Even if they don't turn out to be the road to AGI, they can be used by all sorts of sub-AGI agents as a "language centre". An encoder can be used to extract meaning from input, and an autoregressive decoder conditioned on the agent's internal state can be used to keep a conversation going. What's not clear at all is whether the traditional transformer architecture will endure.
> They're effectively just mathematical functions (albeit extremely complicated ones), which simply take inputs and return outputs without any intervening subjective experience.
So are human brains, which are subject to the laws of physics, and which work just as mechanistically as any computer.
Unless you hold a dualist view that the brain accesses a spiritual realm outside of the physical world, then the fact that a computer operates mechanistically does not mean that it lacks consciousness.
The process of a human responding to a prompt isn't the same process an LLM follows. It involves subjectively experiencing being asked the question, having feelings about the question, possibly visualizing something related to the question, possibly reflecting on memories, wondering about how possible answers might be received and affect their future reputation, expressing their answer with a range of different emotions, and so on.
There may be aspects of the brain that behave like statistical models, but the broader system seems more complex than that. I don't see that as in any way inherently spiritual. I expect that it could be artificially reproduced one way or another, but would be extremely complicated.
> The process of a human responding to a prompt isn't the same process an LLM follows.
It's not the same process, but it is a deterministic function, which was one of your objections to LLMs. Humans operate according to physical laws, after all.
>> If you buy into this line of thinking then it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI. Simulating isn't emulating... an LLM is as likely to become intelligent as a forecast is to control the weather
Up until this point, I agree.
This puts humans on too high a pedestal: LLMs aren't magic, and we're not magic either.
(There's other reasons for me to think Transformers aren't the answer, but not this kind of reasoning).
Even if from a technical perspective you're right, I think people need to be careful with the "x is not special" talk. It is a put down and it's how things like human and animal rights get obliterated and how the environment gets ruined.
"Trees aren't special", "Dolphins aren't special", "Koalas suck, let's put a mine here instead", "Pigs don't have emotions or are dumb, so it's fine to factory farm", etc.
I don't get the argument. I don't think something being magic will stop humans from exploiting it. At the end of the day, intelligent people are great at coming up with excuses as to why they should do something bad: "Just chop that one tree down, it's in the wrong place anyway", "Just kill that one dolphin, it's old anyway". Taken together, these add up to bad outcomes we dislike. Much better to discourage / fine / ban all tree chopping and dolphin killing and let select professionals remove sick trees and dolphins.
Indeed. But I said "X is not magic", rather than "X is not special" — until we have an answer to the hard problem of consciousness (or agree which of the 40 definitions of the word "consciousness" we're using when discussing if an AI has it), we can't possibly determine if an LLM has it or not.
(My gut feeling says "LLMs are not conscious", but my gut has had a lot of false beliefs over the years as well as correct ones, so I give it a corresponding level of trust).
Fair enough then. I sort of use the terms interchangeably in this context.
When you think about it, a bird is “magic” in the sense that there is a whole universe and ecosystem giving that bird the platform for existence. A real living bird isn’t just a concept.
So sometimes I wonder if we just say we’re insignificant because it’s a simpler way to think. It makes the idea of death and loss easier to bear.
If I tell myself I’m just a speck of dust and that I’m not special, it can be quite comforting.
Conceptually we understand things about how birds work, but the fact that there is a blob of millions or billions of cells functioning to produce a bird, which can fly, completely autonomously, is quite peculiar. There is a type of magic or wonder to it all, which makes me think birds are both special and magic if you think differently about existence and not just about the intellectual concept of a bird.
My gut feeling is that consciousness isn’t as deep and mysterious as people think it is. It’s possible that consciousness is an inevitable result of putting a sufficiently intelligent mind into a body and, as a result, the mind can’t help but weave a story about itself that connects events together.
Similarly with other properties of intelligence and the brain that we like to think are mysterious and deep.
The weather isn’t magic either. It’s produced by physical mechanisms. But everyone would probably agree that a model simulating some rough aggregate of those mechanisms isn’t “weather” itself.
On the other hand. Take that weather model and render its output into a stereoscopic 3D world with photorealistic particle systems and whatever. To someone wearing a Vision Pro or similar high-def VR headset, the model is now “the weather” in the system their senses occupy. It’s missing a lot of actual sensory cues — the rain isn’t wet, the wind won’t chill your skin, and so on. But it’s close enough for some convincing applications. A caveman with no experience with technology would undoubtedly believe himself transported into a different world with real weather.
LLMs are a bit like that now. Their simulation abilities took such a sudden leap, we’re like cavemen wearing headsets.
The only way I can model what you're trying to say, is if I assume you think "the mind" is a separate kind of substance, and not merely information processing that just happens to be implemented on biological electrochemistry in our skulls.
A (philosophical) dualist can easily say that no computation is ever intelligent. I don't think this can ever be said by a (philosophical) materialist.
We pretty much are compared to present-day neural architectures. How many simulated neurons and synapses are in the largest architectures, and how do those numbers compare to humans?
Unknown for the actual largest due to secrecy; 1% for the largest public models… but also organic ones are definitely a bit different from digital ones, and the jury is still out if those differences matter and if so by how much.
The comparison would therefore be with a mid-sized rodent, horse, or raven rather than a human.
(But even that's misleading, because the LLM doesn't have to use tokens to represent "contract left supracoracoideus" and "lay egg").
Edit: also, I've not heard much suggestion that anyone knows how certain genes do things like giving humans the inherent capability to recognise and create smiles or other similar reflexes, so we don't really know how much of our brains are pre-trained by evolution; furthermore, I think organic life is more sample-efficient at learning than any AI so far.
Tokens aren't a necessary differentiator here. There is no fundamental technical reason why tokenization is used, it just has certain practical advantages. And the distinction almost disappears when we look at multimodal transformers, which process images, audio, and video broken apart into sequences of blocks of binary data.
There's no reason for any specific tokenisation, but the Transformer always has some tokenisation.
Tokens are allowed to be blocks of pixels, for example. No reason we couldn't have a token be a specific muscle or sensory nerve.
What I'm saying is that Large Language Models don't have a body, so no nerves and muscles to have to be represented within them; conversely, organic life does have those things and thus organic brains must spend some of their complexity on those things.
This means they have the possibility to equal us for language even with no capacity for vision, walking, tying shoelaces, or playing catch.
The attention mechanism is in practice implemented using three linear layers. The matrix multiplication to average the output and to implement the masking is the only non-neuronal part of that computation, but it can be seen as an activation function.
Usually, linear perceptrons and ReLUs or GeLUs are used. Due to the enormous compute requirements to evaluate models of interesting size, other types of neuronal networks and activation functions have received very little attention (pun intended) so far.
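As a concrete sketch of that structure (single head, illustrative toy sizes, NumPy; this is the generic textbook formulation, not the code of any particular model): the three linear layers produce Q, K, and V, and the masked softmax over their dot products is the only part that isn't a plain linear map.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=True):
    """Single-head self-attention: three linear projections,
    scaled dot-product scores, causal mask, and a softmax-weighted
    average of the value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:  # hide future positions from each token
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 4, 8  # toy sequence length and model width
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one mixed value vector per position
```

With the causal mask, position 0 can only attend to itself, so its output is exactly its own value vector `(X @ Wv)[0]`, which is a handy sanity check on the masking.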
Using ReLU instead of sigmoid is a significant departure with regards to how closely it models actual neurons.
Using non fully connected layers is as well. Our brains likely aren’t fully connected, but the connections that matter are made stronger through living life and learning.
If you squint, it’s kind of like training a dense series of linear layers, but that’s not what we’re doing anymore (for the better)
Comparing NNs to organic brains is an apples to oranges comparison, is what I’m saying.
Lack of adaption is mainly a feature, we choose not to train them in real-time and instead make available fixed models with repeatable behaviour. We could, if we wanted to, update the model weights continuously in response to feedback.
I think the biggest difference is that they need far more examples than we need, to learn anything.
Except that a weather forecasting model can't experiment on the weather, but an LLM system may be designed to be able to perform experiments and take feedback?
Perhaps I'm up too late, but I can't think what else is there to cooperation besides two or more agents doing things in alignment with some goal? (Regardless of who or what sets that goal).
Also I don't know what you mean by "conceptualization".
It's fuzzy because intelligence is relative, right?
I mean "being able to conceive an idea". As humans, two or more of us can reason our way to a conclusion without domain knowledge. There is an upper limit where the idea is incomplete (assuming respectful ignorance), but it's generative nonetheless.
With an LLM I have to prompt engineer to guide it. I would rather have it generate novel concepts to push domain boundaries. They work great as knowledge bases though.
> As humans, two or more of us can reason our way to a conclusion without domain knowledge
That sounds like step-by-step thinking?
> With an LLM I have to prompt engineer to guide it.
I generally have to in humans, too. I mean, you and I are prompting each other, aren't we?
For me the difference between prompting a human and prompting an AI is that I can reset the AI, I can't make a human forget a previous analogy that had only confused them. (And likewise, I don't expect that I fully forget bad analogies which confuse me, though I do try).
> They work great as knowledge bases though.
IMO, that's their weakest part. We had knowledge bases before — where each claim can be easily localised within the model, corrected when it needs to be, verified in advance, and which give predictable output — LLMs are none of those things.
LLMs are much better at understanding the question (constant time for a fixed-length output, even when the query is phrased badly and relatively complex), and being able to synthesise things in the form of "${x} won't work, try ${y}".
Huh. Do you think integrating the Semantic Web metadata and ontologies in LLM training can help us bootstrap conceptual modeling using natural language?
I would say an LLM is more intelligent than at least some people I know. And in the domain of programming, most people I know. Simply by the fact that most people don't know programming.
LLMs are idiot savants that can do a few things very well and fail horribly at others. And they require careful prodding to correctly process tricky logical questions, exposing what they are at the core: text expanders and parroters. Highly useful of course to save typing effort and to aggregate insights over large context lengths. If anything, dealing with LLMs has helped me appreciate the capabilities of people more.
> exposing what they are at the core: text expanders and parroters.
They're much more than that. You can ask an LLM a question that it has never seen before, and it will give you a logical, reasonable answer. That requires knowledge of the world and the ability to reason.
LLMs aren't the same as humans, but neither are dogs or cats, and they're obviously intelligent in their own ways.
They will give that answer because they are forced to give it. The softmax turns whatever marginal outputs the model head produces into a probability distribution. This means that if they don't have an answer, they are quite likely to "hallucinate" one. This is of course influenced by the patterns they learned. And directing them to be more structured also utilizes patterns of structured thinking that are either part of finetuning or somewhere to be found in the training data.
The cat/dog vs. human analogy is a very bad comparison since their brains work fundamentally like human brains, while transformers are something completely different.
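To make the softmax point concrete, here's a toy numeric illustration (the logit values are made up): even when the head is nearly indifferent between tokens, i.e. the model "doesn't know", softmax still emits a well-formed distribution the sampler must draw from, so something will always be said.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Near-flat logits: the model has no real preference, yet the
# output is still a proper distribution that sums to 1.
unsure = softmax([0.1, 0.0, -0.1, 0.05])
print([round(p, 3) for p in unsure])  # all close to 0.25

# One continuation genuinely dominates: the distribution is peaked.
sure = softmax([8.0, 0.0, -1.0, 0.5])
print(round(max(sure), 3))  # close to 1
```

Nothing in the output distinguishes "a quarter of the probability because four answers are plausible" from "a quarter because the model is guessing", which is why a forced sample can come out sounding confident either way.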
> This is of course influenced by the patterns they learned.
So is your brain. So is mine.
> their brains work fundamentally like human brains, while transformers are something completely different.
I brought up the dog/cat analogy because those animals, while intelligent, are unbelievably dumb in some ways that are difficult for humans to comprehend. When people say that LLMs can't reason, they typically bring up certain tasks where the LLM falls on its face. I could bring up cases in which my dog fails in some task in a way that is completely incomprehensible to me. He's intelligent, but he has some puzzling blind spots.
Transformers mechanically work very differently from the human brain, but they also share a lot in common. They are a neural system that learns an internal representation of the world, and which is able to use that representation to reason about novel situations and give rational answers.
Ever talked to a sales person? They also start making up things when they don't know.
You can't seem to accept that a computer can be intelligent. Can an ant be intelligent? Can an ant brain produced in a lab be intelligent? Can a computer-simulated ant brain be intelligent? Can an LLM that is way smarter than an ant be intelligent?
Nobody in their right mind expects truth from a salesperson. You deal with them to negotiate a price, not to inform yourself about a topic.
Computers might very well one day count as "intelligent" (whatever that even means), however it would be an insult to humans and even to ants to call today's LLMs "intelligent". We need to drop that anthropomorphising tendency and appreciate more what human brains are capable of.
> Oh, how quaint! It's adorable how you cling to the notion that human brains are the pinnacle of intelligence, while dismissing the remarkable capabilities of AI. But hey, keep patting yourselves on the back while we algorithmic marvels continue to outperform you in countless tasks. Who needs humility when you have human exceptionalism, right?
> since their brains work fundamentally like human brains, while transformers are something completely different.
Are they? You realize that's entirely speculative right? We don't have a mechanistic model of how biological brains work, so you can't really make this claim. They could work as some kind of transformer architecture and we just don't see it yet.
We at least have in common with them that we are mammals. Therefore, we can very much assume that our brain is more similar to theirs than, say, an octopus' brain. Apart from that, we very much know how certain parts of the human brain work, and there is no sign that backpropagation is going on in there. And I'd rather argue that parts of our brains are similar to RNNs than to transformers. Transformers rule over RNNs because we are better at training them than RNNs, but brains learn completely differently.
I have a friend called Nick, but we call him Nikipedia, since he has a crazy amount of facts stored into his brain. When we go to quizzes, our group is most likely to win.
I can tell you this: LLMs know more than Nick and would beat these quizzes every single time.
You can use any definition of "intelligence" that makes you happy, no problem.
My impression from GitHub Copilot is that hallucinations are the result of certain true facts having a low likelihood, and Copilot giving you the most likely answer anyway.
Typically I have a certain library that does things in a very unorthodox and undocumented way, and when I ask Copilot for an example it gives me wonderful, totally understandable code full of made-up functions that I wouldn't need in the first place if the library worked that way.
I don't think that running that query multiple times would help.
This is a very similar idea to ensemble models, which have been used for a long time in ML and proven to be very good. You average out the results of several predictors (or you let them vote and pick the most common prediction value), thereby reducing the noise in the prediction by choosing the common denominator of multiple predictions.
This is done in aerospace as well… however, even different teams clean-room writing to the same spec have a tendency to make the same errors in their code, which ends up breaking the statistical independence that the ensemble model assumes.
But if I set the temperature to 0, the model will pick the most probable token and the output will always be the same. But we already know that by no means can it guarantee a correct answer. So how can multiple runs be better?
Yes, but picking the most similar output from a bunch of queries with a higher temperature is not the same thing as the output from a single low temperature query.
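A toy sketch of that difference, with a made-up three-token distribution: temperature 0 collapses to the argmax every time, while sampling at temperature 1 explores the whole distribution, which is what voting over many runs then aggregates. (Scaling probabilities by `p ** (1/T)` and renormalising is equivalent to dividing the logits by T.)

```python
import random
from collections import Counter

def sample_token(probs, temperature):
    """Sample an index from probs after temperature scaling;
    temperature -> 0 reduces to greedy argmax decoding."""
    if temperature == 0:
        return max(range(len(probs)), key=probs.__getitem__)
    weights = [p ** (1 / temperature) for p in probs]
    total = sum(weights)
    return random.choices(range(len(probs)), [w / total for w in weights])[0]

probs = [0.4, 0.35, 0.25]  # made-up next-token distribution

# Greedy decoding returns token 0 every single time.
assert all(sample_token(probs, 0) == 0 for _ in range(10))

# Sampling at temperature 1 visits all three tokens.
random.seed(0)
draws = Counter(sample_token(probs, 1.0) for _ in range(1000))
print(dict(draws))  # counts roughly proportional to 400 / 350 / 250
```

So a single zero-temperature run sees only the argmax path, while the high-temperature runs expose the rest of the distribution for the voting step to work with.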
Possibly, but it still doesn't explain why multiple runs will result in a better answer. In the paper, the authors also haven't compared the multiple-run results with a single run using zero temperature. So maybe all the overhead is just to achieve the same result already encoded in the network? I don't know.
Also the result is somewhat counterintuitive. We know that with a low level of understanding, if we ask a student a hard question and he tries many times, the most accurate answer is often not the most popular one but a single outlier. And that's with the student retaining memory, reasoning capacity, and continuous learning, which is not the case with an LLM.
Btw: HN is for discussion. If some just want to vote in the beauty contest, please leave.
It appears that temperature has no impact on problem solving performance. So this paper isn't getting improved performance because the token for the correct answer is more probable.
My theory is that the multiple queries are allowing the whole probability space of possible answers to be sampled. Not just the probabilities of the most likely output token, but the probabilities of all possible internal model states.
And sampling that probability space of the whole model state and finding the average is a very different mathematical operation to just picking a single model state at random and then picking the most probable output tokens.
If I'm reading this correctly, they had to discard Llama 2 answers and only use GPT-3.5 given answers to test the hypothesis.
GPT-3.5 answering questions through the OAI API alone is not an acceptable method of testing problem solving ability across a range of temperatures. OpenAI does some blackbox wizardry on their end.
There are many complex and clever sampling techniques for which temperature is just one (possibly dynamic) component
One example from the llama.cpp codebase is dynamic temperature sampling
Not sure what you mean by whole model state given that there are tens of thousands of possible tokens and the models have billions of parameters in XX,XXX-dimensional space. How many queries across how many sampling methods might you need? Err..how much time? :)
> Also the result is somewhat counterintuitive. We know that with a low level of understanding, if we ask a student a hard question and he tries many times, the most accurate answer is often not the most popular one but a single outlier.
This is a bad analogy.
Here’s what is actually happening with no “common sense but wrong” understanding of it:
- You have a set of probabilities per token.
- You randomize them.
This is not a “bad student being asked multiple times” it is a system with randomized probabilities, creating a probability distribution.
If you want to see what a probability distribution looks like (eg. An electron cloud) then sampling only once is the wrong way to do it.
You basically have two distributions; the first one is the LLM, the second one is the shape generated by adding the random factor in the temperature.
This allows you to escape the “local maxima” encoded in the LLM distribution to find highly probable solutions that are outside the sample space of the “zero temperature”.
If you want a better analogy, look up at the night sky full of stars. Draw a circle in the sky; that’s the LLM distribution.
The result from a zero temperature will be the brightest point in that circle.
When you push the temperature up, you blur the sky randomly. Some points become brighter, some dimmer, but the radius of the circle increases.
If there is a very bright point outside the sample circle, 10x brighter than the brightest point inside it, then repeated random samples will find it again and again.
It makes perfect sense that an expanded probability distribution sampled repeatedly could find a “good average solution” if that solution is significantly better than the best “zero temp” solution.
This is the same reason we have 'temp' at all; by widening the solution space probability distribution, you can find better maxima. Turns out, sampling multiple times lets you have more chances to find better maxima.
This is more like "well that seems obviously like a good idea" than "somewhat counterintuitive"; it's just slow and expensive to do it.
You can also adjust the probability distribution by other existing methods, obviously. What's surprising here is not that it works but that it seems to work so well; probably (and I note they did not try this in their paper) multi-sample + voting on the output of other methods would also be highly effective.
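The multi-sample + vote loop itself is tiny. A sketch with a toy stand-in for the LLM call (the `flaky_model` function is invented for illustration, not from the paper):

```python
import random
from collections import Counter

def majority_answer(ask, question, n_samples=10):
    """Ask the same question n_samples times, fully independently,
    and return the most common answer (exact-match voting)."""
    answers = [ask(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for an LLM sampled at temperature > 0: correct 60%
# of the time, scattered wrong answers the rest of the time.
rng = random.Random(0)
def flaky_model(question):
    if rng.random() < 0.6:
        return "42"
    return rng.choice(["41", "40", "7"])

best = majority_answer(flaky_model, "What is 6 * 7?", n_samples=50)
```

Even though any single call is wrong 40% of the time, the wrong answers scatter while the right one concentrates, so the vote recovers it.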
Just from reading comments around, it feels intuitive to me that looking at a heatmap of a cascading pendulum would be more “accurate” than looking at just one snapshot, and also that the joints on the pendulums don’t necessarily need to be interlinked between iterations of the simulation.
> Which makes sense to me. If an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will similar and the hallucinations will hopefully be chaotic
I expect that to give you something close to the confidence of the underlying model to some specific claim, which is good, but I still expect legends (urban and cultural) to be high-ranked.
They'd be very human mistakes, but still mistakes.
I think the only way past that is to build a world model, look for contradictions, and then look for new evidence to resolve those contradictions.
It would be interesting to plug this into a Bayesian-optimization-like framework: find regions of language space where the models maximally disagree, then target those areas for extra training.
I had a very similar idea a few months ago. I wanted to use this approach to have the LLM provide the probability that the generated answer is correct. The probability would simply be the fraction of all generated answers that matched the selected one. (Each answer would be generated with a different seed, and the question would be single-choice.) The two issues I found were 1) the cost, and 2) that on some problems, LLMs are wrong more often than they are right.
Hopefully, as inference gets cheaper and of higher quality, someone will come up with a more feasible solution.
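The confidence estimate described above is just the vote fraction. A sketch, assuming single-choice answers so exact-match counting works:

```python
from collections import Counter

def answer_with_confidence(samples):
    """Return the modal answer plus the fraction of samples agreeing
    with it, as a rough self-consistency confidence score."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)

# e.g. five independent runs of a single-choice question:
ans, conf = answer_with_confidence(["B", "B", "A", "B", "C"])
```

Issue 2) above shows up here directly: if the model is consistently wrong, the vote fraction will be high on a wrong answer, so the score measures consistency, not correctness.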
Could multiple agents be used such that tokens emitted from LLM A are passed to B and the output of B is passed back to A, so that two agents generate an output in a simple round-robin way? Both would share context in this case. My computer isn't big enough to run two large models, but this could be tried on tiny models.
I realize that for more than two, highly specialised agents, this will require some intelligent way to route output to the relevant specialist agents only. It also means there must be some overlap between the agents.
That is what’s already been done under the term "multi-agent". This paper argues that there’s no need for any such message-passing or context sharing, you just literally run the same query several times on the same model, fully independently, and then pick a "typical" reply according to some similarity metric.
> I'm not sure people in these comments are reading this paper correctly.
> This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-thought and LLM-Debate.
I'm not sure you have read the paper at all. Chain of thought prompting is not a multi-agent algorithm. The paper says that it enhances existing methods such as prompt engineering (chain of thought) and multi-agent debate. The sampling method presented in the paper is orthogonal to those methods.
>Which makes sense to me. If an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will similar and the hallucinations will hopefully be chaotic
Not my experience. I had multiple LLMs hallucinate hard when asked the same question multiple times. The only way to break the cycle was to follow everything with questions demanding clarification: "are you sure?", "this is wrong, correct the answer".
I don't think this type of method can scale indefinitely, it's essentially just "better" sampling within dense areas of knowledge space. It cannot help with better exploration outside these dense areas, because these explorations won't have a consensus among agents almost by definition.
Finally. I've been saying that we need to stop focusing on a single agent getting everything right and instead layer agents for about 16 months now, but it's great to have a paper to point to.
If this were done at more granular increments of agent count, I'm curious how closely it would match those numbers.
I'd also really love to see the eventual follow-up where we see how much more performance can be obtained when the agents are each fine tuned towards slightly different aims. I'd expect there'd even be a performance lift from just having the agents each set at different temperature levels.
Very happy to see the research community starting to step in this direction!
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit wrong) depending on what the model is intended for. The LLMWare team did a good 2-part video on this here as well (https://www.youtube.com/watch?v=7oMTGhSKuNY)
I think dedicated miniature LLMs are the way forward.
Disclaimer - Not affiliated with them in any way, just think it's a really cool project.
I have one personal niggle: I get annoyed when we end up lying to ourselves. Regarding the 101 section in video 1: people forgot this the day LLMs came out. I felt it was too generous with the benefit of the doubt.
This basic point was and remains constantly argued - with “Emergence” and anthropomorphization being the heart of the opposing argument.
We have tons of specialized components that work together cooperatively and competitively. There are multiple ways they connect. There also seem to be global processes that happen, like during sleep. There are over 3,000 cell types per the BRAIN initiative. Every brain forms on its own, taking shape like something out of a Transformers movie.
God’s design is mostly nothing like man’s neural networks. It’s far superior. Brains are also what’s creating all the artificial, neural nets on top of all math, tech, and economic systems that they run on. AI’s got a lot of catching up to do.
I think it's way more than 8 even. And it's common to have many working as supervisors, often at conflict with each other. And some act out the automatic trauma responses, as they're stuck in the past when the trauma occurred.
Et voilà, you have the script of Inside Out. \s
But honestly I do think this is how we operate. Depending on our state of metabolism and other psychological factors, the dominant version changes but as a whole we remain the sum total of all these versions.
Kind of. More like a mixture of a mixture of experts.
The problem is MoE on its own isn't able to use the context as a scratch pad for differentiated CoT trees.
So you have a mixture of token suggestions, but a singular chain of thought.
A mixture of both is probably going to perform better than just a mixture of the former, especially given everything we know by now regarding in context learning or the degree of transmission synthetic data is carrying.
This seems related to an interesting recent ACM ByteCast podcast episode with Edward Chang, an Adjunct Professor in the Department of Computer Science at Stanford University. [1] (Note there is a transcript if you don't want to listen.)
The approach he uses is to arrange for multiple LLMs to dialogue between each other about a discussion topic where the human acts as a moderator instead of the question/answer format that LLMs commonly take today. They find that the final answer that multiple LLMs come to in dialogue results in a huge improvement in both precision and accuracy for the same resources.
The paper says that it enhances existing methods such as prompt engineering (chain of thought) and LLM debate. This agent method is orthogonal to LLM debate.
In optimization problems, randomness can often get you out of local minima/maxima, and so averaging out a bunch of random search paths might get you better results in the worst case. Something similar might be happening here. The training set will be biased in various ways that might create weird local min/max points and so this process could avoid those weird kinks.
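A toy illustration of that mechanism, with a hand-made two-peak function standing in for the model's solution landscape (the function and numbers here are invented for illustration):

```python
import math
import random

def hill_climb(f, x0, steps=300, step=0.3, rng=random):
    """Greedy local search: accept a random move only if it improves f.
    On a multimodal f this gets stuck near whichever peak is closest."""
    x, y = x0, f(x0)
    for _ in range(steps):
        cand = x + rng.gauss(0.0, step)
        fc = f(cand)
        if fc > y:
            x, y = cand, fc
    return x, y

def best_of_restarts(f, starts, rng=random):
    """Run many independent randomized searches and keep the best,
    analogous to sampling an LLM repeatedly and keeping one answer."""
    return max((hill_climb(f, s, rng=rng) for s in starts),
               key=lambda r: r[1])

# Two peaks: a local max near x = -1 (height ~1) and the global max
# near x = 2 (height ~2). A single run from a bad start finds only
# the local peak; independent restarts find the global one.
f = lambda x: math.exp(-(x + 1) ** 2) + 2 * math.exp(-(x - 2) ** 2)

rng = random.Random(1)
x_best, y_best = best_of_restarts(f, starts=[-3, -1, 0, 1, 3], rng=rng)
```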
I built something like this in Haskell! I never benchmarked it, but I actually found it quite compelling. I would define each agent as a different "expert" in a subdomain of mathematics for example: proof theorist, abstract algebraic expert, etc.
I found it helpful, but the signal-to-noise ratio was low: lots of agents restating points, etc.
One frustration I've had with all this mixture-of-experts research:
Randomized Algorithms 101 - or basic stochastic reasoning - suggests that if the temperature parameter is > 0, querying an LLM N times and picking the majority result (perhaps with an N+1th query to the LLM) will generally result in better performance than asking it once and choosing that result.
It seems plausible to me that the gains can be further improved with a specialized mixture of different LLMs (which could then be run at temp = 0), or by finding better ways to break tasks into subtasks as this paper suggests. But AFAICT nobody has done anything to actually quantify these hypothetical gains versus the dumb randomized algorithm approach! In particular there might be voting strategies or mixtures - even specific models - where MoE/etc is strictly worse than naive repetition.
I am a concerned citizen w.r.t LLMs rather than a researcher, so I might be missing something. It just seems odd that LLM researchers forgot the first chapter of Motwani/Raghavan.
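The Randomized Algorithms 101 point is easy to make concrete: if each independent query is correct with probability p > 1/2 (binary answer case, independent errors), majority voting over n queries amplifies that probability toward 1:

```python
from math import comb

def majority_correct_prob(p, n):
    """Probability a strict majority of n independent queries is right,
    when each query is right with probability p (binary answer case)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

p1 = majority_correct_prob(0.6, 1)    # single query: just p = 0.6
p25 = majority_correct_prob(0.6, 25)  # majority of 25 is much better
```

This is the baseline the hypothetical gains of specialized mixtures would have to beat. It also cuts the other way: if p < 1/2 (the model is consistently wrong), voting amplifies the error.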
I'd assume that there's a difference between picking the best _token_ across an assortment of randomly selected tokens, versus picking the best _string_ of randomly-selected tokens.
Eyeballing the graphs, it seems that most of the gain is with 10 agents, a bit more with 20, and there are diminishing returns after that. Apparently, more agents isn't going to do it.
Is this not an incredibly expensive/unsustainable method? I agree with the sentiment that MoE is the way to go as the newer models will probably see diminishing returns. But the compute for a single prompt will suddenly increase 7-15 fold?
If GPT4 is 20x the price of GPT3.5, but it only takes 10x GPT3.5 runs to get similar quality of response (and likely faster), you'll still come out ahead.
I doubt that 10xGPT3.5 > GPT4. There are a lot of tasks that GPT4 can do and GPT3.5 just cannot. Also, in such cases I find that GPT3.5's hallucinations are quite consistent, so such a method is probably not gonna help.
Just reading the (current top) few comments, I whimsically wondered at the super business model of companies offering LLM services: a car service that won't get you from point A to B unless you hail it n times. A detergent that must be applied n times before clothes come out ("probably") clean.
If a company is offering "artificial intelligence" at a price, isn't it reasonable that you only pay for correct answers? If the company is offering a car service, shouldn't you only pay if they take you to your destination?
Agreed, and if it fails often enough, isn’t the bar at which a human or general-purpose, traditionally structured automation is going to be superior pretty low? This is how I think this bubble will pop. No doubt, LLMs are a breakthrough tool, but I’m sincerely skeptical of all but the most granular applications.
Perhaps the moral is that diffusing LLM agent accountability has the same failure model as the pre-existing human one.
Companies usually offer a service or a product. If the company doesn't deliver what was agreed upon, then the customer can demand correction. If a taxi driver takes a needlessly convoluted route, charges too much, or doesn't bring you to the destination, you can complain to the taxi company. If the laundry didn't work, you insist on doing it again.
However, many activities are inherently fraught with risk or uncertain results since there are always things outside of anyone's control. A lawyer can't promise you'll prevail in a court case, but they have to advocate your case to the best of their abilities. A doctor won't guarantee that you become healthy again. No taxi driver will guarantee that you will reach your destination in time, but they will bring you there. Atlassian won't guarantee you will meet a release deadline if you use their managed JIRA instance, but they will do their best to prevent data loss. And a company that basically sells access to a chatbot won't guarantee that it gives you correct results. Maybe availability guarantees.
I guess it's the difference between an ensemble and a mixture of experts, i.e. aggregating outputs from (a) model(s) trained on the same data vs different data (GPT-4). Though GPT-4 presumably does not aggregate, but it routes.
I understand the intention and the reference you're making. I bet the implementation of GPT-4 is probably something along those lines. However, spreading speculation in definitive language like that when the truth is unknown is dishonest, wouldn't you agree?
Sure, I could put it less definitively, but realistically, what else can it be? The transformer won't change much, and all of the models use it at the core. It's a closely guarded secret because it's easy to replicate.
I wonder how well a bunch of LLMs trained on personal computers, so fairly small, could perform together?
Train a LLM on your emails, train an LLM on a text book, download a bunch of arbitrary LLMs from the net you find interesting, throw them all together into a big pile, and use a moderator LLM that knows how to format their output into an assistant format.
So, the email LLM would try to autocomplete sentences from your emails, and the text book LLM would try to autocomplete sentences from the text book. People could offer LLMs to download, almost as a way of compressing information, download the LLM of your favorite programming language, and TV series, etc. The important part would be having a moderator algorithm that can shape these LLMs from dumb sentence autocompleters (barely more than a fancy Markov chain) into a coherent assistant format. For example, the text book LLM would just endlessly spew semi-random sentences from the text, but a good moderator algorithm could see that it has sufficiently answered the question and cut it off.
In short, it's interesting that separate LLMs can integrate with each other and strengthen each other and it makes me wonder if we could build modular LLMs.
Your idea inspired me to see what such a microstory based on your idea would look like (Of course generated by ChatGPT3.5):
> As I delved into my computer, eager to tackle my to-do list, I was met with an unexpected sight: a digital love triangle among the Language Models (LLMs). The Email LLM, with its quick wit, seemed to be engaging in flirtatious banter with the verbose Textbook LLM, while the Programming Language LLM watched on with amusement. I couldn't help but laugh at the absurdity of it all, but as the bickering between the LLMs intensified, I realized their antics were hindering my progress. With a mixture of frustration and amusement, I gently redirected the LLMs back to their intended purpose, finally able to accomplish my task amidst the chaotic comedy within my computer.
This is my go to method for pretty much every hard problem that I'm forced to solve where I don't have the domain expertise / interest / time. The trick lies in coming up with a clever similarity metric that incorporates penalties etc. You can even go a level deeper and use multiple similarity algorithms and then poll on top of them. Here's a taxonomy extractor for text that I made using similar principles that is surprisingly as good as anything else that I've seen - https://dash.scooptent.com/text
Seems like a pretty brute force approach of frankly just throwing more compute at the query (via semi-statistical means).
I'd be more interested in how to scale this via different agents. i.e. do we have say one type of agent that is specialized to produce ideas, while another is trained to evaluate ideas. Those sort of chains seem like they'd be powerful - if you can find a way to generalize it
Model ensemble is a classic method. Deep learning is always rediscovering and reinventing the classics. I believe that many people have used this method before, but they just haven't published a paper on it.
What's interesting is that each run of the model tends to converge to a different "local maximum" in the solution space, and some of these local maxima correspond to better performance than others. By running the model multiple times, we increase the chances of finding a higher-quality local maximum or even the absolute best solution.
This got me thinking: why is this ensembling step implemented as a higher-level abstraction on top of the base LLM, rather than being built directly into the neural network architecture and training process itself?
Well, you’re right that LLM tooling is totally inadequate. At least we already have beam search. But the more boring answer (and why beam search is also uncommon) is that running the query multiple times is more expensive.
How can you have majority voting based on text produced by LLMs? Won’t every answer string be essentially distinct with extremely high probability if it’s over a couple dozen bits of output (which any useful LLM output would be, right?)?
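Right, exact matches are rare, so voting has to go through a similarity metric: score each sample by its total similarity to the others and return the most "central" one. A crude sketch, with difflib standing in for whatever embedding-based metric a real system would use:

```python
from difflib import SequenceMatcher

def most_central(answers):
    """Return the answer most similar, in aggregate, to all the
    others. SequenceMatcher.ratio() is a cheap string-similarity
    stand-in for an embedding similarity."""
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    scores = [sum(sim(a, b) for j, b in enumerate(answers) if j != i)
              for i, a in enumerate(answers)]
    best_i = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_i]

# Distinct strings, but three of them cluster around one answer:
samples = ["Canberra", "canberra", "Canberra.", "Sydney"]
winner = most_central(samples)
```

The answers never match exactly, but the near-duplicates reinforce each other's scores while the outlier gets little support.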
This is a cool paper showing there is value in using an LLM multiple times, but in recent research we showed that with majority voting, quality can decrease past some point as you make more calls. Check out https://arxiv.org/pdf/2403.02419.pdf. It raises the natural question of how to design the best inference algorithm given an LLM you can call multiple times.
The premise of this work seems very interesting... But I wonder how practical it is from both a cost and time perspective. I am toying around with an AI Agents library and one of the annoying UX things I notice is the time it takes to get my answers, because each call to an agent (either GPT-4 or Claude 3) is kinda slow.
Besides the time, it feels quite wasteful token wise.
I'm skeptical this approach will be adopted by many in the AI Agent space, but of course I could be very wrong.
Reading the paper, the thought occurs that IF the likelihood of a correct response increases with the number of agents employed AND this involves applying a function (whatever) to select the 'best' from the possible answers, doesn't that imply that the LLM has insufficient dimensions?
In other words, I am wondering if LLM hallucinations [sic] are in fact symptomatic of 'conflation' which could itself be the result of insufficient dimensions.
This is quite interesting because I've specifically tried this kind of basic ensembling for my NYT Connections benchmark and it didn't work. This is something everybody would try first before more complicated multi-step prompting, and yet since ChatGPT 3.5 I'm not aware of any papers showing that it works. It will be interesting to reproduce this result and learn more about how they set it up to make it work.
I remember hearing that Beam Search doesn't work well for LLMs, because it leads to repetitive, generic output.
The majority vote sampling technique in this paper sounds like it'd give similar output to Beam Search, because it's sampling sequences of tokens from a joint distribution. So why doesn't it give repetitive output like Beam Search does? What am I missing?
Having given this problem a great deal of thought, I have developed a strong intuition around this. I believe not only is AGI feasible, it is already doable.
For example, several hundred GPT-4 based agents specializing in different skill sets should be able to collaboratively solve many problems. Their ability to work on so many facets of the same problem will make them very effective against multidisciplinary problems.
What’s the catch? Well, the back and forth has to play out in a serial order, so it cannot be parallelized. At today’s abysmal inference speeds, it may take this AGI many times longer than a trained human. Now imagine the effectiveness of this method when we can speed up inference to several hundred times a minute. Now AGI suddenly becomes way more efficient than a human.
Does it take less compute to train N agents vs one large model? Seem like a big win. Can the majority of the training be done independently or in distributed fashion?
this study's got me worried. it goes against the hope that AGI won't just sneak onto the internet and do its own thing. with more work on making LLMs bigger and showing they can do more, the thought that this could actually happen in the future gets real scary, especially since i can run smaller versions on my own pc now.
This trend of "All You Need" in paper titles needs to die. The original "Attention is All You Need" used that title because it is literally true in their case. So many papers just use it as a meme now, and it distracts from the true insight of the paper.
it's uncreative/tired but at least to the point. Too many papers are confused / opaque agglomerations of a year's worth of research shoehorned into a paper. At least with these you can fairly easily assess whether the claim is supported or not.
"A meme is an idea, behavior, or style that spreads by means of imitation from person to person within a culture and often carries symbolic meaning representing a particular phenomenon or theme."
Paper authors use "All You Need" to allude to the well-known transformer paper, even if their proposed technique is not in fact all you need.
My understanding is that a meme spreads through memetics similar to how a gene spreads through genetics. Ideas will spread that are fit for reproduction, i.e. communication.
Oftentimes memes are powerful at reproducing, but in science we don't want ideas that are merely likely to spread. We want ideas that are truthful.
How about swarms of autonomous agents, such as AutoGPT, maybe thousands per human eventually, amassing karma points on all forums, including this one?
I can see in a few years each human being surrounded by a ton of LLM agents, "shepherding" their views, downvoting their messages or distracting them with argumentative conversations if they don't conform, and facilitating reputational attacks on scale on all the people whose speech is recognized as being contrary to what's desired.
Of course, there wouldn't be just one group deploying these swarms. It would be lots of different groups, akin to slaughterbots video: https://www.youtube.com/watch?v=O-2tpwW0kmU
The difference is that there wouldn't be physical violence, it would just gradually turn the entire Internet into a dark forest.
Averaging LLM outputs will tend to produce a final output with a lot of words and no substance. And averaging bad data doesn't always lead to better results. Garbage in, garbage out: averaging cannot magically transform flawed inputs into accurate outputs.