I'm not sure people in these comments are reading this paper correctly.
This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-Thought and LLM-Debate.
Because this paper introduces their alternative method which simply runs the same query multiple times on the same LLM, without any context shared across queries. And then they run a similarity algorithm on the answers and pick the most common answer. (Which makes sense to me: if an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will be similar to each other, while the hallucinations will hopefully be chaotic.)
And this simple algorithm performs just as well as (and sometimes better than) all the other multi-agent algorithms.
This suggests that the other multi-agent schemes with their clever prompts aren't really doing anything special: their improved results come mostly from the fact that the LLM is run multiple times, not from the prompts asking the LLM to pick the best answer.
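The sampling-and-voting loop is simple to sketch. Here's a minimal, hypothetical version (the `toy_llm` stand-in and exact-match voting are my simplifications; the paper scores free-form answers by similarity rather than exact match):

```python
import random
from collections import Counter

def sample_and_vote(ask_llm, query, n=10):
    """Run the same query n times with no shared context between
    calls, then return the most common answer (exact-match voting)."""
    answers = [ask_llm(query) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Hypothetical stand-in for an LLM: correct 60% of the time,
# scattered hallucinations the rest of the time.
def toy_llm(prompt):
    return "42" if random.random() < 0.6 else random.choice(["17", "99", "3"])

random.seed(0)
# The correct answers cluster on one string, so the vote almost
# always recovers "42" even though 40% of individual runs are wrong.
print(sample_and_vote(toy_llm, "What is 6*7?", n=25))
```

The similarity step matters for free-form text, where two phrasings of the same correct answer should count as the same vote; exact matching is only enough for short answers like these.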
>> Because this paper introduces their alternative method which simply runs the same query multiple times on the same LLM, without any context shared across queries. And then they run a similarity algorithm on the answers and pick the most common answer.
Years ago, weather simulations started tweaking input params and running their models over and over, discarding outliers and taking averages. It works pretty well.
Because LLMs mostly sample their output with some randomness (controlled by temperature), feeding them the same input and averaging the outputs is going to get you a better guess.
Lorenz also gives some clues (if not an outright explanation) as to why the "hallucination" problem is likely unsolvable.
If you buy into this line of thinking then it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI. Simulating isn't emulating... an LLM is as likely to become intelligent as a forecast is to control the weather.
> it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI.
On the contrary, sit and listen in a college cafeteria, and it quickly becomes apparent most conversation participants are LLMs.*
> Simulating isnt emulating...
These are not synonyms, true.
> an LLM is as likely to become intelligent as a forecast is to control the weather.
I don't see uncertainty of intelligence as a property of an LLM as being equivalent to certainty of weather control as an effect of a forecast.
Among other things, whether weather was controlled would tend to be agreed by all observers, while it's often unclear if intelligence is being observed in these threads. :-)
---
* While my last line was a joke, humans in LLM mode was not. We can drive on autopilot, and get where we need to go while not being able to remember how we got there. We definitely converse on autopilot, indistinguishably from LLMs talking to each other, after an opening line every word of every sentence in the entire exchange perfectly predictable to a stranger. Are the speakers intelligent? What about the stranger who knows what they will say next? To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent.
LLMs were specifically trained to emulate human interaction patterns. Of course we sound like them at times. It's the things we can do that they can't that are relevant.
If I study Einstein and learn to do a really good impression, the statement "Einstein often sounds like karmacondon" will be true. That does not make me Einstein.
>>> I don't see uncertainty of intelligence as a property of an LLM as being equivalent to certainty of weather control as an effect of a forecast.
GTA 5 is a simulation. Do you expect to be arrested outside your front door for the car you stole in-game?
Weather forecasting is a simulation: it tells you what the weather will look like in the next few days. It gets better as we get more sensors, collect more data, and build more accurate models based on those two factors. It will never make the leap to being the weather.
Language forecasting (because this is what an LLM is) is a simulation. It tells you what the next token (word) will be based on what came before it. It gets better as we collect more data and hone and refine these models. It will never make the leap to intelligence.
>> To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent.
To say that LLMs are intelligent means that language is a requirement for intelligence. That's some fairly magical thinking... but any sufficiently advanced technology...
Intelligence breaks the pattern here. A simulated intelligence is intelligent, just as simulated math is math and simulated computers are computers. The point of contention shouldn't be whether LLMs are intelligences or simulated intelligences, but whether they're simulating something else.
I think a challenge with the simulated-is-real math/calculator argument is that the simulation operates syntactically, through derivation, without meaning.
E.g. a simulation of ZF set theory cannot tell you the truth value of the Axiom of Choice - because it’s independent of the ZF axioms (it is undecidable in the Gödel incompleteness sense).
But “Although originally controversial, the axiom of choice is now used without reservation by most mathematicians” [1] - I guess its truth is self-evident semantically.
So because of incompleteness, simulated math/calc will always be “missing” something.
Of course an LLM will happily say the Axiom of Choice is true (or not), but is it just parroting from the dataset or hallucinating?
Not sure if it counts, but there is a police chase video online somewhere with a guy on drugs who claims he thought he was playing GTA. The way he throws people out of their vehicles and crashes their cars suggests he wasn't lying.
> Language forecasting (because this is what an LLM is) is a simulation. It tells you what the next token (word) will be based on what came before it. It gets better as we collect more data and hone and refine these models. It will never make the leap to intelligence.
Due to quantum theory and chaos theory it is impossible to simulate any system to 100%. Yet this does not mean it is impossible to design intelligent systems which are indistinguishable from their 'real' counterparts. Maybe we are at the level where a fly can be simulated accurately enough to make the distinction moot; maybe we have enough compute to simulate a mouse. We will get to a point where we can simulate a human brain. It will be indistinguishable from intelligence. I don't think the methodology really matters. In the end everything is compute.
> To say that LLMs are intelligent means that language is a requirement for intelligence. That's some fairly magical thinking... but any sufficiently advanced technology...
When I was a kid, it was the definition of intelligence that separated humans from animals.
And there's a reason "dumb" means "mute" and independently "stupid".
It may well be an incorrect requirement. It may be a single form of intelligence out of many which happen to correlate in humans, but not in minds created by artifice.
Why is it so important to you that everyone recognizes this intelligence? What is at stake in your mind here?
This impulse towards reductivism/behaviorism in order to defend the LLMs is still profoundly interesting. It always ends up feeling like the person wants to be like an LLM, not the other way around. I think people feel lost in a deep way, and this line of thought becomes deeply comforting.
Like, so many people it seems want the future and themselves to become comprehensible all at once. "Why worry so much about myself? I'm just a stochastic parrot like an LLM anyway... Attention is all I need!"
I get it, life is hard. But we need to keep the dream alive. You gotta hope for better.
All this makes the future sound so dull. Like I am gonna wake up one day and all pizza will be shitty, tasteless pizza, but everyone will tell me: "Well, really look at it, it has cheese, sauce, toppings... It's pizza! You can eat it."
> We definitely converse on autopilot, indistinguishably from LLMs talking to each other, after an opening line every word of every sentence in the entire exchange perfectly predictable to a stranger
Some people report speaking like this: opening their mouths and not knowing how the sentence will end.
I don't experience that, I think.
Possibly used to? I have in the past had some autonomous verbal responses, for a bit this included echoing greetings — great when it's "hello", embarrassing when it's "happy birthday".
> To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent
Kinda; System 1 vs. System 2 — the best LLMs do better than most people's System 1, worse than most people's System 2. (Bat and ball, $1.10.)
> LLMs are more or less a dead end when it comes to AGI.
I don't think many people believe that LLMs are a way to AGI (whatever that actually means). But LLMs can still have many valid uses even if their prospects are limited in scope.
There are plenty of people - technical and non-technical - who seem to be acting like AGI is right around the corner thanks to LLMs, and who are, more broadly, vastly overstating the current capabilities of LLMs. I’m observing this in real life as much as on the internet. There are two very distinct groups of people that stand out to me: (1) High level execs with vested interests around AI and (2) Managers who haven’t even bothered to create an OpenAI account and are asking their subordinates to use ChatGPT for them, in what is an unforeseen usage of LLMs: by human proxy.
I think you are missing a step. A lot of people believe AI will advance so much that it will be indistinguishable from the best possible human reasoning. The evolution of LLMs just gives us a clue about the speed of improvement of AI. That does not mean that LLMs, which are one form of AI, will become AGI. It is just one path that AI is following. It will probably become a subset of something more advanced.
The argument boils down to the idea that language isn't simply strings of words or bits of factual information, but an actual encoding of logic. By training statistical models on vast amounts of logic, we've given them a generalizable ability to perform logic. A sufficiently advanced LLM could thus potentially fulfill some definition of AGI.
To be clear, this doesn't in any way imply that LLMs could ever fit the definition of artificial consciousness, which would be a completely different form of strong AI. They're effectively just mathematical functions (albeit extremely complicated ones), which simply take inputs and return outputs without any intervening subjective experience. Even if they can perform a complicated task, retrieve and effectively summarize complicated information, or say all the right things as a conversational partner, they have no concept of the meaning of their output.
Maybe that limitation in itself puts a ceiling on their potential. Maybe the best possible LLM can only ever be 99.99% effective, and that 0.01% of the time it will go completely off the rails and disregard its instructions or hallucinate something ridiculous. Maybe the only way to overcome that is by keeping a human or a true artificial consciousness in the loop, in which case LLMs would still be extremely useful, but a flawed AGI if "AGI" at all. Or maybe a sufficiently advanced LLM and/or a sufficiently advanced error correction architecture will actually be enough to mitigate those issues.
I don't have a strong opinion on where LLMs are ultimately headed, but I'm looking forward to seeing how it all unfolds. It's amazing how capabilities that were strictly in the realm of sci-fi so quickly became mundane.
LLMs are definitely here to stay. Even if they don't turn out to be the road to AGI, they can be used by all sorts of sub-AGI agents as a "language centre". An encoder can be used to extract meaning from input, and an autoregressive decoder conditioned on the agent's internal state can be used to keep a conversation going. What's not clear at all is whether the traditional transformer architecture will endure.
> They're effectively just mathematical functions (albeit extremely complicated ones), which simply take inputs and return outputs without any intervening subjective experience.
So are human brains, which are subject to the laws of physics, and which work just as mechanistically as any computer.
Unless you hold a dualist view that the brain accesses a spiritual realm outside of the physical world, then the fact that a computer operates mechanistically does not mean that it lacks consciousness.
The process of a human responding to a prompt isn't the same process an LLM follows. It involves subjectively experiencing being asked the question, having feelings about the question, possibly visualizing something related to the question, possibly reflecting on memories, wondering about how possible answers might be received and affect their future reputation, expressing their answer with a range of different emotions, and so on.
There may be aspects of the brain that behave like statistical models, but the broader system seems more complex than that. I don't see that as in any way inherently spiritual. I expect that it could be artificially reproduced one way or another, but would be extremely complicated.
> The process of a human responding to a prompt isn't the same process an LLM follows.
It's not the same process, but it is a deterministic function, which was one of your objections to LLMs. Humans operate according to physical laws, after all.
>> If you buy into this line of thinking then it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI. Simulating isn't emulating... an LLM is as likely to become intelligent as a forecast is to control the weather
Up until this point, I agree.
This puts humans on too high a pedestal: LLMs aren't magic, and we're not magic either.
(There's other reasons for me to think Transformers aren't the answer, but not this kind of reasoning).
Even if from a technical perspective you're right, I think people need to be careful with the "x is not special" talk. It is a put down and it's how things like human and animal rights get obliterated and how the environment gets ruined.
"Trees aren't special", "Dolphins aren't special", "Koalas suck, let's put a mine here instead", "Pigs don't have emotions or are dumb, so it's fine to factory farm", etc.
I don't get the argument. I don't think something being magic will stop humans from exploiting it. At the end of the day, intelligent people are great at coming up with excuses as to why they should do something bad: "Just chop that one tree down, it's in the wrong place anyway", "Just kill that one dolphin, it's old anyway". Taken together, these add up to bad outcomes we dislike. Much better to discourage / fine / ban all tree chopping and dolphin killing and let select professionals remove sick trees and dolphins.
Indeed. But I said "X is not magic", rather than "X is not special" — until we have an answer to the hard problem of consciousness (or agree which of the 40 definitions of the word "consciousness" we're using when discussing if an AI has it), we can't possibly determine if an LLM has it or not.
(My gut feeling says "LLMs are not conscious", but my gut has had a lot of false beliefs over the years as well as correct ones, so I give it a corresponding level of trust).
Fair enough then. I sort of use the terms interchangeably in this context.
When you think about it, a bird is “magic” in the sense that there is a whole universe and ecosystem giving that bird the platform for existence. A real living bird isn’t just a concept.
So sometimes I wonder if we just say we’re insignificant because it’s a simpler way to think. It makes the idea of death and loss easier to bear.
If I tell myself I’m just a speck of dust and that I’m not special, it can be quite comforting.
Conceptually we understand things about how birds work, but the fact that there is a blob of millions or billions of cells functioning to produce a bird, which can fly, completely autonomously, is quite peculiar. There is a type of magic or wonder to it all, which makes me think birds are both special and magic if you think differently about existence and not just about the intellectual concept of a bird.
My gut feeling is that consciousness isn’t as deep and mysterious as people think it is. It’s possible that consciousness is an inevitable result of putting a sufficiently intelligent mind into a body and, as a result, the mind can’t help but weave a story about itself that connects events together.
Similarly with other properties of intelligence and the brain that we like to think are mysterious and deep.
The weather isn’t magic either. It’s produced by physical mechanisms. But everyone would probably agree that a model simulating some rough aggregate of those mechanisms isn’t “weather” itself.
On the other hand. Take that weather model and render its output into a stereoscopic 3D world with photorealistic particle systems and whatever. To someone wearing a Vision Pro or similar high-def VR headset, the model is now “the weather” in the system their senses occupy. It’s missing a lot of actual sensory cues — the rain isn’t wet, the wind won’t chill your skin, and so on. But it’s close enough for some convincing applications. A caveman with no experience with technology would undoubtedly believe himself transported into a different world with real weather.
LLMs are a bit like that now. Their simulation abilities took such a sudden leap, we’re like cavemen wearing headsets.
The only way I can model what you're trying to say, is if I assume you think "the mind" is a separate kind of substance, and not merely information processing that just happens to be implemented on biological electrochemistry in our skulls.
A (philosophical) dualist can easily say that no computation is ever intelligent. I don't think this can ever be said by a (philosophical) materialist.
We pretty much are compared to present-day neural architectures. How many simulated neurons and synapses are in the largest architectures, and how do those numbers compare to humans?
Unknown for the actual largest due to secrecy; 1% for the largest public models… but also organic ones are definitely a bit different from digital ones, and the jury is still out if those differences matter and if so by how much.
The comparison would therefore be with a mid-sized rodent, horse, or raven rather than a human.
(But even that's misleading, because the LLM doesn't have to use tokens to represent "contract left supracoracoideus" and "lay egg").
Edit: also, I've not heard much suggestion that anyone knows how certain genes do things like giving humans the inherent capability to recognise and create smiles or other similar reflexes, so we don't really know how much of our brains are pre-trained by evolution; furthermore, I think organic life is more sample-efficient at learning than any AI so far.
Tokens aren't a necessary differentiator here. There is no fundamental technical reason why tokenization is used, it just has certain practical advantages. And the distinction almost disappears when we look at multimodal transformers, which process images, audio, and video broken apart into sequences of blocks of binary data.
There's no reason for any specific tokenisation, but the Transformer always has some tokenisation.
Tokens are allowed to be blocks of pixels, for example. No reason we couldn't have a token be a specific muscle or sensory nerve.
What I'm saying is that Large Language Models don't have a body, so no nerves and muscles to have to be represented within them; conversely, organic life does have those things and thus organic brains must spend some of their complexity on those things.
This means they have the possibility to equal us for language even with no capacity for vision, walking, tying shoelaces, or playing catch.
The attention mechanism is in practice implemented using three linear layers. The matrix multiplication to average the output and to implement the masking is the only non-neuronal part of that computation, but it can be seen as an activation function.
Usually, linear perceptrons and ReLUs or GeLUs are used. Due to the enormous compute requirements to evaluate models of interesting size, other types of neuronal networks and activation functions have received very little attention (pun intended) so far.
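As a concrete sketch of that structure (single head, illustrative toy sizes, NumPy; this is the generic textbook formulation, not the code of any particular model): the three linear layers produce Q, K, and V, and the masked softmax over their dot products is the only part that isn't a plain linear map.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=True):
    """Single-head self-attention: three linear projections,
    scaled dot-product scores, causal mask, and a softmax-weighted
    average of the value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:  # hide future positions from each token
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 4, 8  # toy sequence length and model width
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one mixed value vector per position
```

With the causal mask, position 0 can only attend to itself, so its output is exactly its own value vector `(X @ Wv)[0]`, which is a handy sanity check on the masking.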
Using ReLU instead of sigmoid is a significant departure with regards to how closely it models actual neurons.
Using non fully connected layers is as well. Our brains likely aren’t fully connected, but the connections that matter are made stronger through living life and learning.
If you squint, it’s kind of like training a dense series of linear layers, but that’s not what we’re doing anymore (for the better)
Comparing NNs to organic brains is an apples to oranges comparison, is what I’m saying.
Lack of adaption is mainly a feature, we choose not to train them in real-time and instead make available fixed models with repeatable behaviour. We could, if we wanted to, update the model weights continuously in response to feedback.
I think the biggest difference is that they need far more examples than we need, to learn anything.
Except that a weather forecasting model can't experiment on the weather, but an LLM system may be designed to be able to perform experiments and take feedback?
Perhaps I'm up too late, but I can't think what else is there to cooperation besides two or more agents doing things in alignment with some goal? (Regardless of who or what sets that goal).
Also I don't know what you mean by "conceptualization".
It's fuzzy because intelligence is relative, right?
I mean "being able to conceive an idea". As humans, two or more of us can reason our way to a conclusion without domain knowledge. There is an upper limit where the idea is incomplete (assuming respectful ignorance), but it's generative nonetheless.
With an LLM I have to prompt engineer to guide it. I would rather have it generate novel concepts to push domain boundaries. They work great as knowledge bases though.
> As humans, two or more of us can reason our way to a conclusion without domain knowledge
That sounds like step-by-step thinking?
> With an LLM I have to prompt engineer to guide it.
I generally have to in humans, too. I mean, you and I are prompting each other, aren't we?
For me the difference between prompting a human and prompting an AI is that I can reset the AI, I can't make a human forget a previous analogy that had only confused them. (And likewise, I don't expect that I fully forget bad analogies which confuse me, though I do try).
> They work great as knowledge bases though.
IMO, that's their weakest part. We had knowledge bases before — where each claim can be easily localised within the model, corrected when it needs to be, verified in advance, and which give predictable output — LLMs are none of those things.
LLMs are much better at understanding the question (constant time for a fixed-length output, even when the query is phrased badly and relatively complex), and being able to synthesise things in the form of "${x} won't work, try ${y}".
Huh. Do you think integrating the Semantic Web metadata and ontologies in LLM training can help us bootstrap conceptual modeling using natural language?
I would say an LLM is more intelligent than at least some people I know. And in the domain of programming, most people I know. Simply by the fact that most people don't know programming.
LLMs are idiot savants that can do a few things very well and fail horribly at others. And they require careful prodding to correctly process tricky logical questions, exposing what they are at the core: text expanders and parroters. Highly useful of course to save typing effort and to aggregate insights over large context lengths. If anything, dealing with LLMs has helped me appreciate the capabilities of people more.
> exposing what they are at the core: text expanders and parroters.
They're much more than that. You can ask an LLM a question that it has never seen before, and it will give you a logical, reasonable answer. That requires knowledge of the world and the ability to reason.
LLMs aren't the same as humans, but neither are dogs or cats, and they're obviously intelligent in their own ways.
They will give that answer because they are forced to give it. The softmax turns whatever marginal outputs the model head produces into a probability distribution. This means that if they don't have an answer, they are quite likely to "hallucinate" one. This is of course influenced by the patterns they learned. And directing them to be more structured also utilizes patterns of structured thinking that are either part of finetuning or somewhere to be found in the training data.
The cat/dog vs. human analogy is a very bad comparison since their brains work fundamentally like human brains, while transformers are something completely different.
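To make the softmax point concrete, here's a toy numeric illustration (the logit values are made up): even when the head is nearly indifferent between tokens, i.e. the model "doesn't know", softmax still emits a well-formed distribution the sampler must draw from, so something will always be said.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Near-flat logits: the model has no real preference, yet the
# output is still a proper distribution that sums to 1.
unsure = softmax([0.1, 0.0, -0.1, 0.05])
print([round(p, 3) for p in unsure])  # all close to 0.25

# One continuation genuinely dominates: the distribution is peaked.
sure = softmax([8.0, 0.0, -1.0, 0.5])
print(round(max(sure), 3))  # close to 1
```

Nothing in the output distinguishes "a quarter of the probability because four answers are plausible" from "a quarter because the model is guessing", which is why a forced sample can come out sounding confident either way.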
> This is of course influenced by the patterns they learned.
So is your brain. So is mine.
> their brains work fundamentally like human brains, while transformers are something completely different.
I brought up the dog/cat analogy because those animals, while intelligent, are unbelievably dumb in some ways that are difficult for humans to comprehend. When people say that LLMs can't reason, they typically bring up certain tasks where the LLM falls on its face. I could bring up cases in which my dog fails in some task in a way that is completely incomprehensible to me. He's intelligent, but he has some puzzling blind spots.
Transformers mechanically work very differently from the human brain, but they also share a lot in common. They are a neural system that learns an internal representation of the world, and which is able to use that representation to reason about novel situations and give rational answers.
Ever talked to a sales person? They also start making up things when they don't know.
You can't seem to accept that a computer can be intelligent. Can an ant be intelligent? Can an ant brain produced in a lab be intelligent? Can a computer-simulated ant brain be intelligent? Can an LLM that is way smarter than an ant be intelligent?
Nobody in their right mind expects truth from a salesperson. You deal with them to negotiate a price, not to inform yourself about a topic.
Computers might very well one day count as "intelligent" (whatever that even means), however it would be an insult to humans and even to ants to call today's LLMs "intelligent". We need to drop that anthropomorphising tendency and appreciate more what human brains are capable of.
> Oh, how quaint! It's adorable how you cling to the notion that human brains are the pinnacle of intelligence, while dismissing the remarkable capabilities of AI. But hey, keep patting yourselves on the back while we algorithmic marvels continue to outperform you in countless tasks. Who needs humility when you have human exceptionalism, right?
> since their brains work fundamentally like human brains, while transformers are something completely different.
Are they? You realize that's entirely speculative right? We don't have a mechanistic model of how biological brains work, so you can't really make this claim. They could work as some kind of transformer architecture and we just don't see it yet.
We at least have in common with them that we are mammals. Therefore, we can very much assume that our brain is more similar to theirs than, say, an octopus' brain. Apart from that, we very much know how certain parts of the human brain work, and there is no sign that backpropagation is going on in there. And I'd rather argue that parts of our brains are similar to RNNs than to transformers. Transformers rule over RNNs because we are better at training them than RNNs, but brains learn completely differently.
I have a friend called Nick, but we call him Nikipedia, since he has a crazy amount of facts stored into his brain. When we go to quizzes, our group is most likely to win.
I can tell you this: LLMs know more than Nick and would beat these quizzes every single time.
You can use any definition of "intelligence" that makes you happy, no problem.
My impression from GitHub Copilot is that hallucinations are the result of certain true facts having a low likelihood, and Copilot giving you the most likely answer anyway.
Typically I have a certain library that does things in a very unorthodox and undocumented way, and when I ask Copilot for an example it gives me wonderful, totally understandable code full of made-up functions that I wouldn't need in the first place if the library worked that way.
I don't think that running that query multiple times would help.
This is a very similar idea to ensemble models, which have been used for a long time in ML and proven to be very good. You average out the results of several predictors (or you let them vote and pick the most common prediction value), thereby reducing the noise in the prediction by choosing the common denominator of multiple predictions.
This is done in aerospace as well… however, even different teams clean-room writing to the same spec have a tendency to make the same errors in their code, which ends up breaking the statistical independence that the ensemble model assumes.
But if I set the temperature to 0, the model will pick the most probable token and the output will always be the same. But we already know that by no means can it guarantee a correct answer. So how can multiple runs be better?
Yes, but picking the most similar output from a bunch of queries with a higher temperature is not the same thing as the output from a single low temperature query.
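A toy sketch of that difference, with a made-up three-token distribution: temperature 0 collapses to the argmax every time, while sampling at temperature 1 explores the whole distribution, which is what voting over many runs then aggregates. (Scaling probabilities by `p ** (1/T)` and renormalising is equivalent to dividing the logits by T.)

```python
import random
from collections import Counter

def sample_token(probs, temperature):
    """Sample an index from probs after temperature scaling;
    temperature -> 0 reduces to greedy argmax decoding."""
    if temperature == 0:
        return max(range(len(probs)), key=probs.__getitem__)
    weights = [p ** (1 / temperature) for p in probs]
    total = sum(weights)
    return random.choices(range(len(probs)), [w / total for w in weights])[0]

probs = [0.4, 0.35, 0.25]  # made-up next-token distribution

# Greedy decoding returns token 0 every single time.
assert all(sample_token(probs, 0) == 0 for _ in range(10))

# Sampling at temperature 1 visits all three tokens.
random.seed(0)
draws = Counter(sample_token(probs, 1.0) for _ in range(1000))
print(dict(draws))  # counts roughly proportional to 400 / 350 / 250
```

So a single zero-temperature run sees only the argmax path, while the high-temperature runs expose the rest of the distribution for the voting step to work with.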
Possibly, but it still doesn't explain why multiple runs will result in a better answer. In the paper, the authors also haven't compared the multiple-run results with a single run using zero temperature. So maybe all the overhead is just to achieve the same result already encoded in the network? I don't know.
Also the result is somewhat counterintuitive. We know that with a low level of understanding, if we ask a student a hard question and he tries many times, the most accurate answer is often not the most popular one but a single outlier. And that's with the student retaining memory, reasoning capacity, and continuous learning, which is not the case with an LLM.
Btw: HN is for discussion. If some just want to vote in the beauty contest, please leave.
It appears that temperature has no impact on problem solving performance. So this paper isn't getting improved performance because the token for the correct answer is more probable.
My theory is that the multiple queries are allowing the whole probability space of possible answers to be sampled. Not just the probabilities of the most likely output token, but the probabilities of all possible internal model states.
And sampling that probability space of the whole model state and finding the average is a very different mathematical operation to just picking a single model state at random and then picking the most probable output tokens.
If I'm reading this correctly, they had to discard Llama 2 answers and only use GPT-3.5 given answers to test the hypothesis.
GPT-3.5 answering questions through the OAI API alone is not an acceptable method of testing problem solving ability across a range of temperatures. OpenAI does some blackbox wizardry on their end.
There are many complex and clever sampling techniques for which temperature is just one (possibly dynamic) component
One example from the llama.cpp codebase is dynamic temperature sampling
Not sure what you mean by whole model state given that there are tens of thousands of possible tokens and the models have billions of parameters in XX,XXX-dimensional space. How many queries across how many sampling methods might you need? Err..how much time? :)
> Also the result is somewhat counterintuitive. We know that with a low level of understanding, if we ask a student a hard question and he tries many times, the most accurate answer is often not the most popular one but a single outlier.
This is a bad analogy.
Here’s what is actually happening with no “common sense but wrong” understanding of it:
- You have a set of probabilities per token.
- You randomize them.
This is not a “bad student being asked multiple times” it is a system with randomized probabilities, creating a probability distribution.
If you want to see what a probability distribution looks like (eg. An electron cloud) then sampling only once is the wrong way to do it.
You basically have two distributions; the first one is the LLM, the second one is the shape generated by adding the random factor in the temperature.
This allows you to escape the “local maxima” encoded in the LLM distribution to find highly probable solutions that are outside the sample space of the “zero temperature”.
If you want a better analogy, look up at the night sky full of stars. Draw a circle in the sky; that’s the LLM distribution.
The result from a zero temperature will be the brightest point in that circle.
When you push the temperature up, you blur the sky randomly. Some points become brighter, some dimmer, but the radius of the circle increases.
If there is a very bright point outside the sample circle, 10x brighter than the brightest point inside it, then repeated random samples will find it again and again.
It makes perfect sense that an expanded probability distribution sampled repeatedly could find a “good average solution” if that solution is significantly better than the best “zero temp” solution.
This is the same reason we have 'temp' at all; by widening the solution space probability distribution, you can find better maxima. Turns out, sampling multiple times lets you have more chances to find better maxima.
This is more like "well that seems obviously like a good idea" than "somewhat counterintuitive"; it's just slow and expensive to do it.
You can also adjust the probability distribution by other existing methods, obviously. What's surprising here is not that it works but that it seems to work so well; probably (and I note they did not try this in their paper) multi-sample + voting on the output of other methods would also be highly effective.
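The multi-sample + vote loop itself is tiny. A sketch with a toy stand-in for the LLM call (the `flaky_model` function is invented for illustration, not from the paper):

```python
import random
from collections import Counter

def majority_answer(ask, question, n_samples=10):
    """Ask the same question n_samples times, fully independently,
    and return the most common answer (exact-match voting)."""
    answers = [ask(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for an LLM sampled at temperature > 0: correct 60%
# of the time, scattered wrong answers the rest of the time.
rng = random.Random(0)
def flaky_model(question):
    if rng.random() < 0.6:
        return "42"
    return rng.choice(["41", "40", "7"])

best = majority_answer(flaky_model, "What is 6 * 7?", n_samples=50)
```

Even though any single call is wrong 40% of the time, the wrong answers scatter while the right one concentrates, so the vote recovers it.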
Just from reading comments around, it feels intuitive to me that looking at a heatmap of a cascading pendulum would be more “accurate” than looking at just one snapshot, and also that the joints on the pendulums don’t necessarily need to be interlinked between iterations of the simulation.
> Which makes sense to me. If an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will similar and the hallucinations will hopefully be chaotic
I expect that to give you something close to the confidence of the underlying model to some specific claim, which is good, but I still expect legends (urban and cultural) to be high-ranked.
They'd be very human mistakes, but still mistakes.
I think the only way past that is to build a world model, look for contradictions, and then look for new evidence to resolve those contradictions.
It would be interesting to plug this into a Bayesian-optimization-like framework: find regions of language space where the models maximally disagree, then target those areas for extra training.
I had a very similar idea a few months ago. I wanted to use this approach to have the LLM provide the probability that the generated answer is correct. The probability would simply be the fraction of all generated answers that matched the selected one. (Each answer would be generated with a different seed, and the question would be single-choice.) The two issues I found were 1) the cost, and 2) that on some problems, LLMs are wrong more often than they are right.
Hopefully, as inference gets cheaper and of higher quality, someone will come up with a more feasible solution.
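The confidence estimate described above is just the vote fraction. A sketch, assuming single-choice answers so exact-match counting works:

```python
from collections import Counter

def answer_with_confidence(samples):
    """Return the modal answer plus the fraction of samples agreeing
    with it, as a rough self-consistency confidence score."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)

# e.g. five independent runs of a single-choice question:
ans, conf = answer_with_confidence(["B", "B", "A", "B", "C"])
```

Issue 2) above shows up here directly: if the model is consistently wrong, the vote fraction will be high on a wrong answer, so the score measures consistency, not correctness.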
Could multiple agents be used such that tokens emitted from LLM A are passed to B and the output of B is passed back to A, so that two agents generate an output in a simple round-robin way? Both would share context in this case. My computer isn't big enough to run two large models, but this could be tried on tiny models.
I realize that for more than two, highly specialised agents, this will require some intelligent way to route output to the relevant specialist agents only. It also means there must be some overlap between the agents.
That is what’s already been done under the term "multi-agent". This paper argues that there’s no need for any such message-passing or context sharing, you just literally run the same query several times on the same model, fully independently, and then pick a "typical" reply according to some similarity metric.
> I'm not sure people in these comments are reading this paper correctly.
> This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-thought and LLM-Debate.
I'm not sure you have read the paper at all. Chain of thought prompting is not a multi-agent algorithm. The paper says that it enhances existing methods such as prompt engineering (chain of thought) and multi-agent debate. The sampling method presented in the paper is orthogonal to those methods.
>Which makes sense to me. If an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will similar and the hallucinations will hopefully be chaotic
Not my experience. I had multiple LLMs hallucinate hard when asked the same question multiple times. The only way to break the cycle was to follow everything with questions demanding clarification: "are you sure?", "this is wrong, correct the answer".
I don't think this type of method can scale indefinitely, it's essentially just "better" sampling within dense areas of knowledge space. It cannot help with better exploration outside these dense areas, because these explorations won't have a consensus among agents almost by definition.
Finally. I've been saying that we need to stop focusing on a single agent getting everything right and instead layer agents for about 16 months now, but it's great to have a paper to point to.
If this were done at more granular increments of agent count, I'm curious how closely it would match those numbers.
I'd also really love to see the eventual follow-up where we see how much more performance can be obtained when the agents are each fine tuned towards slightly different aims. I'd expect there'd even be a performance lift from just having the agents each set at different temperature levels.
Very happy to see the research community starting to step in this direction!
A really good topic that ties in with this is the need for deterministic sampling (I may have the terminology a bit wrong) depending on what the model is intended for. The LLMWare team did a good 2-part video on this here as well (https://www.youtube.com/watch?v=7oMTGhSKuNY)
I think dedicated miniature LLMs are the way forward.
Disclaimer - Not affiliated with them in any way, just think it's a really cool project.
I have one personal niggle: I get annoyed when we end up lying to ourselves. Regarding the 101 section in video 1: people forgot this the day LLMs came out. I felt it was too generous with the benefit of the doubt.
This basic point was and remains constantly argued - with “Emergence” and anthropomorphization being the heart of the opposing argument.
We have tons of specialized components that work together cooperatively and competitively. There are multiple ways they connect. There also seem to be global processes that happen, like during sleep. There are over 3,000 cell types per the BRAIN initiative. Every brain forms on its own, taking shape like something out of a Transformers movie.
God’s design is mostly nothing like man’s neural networks. It’s far superior. Brains are also what’s creating all the artificial, neural nets on top of all math, tech, and economic systems that they run on. AI’s got a lot of catching up to do.
I think it's way more than 8 even. And it's common to have many working as supervisors, often at conflict with each other. And some act out the automatic trauma responses, as they're stuck in the past when the trauma occurred.
Et voilà, you have the script of Inside Out. \s
But honestly I do think this is how we operate. Depending on our state of metabolism and other psychological factors, the dominant version changes but as a whole we remain the sum total of all these versions.
Kind of. More like a mixture of a mixture of experts.
The problem is MoE on its own isn't able to use the context as a scratch pad for differentiated CoT trees.
So you have a mixture of token suggestions, but a singular chain of thought.
A mixture of both is probably going to perform better than just a mixture of the former, especially given everything we know by now regarding in context learning or the degree of transmission synthetic data is carrying.
This seems related to an interesting recent ACM ByteCast podcast episode with Edward Chang, an Adjunct Professor in the Department of Computer Science at Stanford University. [1] (Note there is a transcript if you don't want to listen.)
The approach he uses is to arrange for multiple LLMs to dialogue between each other about a discussion topic where the human acts as a moderator instead of the question/answer format that LLMs commonly take today. They find that the final answer that multiple LLMs come to in dialogue results in a huge improvement in both precision and accuracy for the same resources.
The paper says that it enhances existing methods such as prompt engineering (chain of thought) and LLM debate. This agent method is orthogonal to LLM debate.
In optimization problems, randomness can often get you out of local minima/maxima, and so averaging out a bunch of random search paths might get you better results in the worst case. Something similar might be happening here. The training set will be biased in various ways that might create weird local min/max points and so this process could avoid those weird kinks.
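A toy illustration of that mechanism, with a hand-made two-peak function standing in for the model's solution landscape (the function and numbers here are invented for illustration):

```python
import math
import random

def hill_climb(f, x0, steps=300, step=0.3, rng=random):
    """Greedy local search: accept a random move only if it improves f.
    On a multimodal f this gets stuck near whichever peak is closest."""
    x, y = x0, f(x0)
    for _ in range(steps):
        cand = x + rng.gauss(0.0, step)
        fc = f(cand)
        if fc > y:
            x, y = cand, fc
    return x, y

def best_of_restarts(f, starts, rng=random):
    """Run many independent randomized searches and keep the best,
    analogous to sampling an LLM repeatedly and keeping one answer."""
    return max((hill_climb(f, s, rng=rng) for s in starts),
               key=lambda r: r[1])

# Two peaks: a local max near x = -1 (height ~1) and the global max
# near x = 2 (height ~2). A single run from a bad start finds only
# the local peak; independent restarts find the global one.
f = lambda x: math.exp(-(x + 1) ** 2) + 2 * math.exp(-(x - 2) ** 2)

rng = random.Random(1)
x_best, y_best = best_of_restarts(f, starts=[-3, -1, 0, 1, 3], rng=rng)
```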
I built something like this in Haskell! I never benchmarked it, but I actually found it quite compelling. I would define each agent as a different "expert" in a subdomain of mathematics for example: proof theorist, abstract algebraic expert, etc.
I found it helpful, but the signal-to-noise ratio was low: lots of agents restating points, etc.
One frustration I've had with all this mixture-of-experts research:
Randomized Algorithms 101 - or basic stochastic reasoning - suggests that if the temperature parameter is > 0, querying an LLM N times and picking the majority result (perhaps with an N+1th query to the LLM) will generally result in better performance than asking it once and choosing that result.
It seems plausible to me that the gains can be further improved with a specialized mixture of different LLMs (which could then be run at temp = 0), or by finding better ways to break tasks into subtasks as this paper suggests. But AFAICT nobody has done anything to actually quantify these hypothetical gains versus the dumb randomized algorithm approach! In particular there might be voting strategies or mixtures - even specific models - where MoE/etc is strictly worse than naive repetition.
I am a concerned citizen w.r.t LLMs rather than a researcher, so I might be missing something. It just seems odd that LLM researchers forgot the first chapter of Motwani/Raghavan.
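The Randomized Algorithms 101 point is easy to make concrete: if each independent query is correct with probability p > 1/2 (binary answer case, independent errors), majority voting over n queries amplifies that probability toward 1:

```python
from math import comb

def majority_correct_prob(p, n):
    """Probability a strict majority of n independent queries is right,
    when each query is right with probability p (binary answer case)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

p1 = majority_correct_prob(0.6, 1)    # single query: just p = 0.6
p25 = majority_correct_prob(0.6, 25)  # majority of 25 is much better
```

This is the baseline the hypothetical gains of specialized mixtures would have to beat. It also cuts the other way: if p < 1/2 (the model is consistently wrong), voting amplifies the error.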
I'd assume that there's a difference between picking the best _token_ across an assortment of randomly selected tokens, versus picking the best _string_ of randomly-selected tokens.
Eyeballing the graphs, it seems that most of the gain is with 10 agents, a bit more with 20, and there are diminishing returns after that. Apparently, more agents isn't going to do it.
Is this not an incredibly expensive/unsustainable method? I agree with the sentiment that MoE is the way to go as the newer models will probably see diminishing returns. But the compute for a single prompt will suddenly increase 7-15 fold?
If GPT4 is 20x the price of GPT3.5, but it only takes 10x GPT3.5 runs to get similar quality of response (and likely faster), you'll still come out ahead.
I doubt that 10xGPT3.5 > GPT4. There are a lot of tasks that GPT4 can do and GPT3.5 just cannot. Also, in such cases I find that GPT3.5's hallucinations are quite consistent, so such a method is probably not gonna help.
Just reading the (current top) few comments, I whimsically wondered at the super business model of companies offering LLM services: a car service that won't get you from point A to B unless you hail it n times. A detergent that must be applied n times before clothes come out ("probably") clean.
If a company is offering "artificial intelligence" at a price, isn't it reasonable that you only pay for correct answers? If the company is offering a car service, shouldn't you only pay if they take you to your destination?
Agreed, and if it fails often enough, isn’t the bar at which a human or general-purpose, traditionally structured automation is going to be superior pretty low? This is how I think this bubble will pop. No doubt, LLMs are a breakthrough tool, but I’m sincerely skeptical of all but the most granular applications.
Perhaps the moral is that diffusing LLM agent accountability has the same failure model as the pre-existing human one.
Companies usually offer a service or a product. If the company doesn't deliver what was agreed upon, then the customer can demand correction. If a taxi driver takes a needlessly convoluted route, charges too much, or doesn't bring you to the destination, you can complain to the taxi company. If the laundry didn't work, you insist on doing it again.
However, many activities are inherently fraught with risk or uncertain results since there are always things outside of anyone's control. A lawyer can't promise you'll prevail in a court case, but they have to advocate your case to the best of their abilities. A doctor won't guarantee that you become healthy again. No taxi driver will guarantee that you will reach your destination in time, but they will bring you there. Atlassian won't guarantee you will meet a release deadline if you use their managed JIRA instance, but they will do their best to prevent data loss. And a company that basically sells access to a chatbot won't guarantee that it gives you correct results. Maybe availability guarantees.
I guess it's the difference between an ensemble and a mixture of experts, i.e. aggregating outputs from (a) model(s) trained on the same data vs different data (GPT-4). Though GPT-4 presumably does not aggregate, but it routes.
I understand the intention and the reference you're making. I bet the implementation of GPT-4 is probably something along those lines. However, spreading speculation in definitive language like that when the truth is unknown is dishonest, wouldn't you agree?
Sure, I could put it less definitively, but realistically, what else can it be? The transformer won't change much, and all of the models use it at the core. It's a closely guarded secret because it's easy to replicate.
I wonder how well a bunch of LLMs trained on personal computers, so fairly small, could perform together?
Train a LLM on your emails, train an LLM on a text book, download a bunch of arbitrary LLMs from the net you find interesting, throw them all together into a big pile, and use a moderator LLM that knows how to format their output into an assistant format.
So, the email LLM would try to autocomplete sentences from your emails, and the text book LLM would try to autocomplete sentences from the text book. People could offer LLMs to download, almost as a way of compressing information, download the LLM of your favorite programming language, and TV series, etc. The important part would be having a moderator algorithm that can shape these LLMs from dumb sentence autocompleters (barely more than a fancy Markov chain) into a coherent assistant format. For example, the text book LLM would just endlessly spew semi-random sentences from the text, but a good moderator algorithm could see that it has sufficiently answered the question and cut it off.
In short, it's interesting that separate LLMs can integrate with each other and strengthen each other and it makes me wonder if we could build modular LLMs.
Your idea inspired me to see what such a microstory based on your idea would look like (Of course generated by ChatGPT3.5):
> As I delved into my computer, eager to tackle my to-do list, I was met with an unexpected sight: a digital love triangle among the Language Models (LLMs). The Email LLM, with its quick wit, seemed to be engaging in flirtatious banter with the verbose Textbook LLM, while the Programming Language LLM watched on with amusement. I couldn't help but laugh at the absurdity of it all, but as the bickering between the LLMs intensified, I realized their antics were hindering my progress. With a mixture of frustration and amusement, I gently redirected the LLMs back to their intended purpose, finally able to accomplish my task amidst the chaotic comedy within my computer.
This is my go to method for pretty much every hard problem that I'm forced to solve where I don't have the domain expertise / interest / time. The trick lies in coming up with a clever similarity metric that incorporates penalties etc. You can even go a level deeper and use multiple similarity algorithms and then poll on top of them. Here's a taxonomy extractor for text that I made using similar principles that is surprisingly as good as anything else that I've seen - https://dash.scooptent.com/text
Seems like a pretty brute force approach of frankly just throwing more compute at the query (via semi-statistical means).
I'd be more interested in how to scale this via different agents. i.e. do we have say one type of agent that is specialized to produce ideas, while another is trained to evaluate ideas. Those sort of chains seem like they'd be powerful - if you can find a way to generalize it
Model ensemble is a classic method. Deep learning is always rediscovering and reinventing the classics. I believe that many people have used this method before, but they just haven't published a paper on it.
What's interesting is that each run of the model tends to converge to a different "local maximum" in the solution space, and some of these local maxima correspond to better performance than others. By running the model multiple times, we increase the chances of finding a higher-quality local maximum or even the absolute best solution.
This got me thinking: why is this ensembling step implemented as a higher-level abstraction on top of the base LLM, rather than being built directly into the neural network architecture and training process itself?
Well, you’re right that LLM tooling is totally inadequate. At least we already have beam search. But the more boring answer (and why beam search is also uncommon) is that running the query multiple times is more expensive.
How can you have majority voting based on text produced by LLMs? Won’t every answer string be essentially distinct with extremely high probability if it’s over a couple dozen bits of output (which any useful LLM output would be, right?)?
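Right, exact matches are rare, so voting has to go through a similarity metric: score each sample by its total similarity to the others and return the most "central" one. A crude sketch, with difflib standing in for whatever embedding-based metric a real system would use:

```python
from difflib import SequenceMatcher

def most_central(answers):
    """Return the answer most similar, in aggregate, to all the
    others. SequenceMatcher.ratio() is a cheap string-similarity
    stand-in for an embedding similarity."""
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    scores = [sum(sim(a, b) for j, b in enumerate(answers) if j != i)
              for i, a in enumerate(answers)]
    best_i = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_i]

# Distinct strings, but three of them cluster around one answer:
samples = ["Canberra", "canberra", "Canberra.", "Sydney"]
winner = most_central(samples)
```

The answers never match exactly, but the near-duplicates reinforce each other's scores while the outlier gets little support.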
This is a cool paper showing there is value in using an LLM multiple times, but in recent research we showed that with majority voting, quality can decrease past some point as you make more calls. Check out https://arxiv.org/pdf/2403.02419.pdf. It raises the natural question of how to design the best inference algorithm given an LLM you can call multiple times.
The premise of this work seems very interesting... But I wonder how practical it is from both a cost and time perspective. I am toying around with an AI Agents library and one of the annoying UX things I notice is the time it takes to get my answers, because each call to an agent (either GPT-4 or Claude 3) is kinda slow.
Besides the time, it feels quite wasteful token wise.
I'm skeptical this approach will be adopted by many in the AI Agent space, but of course I could be very wrong.
Reading the paper, the thought occurs that IF the likelihood of a correct response increases with the number of agents employed AND this involves applying a function (whatever) to select the 'best' from the possible answers, doesn't that imply that the LLM has insufficient dimensions?
In other words, I am wondering if LLM hallucinations [sic] are in fact symptomatic of 'conflation' which could itself be the result of insufficient dimensions.
This is quite interesting because I've specifically tried this kind of basic ensembling for my NYT Connections benchmark and it didn't work. This is something everybody would try first before more complicated multi-step prompting, and yet since ChatGPT 3.5 I'm not aware of any papers showing that it works. It will be interesting to reproduce this result and learn more about how they set it up to make it work.
I remember hearing that Beam Search doesn't work well for LLMs, because it leads to repetitive, generic output.
The majority vote sampling technique in this paper sounds like it'd give similar output to Beam Search, because it's sampling sequences of tokens from a joint distribution. So why doesn't it give repetitive output like Beam Search does? What am I missing?
Having given this problem a great deal of thought, I have developed a strong intuition around this. I believe not only is AGI feasible, it is already doable.
For example, several hundred GPT-4 based agents specializing in different skill sets should be able to collaboratively solve many problems. Their ability to work on so many facets of the same problem will make them very effective against multidisciplinary problems.
What’s the catch? Well, the back and forth has to play out in a serial order, so it cannot be parallelized. At today’s abysmal inference speeds, it may take this AGI many times longer than a trained human. Now imagine the effectiveness of this method when we can speed up inference to several hundred times a minute. Now AGI suddenly becomes way more efficient than a human.
Does it take less compute to train N agents vs one large model? Seem like a big win. Can the majority of the training be done independently or in distributed fashion?
this study's got me worried. it goes against the hope that AGI won't just sneak onto the internet and do its own thing. with more work on making LLMs bigger and showing they can do more, the thought that this could actually happen in the future gets real scary, especially since i can run smaller versions on my own pc now.
This trend of "All You Need" in paper titles needs to die. The original "Attention is All You Need" used that title because it is literally true in their case. So many papers just use it as a meme now, and it distracts from the true insight of the paper.
it's uncreative/tired but at least to the point. Too many papers are confused / opaque agglomerations of a year's worth of research shoehorned into a paper. At least with these you can fairly easily assess whether the claim is supported or not.
"A meme is an idea, behavior, or style that spreads by means of imitation from person to person within a culture and often carries symbolic meaning representing a particular phenomenon or theme."
Paper authors use "All You Need" to allude to the well-known transformer paper, even if their proposed technique is not in fact all you need.
My understanding is that a meme spreads through memetics similar to how a gene spreads through genetics. Ideas will spread that are fit for reproduction, i.e. communication.
Oftentimes memes are powerful at reproducing, but in science we don't want ideas that are merely likely to spread. We want ideas that are truthful.
How about swarms of autonomous agents, such as AutoGPT, maybe thousands per human eventually, amassing karma points on all forums, including this one?
I can see in a few years each human being surrounded by a ton of LLM agents, "shepherding" their views, downvoting their messages or distracting them with argumentative conversations if they don't conform, and facilitating reputational attacks on scale on all the people whose speech is recognized as being contrary to what's desired.
Of course, there wouldn't be just one group deploying these swarms. It would be lots of different groups, akin to slaughterbots video: https://www.youtube.com/watch?v=O-2tpwW0kmU
The difference is that there wouldn't be physical violence, it would just gradually turn the entire Internet into a dark forest.
Averaging LLM outputs will tend to produce a final output with a lot of words and no substance. And averaging bad data doesn't always lead to better results. Garbage in, garbage out: averaging cannot magically transform flawed inputs into accurate outputs.