Throwing math problems at an LLM just shows your level of understanding of the basics of LLMs. They're not trained to solve straight math calculations. I guess you could train one to be, but the ones being released today are not.
You could instead ask it how to calculate something, and it could give you accurate instructions for how to achieve that. Then you either perform the calculation yourself, or use something like ChatGPT, which has a built-in Python evaluator, so it can perform the calculation.
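Something like this, as a rough sketch - ask_llm is a hypothetical stand-in for whatever chat API you're calling, stubbed here so the sketch runs:

    # Hypothetical model call, stubbed with a plausible reply.
    def ask_llm(prompt: str) -> str:
        return "2348 * 0.17"

    prompt = ("Write a single Python expression that computes "
              "17% of 2348. Reply with only the expression.")
    expression = ask_llm(prompt)
    # The Python interpreter does the arithmetic, not the model.
    result = eval(expression, {"__builtins__": {}})
    print(result)  # ~399.16 (float arithmetic)

(eval on model output is fine for a sketch; a real system would sandbox it.)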
Or combine it with something like llama.cpp's grammar support or Microsoft's guidance-ai[0] (which I prefer), which would allow adding ReAct-style prompting and external tools. As others have mentioned, instruct tuning would help too.
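The shape of it, without assuming any particular library's API (the tool dispatch and stop handling are the parts grammars/guidance make less fragile) - ask_llm is again a hypothetical stub:

    import re

    def ask_llm(transcript: str) -> str:
        # Hypothetical ReAct-tuned model: emits an Action until it has
        # an Observation to work with, then a Final Answer. Stubbed.
        if "Observation:" in transcript:
            return "Final Answer: 306614"
        return "Action: calc[847 * 362]"

    def calc(expr: str) -> str:
        return str(eval(expr, {"__builtins__": {}}))

    transcript = "Question: what is 847 * 362?\n"
    for _ in range(5):  # cap the think/act loop
        step = ask_llm(transcript)
        transcript += step + "\n"
        m = re.match(r"Action: calc\[(.+)\]", step)
        if m:  # route the tool call to real code, feed the result back
            transcript += "Observation: " + calc(m.group(1)) + "\n"
        else:
            break  # model produced a final answer
    print(transcript)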
You've actually shown your own poor understanding of LLMs. I just asked Llama-2 7B the same question and it answered perfectly fine. It did not need an external Python interpreter or a function call, and it did not need to be prompted with chain-of-thought reasoning.
You're correct that LLMs are not (usually) explicitly trained to solve math calculations, but this does not mean they cannot solve basic math equations (they can!).
LLMs don't solve basic math equations. They can pattern match on some aspects, but it's not calculation. Try random numbers for the sum and you'll find examples where it fails, especially with longer numbers with repeating digits.
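A throwaway harness for that experiment, if anyone wants to try it - `ask` is whatever function calls your model:

    import random

    def fuzz_addition(ask, trials=100, digits=15):
        # Probe a model with random long sums; return the misses.
        fails = []
        for _ in range(trials):
            a = random.randint(10**(digits - 1), 10**digits - 1)
            b = random.randint(10**(digits - 1), 10**digits - 1)
            reply = ask(f"What is {a} + {b}? Answer with only the number.")
            if reply.strip() != str(a + b):
                fails.append((a, b, reply))
        return fails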
>They can pattern match on some aspects, but it's not calculation
Oh? So what is it then? Magic?
When you give GPT-4 random multi-digit arithmetic that would not have appeared in its dataset and it's more accurate than you can manage without a calculator, what is that?
Sums can be solved at the digit level (the carry per column is at most 1, so that's doable). Anything more complicated, solved at a generic level, would require the network to either model the operation itself or work recursively. So you either get imprecise floating-point math with noise from the model itself, or a recursion limit that works mostly on integers.
Basically, if you can write the list of operations and base it on digits, the network can learn to replicate that. Actual calculation on values - not really. You can tell which mode is used by feeding it large random numbers and seeing what kind of error happens. Missed digits - it's not actually doing the calculation.
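To make that concrete, the digit-level procedure looks like this - a fixed per-column rule a network can represent, as opposed to math on whole values:

    def add_digitwise(a: str, b: str) -> str:
        # Column addition: digit sum plus a carry that is at most 1.
        a, b = a.zfill(len(b)), b.zfill(len(a))
        out, carry = [], 0
        for x, y in zip(reversed(a), reversed(b)):
            s = int(x) + int(y) + carry
            out.append(str(s % 10))
            carry = s // 10  # always 0 or 1 for addition
        if carry:
            out.append("1")
        return "".join(reversed(out))

    print(add_digitwise("9999999", "1"))  # 10000000

Multiplication has no equally local per-column rule, which is where the modelling-the-operation/recursion problem above comes in.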
Also, we don't know how much GPT-4 cheats at math. OpenAI may be replacing those bits with external calculation.
>You can tell which mode is used by feeding it large random numbers and seeing what kind of error happens. Missed digits - it's not actually doing the calculation.
Missed digits don't mean anything other than a wrong calculation. The assertion that it isn't doing calculation is absurd.
>Also, we don't know how much GPT-4 cheats at math. OpenAI may be replacing those bits with external calculation
Pick one. Either it's so bad it's obviously not doing any calculations at all, or it's so good you suspect OpenAI is passing numbers through a calculator behind the scenes. Not only is the latter baseless speculation, the two options are mutually exclusive.
That post is based on integer addition up to 113, with a dataset that trains only that task. Yes, you can achieve that in a special case. No, general-purpose LLMs don't achieve that.
Conclusion from the post you linked: "Epistemic status: I feel confident in the empirical results, but the generalisation to non-toy settings is more speculative"
> Pick one. Either it's so bad it's obviously not doing any calculations at all, or it's so good you suspect OpenAI is passing numbers through a calculator behind the scenes.
That's not the claim. I'm saying LLMs don't do real, precise calculation, but they can do digit-level operations, and some systems built around an LLM could pass the problem through a calculator. LLMs can rewrite the problem into a function, but the execution doesn't happen inside the LLM.
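i.e. the division of labour looks like this (the model call is a hypothetical stub; the point is where exec happens):

    def ask_llm(prompt: str) -> str:
        # Hypothetical model call, stubbed with a plausible reply.
        return "def solve():\n    return 3141592 * 2718281"

    code = ask_llm("Write a Python function solve() that returns 3141592 * 2718281.")
    namespace = {}
    exec(code, namespace)  # execution happens out here, not inside the LLM
    print(namespace["solve"]())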
If you squint, it's like JSON output - LLMs can kind of do it most of the time, but you can implement a system around the LLM which ensures that any JSON output will be valid.
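The crude version is just validate-and-retry (llama.cpp's grammars do it properly, constraining decoding token by token so invalid JSON can never be emitted):

    import json

    def json_from_llm(ask, prompt, retries=3):
        # ask: prompt -> raw model text. Callers only ever see parsed JSON.
        for _ in range(retries):
            raw = ask(prompt)
            try:
                return json.loads(raw)  # accept only if it parses
            except json.JSONDecodeError as e:
                prompt += f"\nYour last reply was invalid JSON ({e}). Reply with valid JSON only."
        raise ValueError("model never produced valid JSON")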
I chose a simple math equation that should almost assuredly be in a dataset of 1 trillion tokens, exactly to check its basic pattern-matching skills. The tokens "10" and "20" are even in its hardcoded vocabulary. I think GPT-4 might be exceeding HN discourse on ML at this point.
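Easy enough to check, assuming GPT-4's cl100k_base tokenizer:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer
    print(enc.encode("10"))  # a single token id
    print(enc.encode("20"))  # likewise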
It's a bad test case. You should know why you're choosing a given test. I hope people will come to understand why "solve this equation" doesn't work with LLMs and why "transform this into Python code" works much better.
    Answer: 0 + 10 = 10 + 10 = 10 + 10 = 10 + 10 = 10 + 10 =
This seems like a waste of compute and time.