Hacker News

What is 10 + 10?

Answer: 0 + 10 = 10 + 10 = 10 + 10 = 10 + 10 = 10 + 10 =

This seems like a waste of compute and time.



Throwing math problems at an LLM just shows your level of understanding of the basics of LLMs. They're not trained to solve straight math calculations. You could probably train one to be, I suppose, but the ones being released today are not.

You could instead ask it how to calculate something, and it could give you accurate instructions for how to achieve that. Then you either perform the calculation yourself, or use something like ChatGPT, which has a built-in Python evaluator, so it can perform the calculation.

Quick example: https://chat.openai.com/share/9f76f5e5-d933-48fb-99e8-4a6530...
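The idea in the comment above — let the model write the expression and let Python, not the model, do the arithmetic — can be sketched in a few lines. This is a hypothetical illustration; the `expression` string stands in for a model reply, not actual ChatGPT output:

```python
# The LLM's job: turn "what is 10 + 10?" into a Python expression.
expression = "10 + 10"  # hypothetical model reply

# The evaluator's job: compute it, outside the model.
# (Restricting builtins so only arithmetic is possible.)
result = eval(expression, {"__builtins__": {}})
print(result)  # → 20
```

This split is why "transform this into code" works far more reliably than asking the model for the number directly.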


Or combine it with something like llama.cpp's grammar support or Microsoft's guidance-ai[0] (which I prefer), which would allow adding some ReAct-style prompting and external tools. As others have mentioned, instruct tuning would help too.

[0] https://github.com/guidance-ai/guidance


You've actually shown your poor understanding of LLMs. I just asked Llama-2 7b the same question and it answered perfectly fine. It did not need to use an external python interpreter or a function call, or need to be prompted with chain of thought reasoning.

You're correct that LLMs are not (usually) explicitly trained to solve math calculations, but this does not mean they cannot solve basic math equations (they can!).


LLMs don't solve basic math equations. They can pattern match on some aspects, but it's not calculation. Try random numbers for the sum and you'll find examples where it fails. Especially with longer numbers with repeating digits.


>They can pattern match on some aspects, but it's not calculation

Oh? So what is it then? Magic? When you give GPT-4 random multi-digit arithmetic that would not have appeared in its dataset and it's more accurate than you can manage without a calculator, what is that?

"Pattern matching" has really lost all meaning.


Sums can be solved at a character level (the carry per digit is at most 1, so that's doable). Anything more complicated, solved at a generic level, would require the network to either model the operation itself or work recursively. So you either get imprecise floating-point math with noise from the model itself, or a recursion limit that works mostly on integers.

Basically, if you can write the list of operations and base it on digits, the network can learn to replicate that. Actual calculation on values - not really. You can tell which mode is used by using large random numbers and seeing what kind of error happens. Missed digits - it's not actually doing the calculation.
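The digit-level procedure described above is easy to write out explicitly. This is a hypothetical illustration of what "a list of operations based on digits" means — per-digit adds with a bounded carry — not a claim about what any LLM actually executes:

```python
def add_by_digits(a: str, b: str) -> str:
    """Add two non-negative integers as a sequence of per-digit
    operations with a carry, rather than one arithmetic operation."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad to equal length
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry  # at most 9 + 9 + 1 = 19
        digits.append(str(total % 10))
        carry = total // 10                # carry is always 0 or 1
    if carry:
        digits.append("1")
    return "".join(reversed(digits))

print(add_by_digits("10", "10"))  # → 20
```

Every step here is a lookup on single digits plus a one-bit carry, which is the kind of pattern a network can plausibly replicate; nothing requires operating on the full numeric values.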

Also, we don't know how much GPT-4 cheats at math. OpenAI may be replacing those bits with external calculation.


>Actual calculation on values - not really.

Uh yes really. https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mec...

>You can tell which mode is used by using large random numbers and seeing what kind of error happens. Missed digits - it's not actually doing the calculation.

Missed digits doesn't mean anything other than the wrong calculation. The assertion that it isn't doing calculation is absurd.

>Also, we don't know how much gpt4 cheats at math. OpenAI may be replacing those bits with external calculation

Pick one. Either it's so bad it's obviously not doing any calculations at all, or it's so good you suspect OpenAI is passing numbers through a calculator behind the scenes. Not only is the latter baseless speculation, the two options are mutually exclusive.


> Uh yes really.

This is based on integer addition modulo 113, with a dataset that trains only that task. Yes, you can achieve that in a special case. No, general-purpose LLMs don't achieve that.

Conclusion from the post you linked: "Epistemic status: I feel confident in the empirical results, but the generalisation to non-toy settings is more speculative"

> Pick one. Either it's so bad it's obviously not doing any calculations at all or it's so good you suspect Open AI are passing numbers through a calculator behind the scenes.

That's not the claim. I'm saying LLMs don't do real, precise calculation, but can do digit-level operations, and some implementations of systems around an LLM could pass the problem through a calculator. LLMs can rewrite the problem into a function, but the execution of it doesn't happen inside the LLM.

If you squint, it's like JSON output: LLMs can kind of do it most of the time, but you can implement a system around the LLM which ensures that any JSON output will be valid.
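That JSON analogy can be made concrete. A minimal sketch of such a wrapper, assuming a hypothetical `generate` function standing in for the actual LLM call: parse the reply, and re-ask on failure, so callers only ever see valid JSON.

```python
import json

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    return '{"answer": 20}'

def json_output(prompt: str, retries: int = 3) -> dict:
    """Wrap an LLM so the system's output is always valid JSON:
    parse each reply and re-prompt when parsing fails."""
    for _ in range(retries):
        reply = generate(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            prompt += "\nReply with valid JSON only."
    raise ValueError("no valid JSON after retries")

print(json_output("What is 10 + 10? Answer as JSON."))
```

The guarantee lives in the wrapper, not the model — the same shape of argument the parent is making about arithmetic.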


I chose a simple math equation that should almost assuredly be in a dataset of 1 trillion tokens exactly to check its basic pattern-matching skills. The tokens "10" and "20" are even in its hardcoded vocabulary. I think GPT-4 might be exceeding HN discourse on ML at this point.


I guess we have a disagreement about what "solve basic math equations" actually means.


The linked page claims that LiteLlama scored a zero on the GSM8K benchmark, so let's just say math probably isn't its forte.


Pretty sure your computer can calculate 10+10 without an LLM.


It's a test case.


It's a bad test case. You should know why you're choosing a given test. I hope people will understand why "solve this equation" doesn't work with LLMs and why "transform this into python code" does much better.


Why can't it transform 10 + 10 into 20 (equivalency) if it can do so with code?


Addition is a different problem for LLMs than generating code, since the latter is easier to memorize.


Why not both? They're distinct enough, and there's no such thing as too many distinct test cases.


It just needs some prompting. Give it the following:

    Q: 1+1
    A: 2
    
    Q: 1+2
    A: 3
    
    Q: 3+2
    A: 5
    
    Q: 9+8
    A: 17
    
    Q: 10+2
    A: 12
    
    Q: 10+10
and it spits out

    A: 20
    
    Q: 20+10
    A: 30
Not too bad.


The model may not be fine-tuned for instruct/chat.



