Throwing math problems at an LLM just shows your level of understanding of the basics of LLMs. They're not trained to solve straight math calculations. I guess you could train one to be, but the ones being released today are not.
You could instead ask it how to calculate something, and it could give you accurate instructions for how to achieve that. Then you either perform the calculation yourself, or use something like ChatGPT, which has a built-in Python evaluator, so it can perform the calculation.
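Something like this, as a rough sketch - ask_llm is a hypothetical stand-in for whatever chat API you're calling, stubbed here so the sketch runs:

    # Hypothetical model call, stubbed with a plausible reply.
    def ask_llm(prompt: str) -> str:
        return "2348 * 0.17"

    prompt = ("Write a single Python expression that computes "
              "17% of 2348. Reply with only the expression.")
    expression = ask_llm(prompt)
    # The Python interpreter does the arithmetic, not the model.
    result = eval(expression, {"__builtins__": {}})
    print(result)  # ~399.16 (float arithmetic)

(eval on model output is fine for a sketch; a real system would sandbox it.)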
Or combine it with something like llama.cpp's grammar support or Microsoft's guidance-ai[0] (which I prefer), which would allow adding ReAct-style prompting and external tools. As others have mentioned, instruct tuning would help too.
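The shape of it, without assuming any particular library's API (the tool dispatch and stop handling are the parts grammars/guidance make less fragile) - ask_llm is again a hypothetical stub:

    import re

    def ask_llm(transcript: str) -> str:
        # Hypothetical ReAct-tuned model: emits an Action until it has
        # an Observation to work with, then a Final Answer. Stubbed.
        if "Observation:" in transcript:
            return "Final Answer: 306614"
        return "Action: calc[847 * 362]"

    def calc(expr: str) -> str:
        return str(eval(expr, {"__builtins__": {}}))

    transcript = "Question: what is 847 * 362?\n"
    for _ in range(5):  # cap the think/act loop
        step = ask_llm(transcript)
        transcript += step + "\n"
        m = re.match(r"Action: calc\[(.+)\]", step)
        if m:  # route the tool call to real code, feed the result back
            transcript += "Observation: " + calc(m.group(1)) + "\n"
        else:
            break  # model produced a final answer
    print(transcript)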
You've actually shown your own poor understanding of LLMs. I just asked Llama-2 7B the same question and it answered perfectly fine. It did not need an external Python interpreter or a function call, and it did not need to be prompted with chain-of-thought reasoning.
You're correct that LLMs are not (usually) explicitly trained to solve math calculations, but this does not mean they cannot solve basic math equations (they can!).
LLMs don't solve basic math equations. They can pattern match on some aspects, but it's not calculation. Try random numbers for the sum and you'll find examples where it fails, especially with longer numbers with repeating digits.
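A throwaway harness for that experiment, if anyone wants to try it - `ask` is whatever function calls your model:

    import random

    def fuzz_addition(ask, trials=100, digits=15):
        # Probe a model with random long sums; return the misses.
        fails = []
        for _ in range(trials):
            a = random.randint(10**(digits - 1), 10**digits - 1)
            b = random.randint(10**(digits - 1), 10**digits - 1)
            reply = ask(f"What is {a} + {b}? Answer with only the number.")
            if reply.strip() != str(a + b):
                fails.append((a, b, reply))
        return fails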
>They can pattern match on some aspects, but it's not calculation
Oh? So what is it then? Magic?
When you give GPT-4 random multi-digit arithmetic that would not have appeared in its dataset and it's more accurate than you can manage without a calculator, what is that?
Sums can be solved at the digit level (the carry per column is at most 1, so that's doable). Anything more complicated, solved at a generic level, would require the network to either model the operation itself or work recursively. So you either get imprecise floating-point math with noise from the model itself, or a recursion limit that works mostly on integers.
Basically, if you can write the list of operations and base it on digits, the network can learn to replicate that. Actual calculation on values - not really. You can tell which mode is used by feeding it large random numbers and seeing what kind of error happens. Missed digits - it's not actually doing the calculation.
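To make that concrete, the digit-level procedure looks like this - a fixed per-column rule a network can represent, as opposed to math on whole values:

    def add_digitwise(a: str, b: str) -> str:
        # Column addition: digit sum plus a carry that is at most 1.
        a, b = a.zfill(len(b)), b.zfill(len(a))
        out, carry = [], 0
        for x, y in zip(reversed(a), reversed(b)):
            s = int(x) + int(y) + carry
            out.append(str(s % 10))
            carry = s // 10  # always 0 or 1 for addition
        if carry:
            out.append("1")
        return "".join(reversed(out))

    print(add_digitwise("9999999", "1"))  # 10000000

Multiplication has no equally local per-column rule, which is where the modelling-the-operation/recursion problem above comes in.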
Also, we don't know how much GPT-4 cheats at math. OpenAI may be replacing those bits with external calculation.
>You can tell which mode is used by feeding it large random numbers and seeing what kind of error happens. Missed digits - it's not actually doing the calculation.
Missed digits don't mean anything other than a wrong calculation. The assertion that it isn't doing calculation is absurd.
>Also, we don't know how much GPT-4 cheats at math. OpenAI may be replacing those bits with external calculation
Pick one. Either it's so bad it's obviously not doing any calculations at all, or it's so good you suspect OpenAI is passing numbers through a calculator behind the scenes. Not only is the latter baseless speculation, the two options are mutually exclusive.
That post is based on integer addition up to 113, with a dataset that trains only that task. Yes, you can achieve that in a special case. No, general-purpose LLMs don't achieve that.
Conclusion from the post you linked: "Epistemic status: I feel confident in the empirical results, but the generalisation to non-toy settings is more speculative"
> Pick one. Either it's so bad it's obviously not doing any calculations at all, or it's so good you suspect OpenAI is passing numbers through a calculator behind the scenes.
That's not the claim. I'm saying LLMs don't do real, precise calculation, but they can do digit-level operations, and some systems built around an LLM could pass the problem through a calculator. LLMs can rewrite the problem into a function, but the execution doesn't happen inside the LLM.
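i.e. the division of labour looks like this (the model call is a hypothetical stub; the point is where exec happens):

    def ask_llm(prompt: str) -> str:
        # Hypothetical model call, stubbed with a plausible reply.
        return "def solve():\n    return 3141592 * 2718281"

    code = ask_llm("Write a Python function solve() that returns 3141592 * 2718281.")
    namespace = {}
    exec(code, namespace)  # execution happens out here, not inside the LLM
    print(namespace["solve"]())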
If you squint, it's like JSON output - LLMs can kind of do it most of the time, but you can implement a system around the LLM which ensures that any JSON output will be valid.
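The crude version is just validate-and-retry (llama.cpp's grammars do it properly, constraining decoding token by token so invalid JSON can never be emitted):

    import json

    def json_from_llm(ask, prompt, retries=3):
        # ask: prompt -> raw model text. Callers only ever see parsed JSON.
        for _ in range(retries):
            raw = ask(prompt)
            try:
                return json.loads(raw)  # accept only if it parses
            except json.JSONDecodeError as e:
                prompt += f"\nYour last reply was invalid JSON ({e}). Reply with valid JSON only."
        raise ValueError("model never produced valid JSON")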
I chose a simple math equation that should almost assuredly be in a dataset of 1 trillion tokens, exactly to check its basic pattern-matching skills. The tokens "10" and "20" are even in its hardcoded vocabulary. I think GPT-4 might be exceeding HN discourse on ML at this point.
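Easy enough to check, assuming GPT-4's cl100k_base tokenizer:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's tokenizer
    print(enc.encode("10"))  # a single token id
    print(enc.encode("20"))  # likewise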
It's a bad test case. You should know why you're choosing a given test. I hope people will come to understand why "solve this equation" doesn't work with LLMs and why "transform this into Python code" works much better.
    Answer: 0 + 10 = 10 + 10 = 10 + 10 = 10 + 10 = 10 + 10 =
This seems like a waste of compute and time.