You can see the various quantizations here, both for the 2B model and the 7B model. The smallest you can go is the q2_K quantization of the 2B model, which is 1.3GB, but I wouldn't really call that "functional". The q4_0 quantization is 1.7GB, and that would probably be functional.
The size of anything but the model is going to be rounding error compared to how large the models are, in this context.
What's the use case of models this small? Can you use the "knowledge" encoded in them and ask them questions and get relevant answers, or are they used as text processors to summarize documents etc?
Gemma 2B generation quality is excellent in my own very-biased opinion. I asked it to write a response to your comment:
--
Large language models (LLMs) have achieved significant progress in recent years, with models like GPT-3 and LaMDA demonstrating remarkable abilities in various tasks such as language generation, translation, and question answering.
However, 2b parameter models are a much smaller and simpler type of LLM compared to GPT-3. While they are still capable of impressive performance, they have a limited capacity for knowledge representation and reasoning.
Despite their size, 2b parameter models can be useful in certain scenarios where the specific knowledge encoded in the model is relevant to the task at hand. For example:
- Question answering: 2b parameter models can be used to answer questions by leveraging their ability to generate text that is similar to the question.
- Text summarization: 2b parameter models can be used to generate concise summaries of documents by extracting the most important information.
- Code generation: While not as common, 2b parameter models can be used to generate code snippets based on the knowledge they have learned.
Overall, 2b parameter models are a valuable tool for tasks that require specific knowledge or reasoning capabilities. However, for tasks that involve general language understanding and information retrieval, larger LLMs like GPT-3 may be more suitable.
--
Generated in under 1s from query to full response on together.ai
I'm not sure what you mean that it "forgot" about POST? Even as an experienced Go developer, I looked at the code and thought it would probably work for both GET and POST. I couldn't easily see a problem, yet I had not forgotten about POST being part of the request. It's just not an obvious problem. This is absolutely what I would classify as a "brain teaser". It's a type of problem that makes an interviewer feel clever, but it's not great for actually evaluating candidates.
Only on running the code did I realize that it wasn't doing anything to handle the request body problem: the request works on the first attempt, but the ReadCloser is empty on subsequent attempts. It looks like Phind-70B corrected this issue once it was pointed out.
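For anyone curious, the fix is roughly the following pattern. This is just a sketch with names of my own (retryTransport, maxRetries), not the exact code from the interview question: req.Body is a one-shot ReadCloser, so a retrying RoundTripper has to ask req.GetBody for a fresh copy before every repeat attempt, which is exactly the detail that's easy to miss for POST.

    package retrydemo

    import (
        "io"
        "net/http"
    )

    // retryTransport retries failed requests; the type and field names are
    // hypothetical, just for this sketch.
    type retryTransport struct {
        next       http.RoundTripper
        maxRetries int
    }

    func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
        var resp *http.Response
        var err error
        for attempt := 0; attempt <= t.maxRetries; attempt++ {
            attemptReq := req.Clone(req.Context())
            if attempt > 0 && req.GetBody != nil {
                // The original req.Body was consumed by the first attempt,
                // so fetch a fresh copy for the retry.
                body, gerr := req.GetBody()
                if gerr != nil {
                    return nil, gerr
                }
                attemptReq.Body = body
            }
            resp, err = t.next.RoundTrip(attemptReq)
            if err == nil && resp.StatusCode < 500 {
                return resp, nil
            }
            // If there is a body we have no way to replay, don't retry:
            // a second attempt would silently send an empty POST.
            if req.Body != nil && req.GetBody == nil {
                return resp, err
            }
            if attempt < t.maxRetries && resp != nil {
                // Drain and close the failed response so its connection can be reused.
                io.Copy(io.Discard, resp.Body)
                resp.Body.Close()
            }
        }
        return resp, err
    }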
I've seen GPT-4 make plenty of small mistakes when generating code, so being iterative seems normal, even if GPT-4 might have this one specific brain teaser completely memorized.
I am not at the point where I expect any LLM to blindly generate perfect code every time, but if it can usually correct issues with feedback from an error message, then that's still quite good.
This isn't a brain teaser at all. It's a direct test of domain knowledge/experience.
There are countless well-documented RoundTripper implementations that handle this case correctly.
This is the sort of thing you whip up in three minutes and move along. To me it seems like a perfect test of LLMs. I don't need an injection of something that's worse than stackoverflow polluting the code I work on.
You seem to be quoting ChatGPT, and you're not even specifying that it's ChatGPT-4, so I automatically assume ChatGPT-3.5, which hallucinates at an astonishing rate. Regardless, all current LLMs can hallucinate. As ChatGPT's disclaimer says, "ChatGPT can make mistakes. Consider checking important information."
Quoting the relevant statement: "American merchants pay, on average, 1.76% in interchange fees – compared to a 0.96% average in most European nations."
Most of the other sources I'm seeing are quoting even lower numbers. Transactions do seem to have less overhead in Europe than North America.
"Hallucination" is a widely understood and accepted term in the LLM industry. If you want to change that, replying to my comment doesn't seem to be the most effective place to start. I don't know that it's the best term, but it seems better than re-explaining the issue in detail every time.
"Incorrect output" is broader, encompassing other failure modes. If you ask an LLM to respond in JSON with a list of common foods, and it instead writes a paragraph of text that contains a list of common foods, then that would not qualify as a "hallucination" by my understanding of the accepted definition, but it is still "incorrect output".
"Incorrect output" is still correct to describe the cited behavior. "Hallucination" isn't. If a human behaves the same way when asked a question, we do not say he's hallucinating.
One way to devalue incorrect terminology is to call it out when you see it, and use something accurate instead. That's how people learn.
“If your business wants to accept credit cards, you’ll need to pay a fee. Interchange makes up the bulk of that cost which merchants pay, roughly 75% of it.”
It makes sense that people are focusing on the interchange fees.
I have not found a single source that indicates European merchants have fees that are comparable in scale to their North American counterparts. If you want to make that argument, you should find an actual source.
> Do North American consumers/merchants pay less in processing fees overall than European/Asian counterparts or is it roughly 1-3% all over the world?
The answer seems to be the opposite. North American merchants pay more.
And their locations will be remembered (by ID) if an app wants to offer persistent experiences at multiple locations. Not so sure about windows on visionOS.
“I’ve used it for hours at a time without any discomfort, but fatigue does set in, from the weight alone. You never forget that you’re wearing it.”
“In terms of resolution, Vision Pro is astonishing. I do not see pixels, ever. I see text as crisply as I do in real life. It’s very comfortable to read.”
I’ve been using AirPods Max for years now, which many people complain about being “heavy”, but I don’t mind them. I expect the weight will be fine, but it is obviously there, and like exercising any new muscle, it might take some time to get used to.
The great thing about gguf is that it will cross to system RAM if there isn't enough VRAM. It will be slower, but waiting a couple minutes for a prompt response isn't the worst thing if you are the type that would get use out of a local 70b parameter model. Then again, one could have grabbed 2x 3090s for the price of a 4090 and ended up with 48gb of VRAM in exchange for a very tolerable performance hit.
> The great thing about gguf is that it will cross to system RAM if there isn't enough VRAM.
No… that’s not such a great thing. Helpful in a pinch, but if you’re not running at least 70% of your layers on the GPU, then you barely get any benefit from the GPU in my experience. The vast gulf in performance between the CPU and GPU means that the GPU is just spinning its wheels waiting on the CPU. Running half of a model on the GPU is not useful.
> Then again, one could have grabbed 2x 3090s for the price of a 4090 and ended up with 48gb of VRAM in exchange for a very tolerable performance hit.
I agree with this, if someone has a desktop that can fit two GPUs.
Multi-GPU in desktop chassis gets crazy pretty quickly. If you don't care about aesthetics and can figure out both the power delivery and PCI-E lane situation, https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni... has an example that will make a Bitcoin mining rig look clean.
Water cooling can get you down to 2x slot height, with all of the trouble involved in water cooling. NVIDIA really segmented the market quite well. Gamers hate blower cards, but they are the right physical dimensions to make multi-GPU work well, and they are exclusively on the workstation cards.
The main benefit of a GPU in that case is much faster prompt reading. Could be useful for Code Llama cases where you want the model to read a lot of code and then write a line or part of a line.
> The great thing about gguf is that it will cross to system RAM if there isn't enough VRAM.
Then you can just run it entirely on CPU. There's no point buying an expensive GPU to run LLMs only to be bottlenecked by your CPU. Which is why I don't get so excited about these huge models: fewer people can run them locally, so they gain less traction, and fine-tuning is probably more costly too.
GGUF is just a file format. The ability to offload some layers to CPU is not specific to it nor to llama.cpp in general - indeed, it was available before llama.cpp was even a thing.
Fortunately it will run on my UMA Mac. It's made me curious what the trade-offs are. Would I be better off with a 4090, or a Mac with 128+ GB of UMA memory?
Well, the workstation-class equivalent of a 4090 -- RTX 6000 Ada -- has enough RAM to work with a quantized model, but it'll blow away anyone's budget at anywhere between $7,000 and $10,000.
Even the M3 Max seems to be slower than my 3090 for LLMs that fit onto the 3090, but it’s hard to find comprehensive numbers. The primary advantage is that you can spec out more memory with the M3 Max to fit larger models, but with the exception of CodeLlama-70B today, it really seems like the trend is for models to be getting smaller and better, not bigger. Mixtral runs circles around Llama2-70B and arguably ChatGPT-3.5. Mistral-7B often seems fairly close to Llama2-70B.
Microsoft accidentally leaked that ChatGPT-3.5-Turbo is apparently only 20B parameters.
24GB of VRAM is enough to run ~33B parameter models, and enough to run Mixtral (which is a MoE, making direct comparisons to “traditional” LLMs a little more confusing).
I don’t think there’s a clear answer of what hardware someone should get. It depends. Should you give up performance on the models most people run locally in hopes of running very large models, or give up the ability to run very large models in favor of prioritizing performance on the models that are popular and proven today?
M3 Max is actually less than ideal because it peaks at 400 GB/s for memory bandwidth. What you really want is an M1 or M2 Ultra, which offers up to 800 GB/s (for comparison, the RTX 3090 runs at 936 GB/s). A Mac Studio suitable for running 70B models at speeds fast enough for realtime chat can be had for ~$3K.
The downside of Apple's hardware at the moment is that the training ecosystem is very much focused on CUDA; llama.cpp has an open issue about Metal-accelerated training: https://github.com/ggerganov/llama.cpp/issues/3799 - but no work on it so far. This is likely because training at any significant sizes requires enough juice that it's pretty much always better to do it in the cloud currently, where, again, CUDA is the well-established ecosystem, and it's cheaper and easier for datacenter operators to scale. But, in principle, much faster training on Apple hardware should be possible, and eventually someone will get it done.
Yep, I seriously considered a Mac Studio a few months ago when I was putting together an “AI server” for home usage, but I had my old 3090 just sitting around, and I was ready to upgrade the CPU on my gaming desktop… so then I had that desktop’s previous CPU. I just had too many parts already, and it deeply annoys me that Apple won’t put standard, user-upgradable NVMe SSDs on their desktops. Otherwise, the Mac Studio is a very appealing option for sure.
Quantization is highly effective at reducing memory and storage requirements, and it barely has any impact on quality (unless you take it to the extreme). Approximately no one should ever be running the full fat fp16 models during inference of any of these LLMs. That would be incredibly inefficient.
I run 33B parameter models on my RTX 3090 (24GB VRAM) no problem. 70B should easily fit into 64GB of RAM.
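As a rough back-of-the-envelope for why those sizes work (my own rule of thumb, not a measured figure): q4_0 stores blocks of 32 weights as 4-bit quants plus an fp16 scale, which comes out to about 4.5 bits per weight, so a 33B model's weights land around 18-19GB, leaving headroom in 24GB for the KV cache and buffers.

    package vramestimate

    // Back-of-the-envelope memory estimate for a q4_0-quantized model.
    // The 4.5 bits/weight figure comes from q4_0's block layout
    // (32 weights = 16 bytes of 4-bit quants + a 2-byte fp16 scale).
    // The overhead constant is a loose guess for KV cache and buffers,
    // not a measurement.
    func weightGB(params, bitsPerWeight float64) float64 {
        return params * bitsPerWeight / 8 / 1e9
    }

    func roughTotalGB(params float64) float64 {
        const q4_0Bits = 4.5
        const overheadGB = 3 // context/KV cache, scratch buffers (rough guess)
        return weightGB(params, q4_0Bits) + overheadGB
    }

    // roughTotalGB(33e9) ≈ 21.6 GB -> fits on a 24GB RTX 3090
    // roughTotalGB(70e9) ≈ 42.4 GB -> fits comfortably in 64GB of system RAM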
Can I ask how many tok/s you're getting on that setup? I'm trying to decide whether to invest in a high-end NVIDIA setup or a Mac Studio with llama.cpp for the purposes of running LLMs like this one locally.
On a 33B model at q4_0 quantization, I’m seeing about 36 tokens/s on the RTX 3090 with all layers offloaded to the GPU.
Mixtral runs at about 43 tokens/s at q3_K_S with all layers offloaded. I normally avoid going below 4-bit quantization, but Mixtral doesn’t seem fazed. I’m not sure if the MoE just makes it more resilient to quantization, or what the deal is. If I run it at q4_0, then it runs at about 24 tokens/s, with 26 out of 33 layers offloaded, which is still perfectly usable, but I don’t usually see the need with Mixtral.
Ollama dynamically adjusts the layers offloaded based on the model and context size, so if I need to run with a larger context window, that reduces the number of layers that will fit on the GPU and that impacts performance, but things generally work well.
What's the power consumption and fan noise like when doing that? I assume you're running the model doing inference in the background for the whole coding session, i.e. hours at a time?
I don’t use local LLMs for CoPilot-like functionality, but I have toyed with the concept.
There are a few things to keep in mind: no programmer that I know is sitting there typing code for hours at a time without stopping. There’s a lot more to being a developer than just typing, whether it is debugging, thinking, JIRA, Slack, or whatever else. These CoPilot-like tools will only activate after you type something, then stop for a defined timeout period. While you’re typing, they do nothing. After they generate, they do nothing.
I would honestly be surprised if the GPU active time was more than 10% averaged over an hour. When actively working on a large LLM, the RTX 3090 is drawing close to 400W in my desktop. At a 10% duty cycle (active time), that would be 40W on average, which would be 320Wh over the course of a full 8-hour day of crazy productivity. My electric rate is about 15¢/kWh, so that would be about 5¢ per day. It is absolutely not running at a 100% duty cycle, and it’s absurd to even do the math for that, but we can multiply by 10 and say that if you’re somehow a mythical “10x developer” then it would be 50¢/day in electricity here. I think 5¢/day to 10¢/day is closer to reality. Either way, the cost is marginal at the scale of a software developer’s salary.
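If anyone wants to plug in their own numbers, the arithmetic is just the following (the 400W, 10% duty cycle, and 15¢/kWh figures are my own estimates from above, not universal constants):

    package powercost

    // Rough daily electricity cost of a GPU used for local code completion.
    // dutyCycle is the fraction of the workday the GPU is actually busy (a guess).
    func dailyCostUSD(gpuWatts, dutyCycle, hoursPerDay, dollarsPerKWh float64) float64 {
        kWh := gpuWatts * dutyCycle * hoursPerDay / 1000
        return kWh * dollarsPerKWh
    }

    // dailyCostUSD(400, 0.10, 8, 0.15) = 0.048, i.e. about 5¢ per day.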
That sounds perfectly reasonable. I'm more worried about noise and heat than the cost though, but I guess that's not too bad either then. What's the latency like? When I've used generative image models the programs unload the model after they're done, so it takes a while to generate the next image. Is the model sitting in VRAM when it's idle?
Fan noise isn’t very much, and you can always limit the max clockspeeds on a GPU (and/or undervolt it) to be quieter and more efficient at a cost of a small amount of performance. The RTX 3090 still seems to be faster than the M3 Max for LLMs that fit on the 3090, so giving up a little performance for near-silent operation wouldn’t be a big loss.
Ollama caches the last used model in memory for a few minutes, then unloads it if it hasn’t been used in that time to free up VRAM. I think they’re working on making this period configurable.
Latency is very good in my experience, but I haven’t used the local code completion stuff much, just a few quick experiments on personal projects, so my experience with that aspect is limited. If I ever have a job that encourages me to use my own LLM server, I would certainly consider using it more for that.
It's a legit question; the model will be worse in some way. I've seen it discussed that, all things being equal, more parameters are better (meaning it's better to take a big model and quantize it to fit in memory than to use a smaller unquantized model that fits), but a quantized model wouldn't be expected to run identically to, or as well as, the full model.
You don’t stop being andy99 just because you’re a little tired, do you? Being tired makes everyone a little less capable at most things. Sometimes, a lot less capable.
In traditional software, the same program compiled for 32-bit and 64-bit architectures won’t be able to handle all of the same inputs, because the 32-bit version is limited by the available address space. It’s still the same program.
If we’re not willing to declare that you are a completely separate person when you’re tired, or that 32-bit and 64-bit versions are completely different programs, then I don’t think it’s worth getting overly philosophical about quantization. A quantized model is still the same model.
The quality loss from using 4+ bit quantization is minimal, in my experience.
Yes, it has a small impact on accuracy, but with massive efficiency gains. I don’t really think anyone should be running the full models outside of research in the first place. If anything, the quantized models should be considered the “real” models, and the full fp16/fp32 model should just be considered a research artifact distinct from the model. But this philosophical rabbit hole doesn’t seem to lead anywhere interesting to me.
I don't like the metaphor: when I'm tired, I will be alert again later. Quantization is lossy compression: the human equivalent would be more like a traumatic brain injury affecting recall, especially of fine details.
The question of whether I am still me after a traumatic brain injury is philosophically unclear, and likely depends on specifics about the extent of the deficits.
The impact on accuracy is somewhere in the single-digit percentages at 4-bit quantization, from what I’ve been able to gather. Very small impact. To draw the analogy out further, if the model was able to get an A on a test before quantization, it would likely still get a B at worst afterwards, given a drop in the score of less than 10%. Depending on the task, the measured impact could even be negligible.
It’s far more similar to the model being perpetually tired than it is to a TBI.
You may nitpick the analogy, but analogies are never exact. You also ignore the other piece that I pointed out, which is how we treat other software that comes in multiple slightly different forms.
But we’re talking about a coding LLM here. A single digit percentage reduction in accuracy means, what, one or two times in a hundred, it writes == instead of !=?
I think that’s too simplified. The best LLMs will still frequently make mistakes. Meta is advertising a HumanEval score of 67.8%. In a third of cases, the code generated still doesn’t satisfactorily solve the problem in that automated benchmark. The additional errors that quantization would introduce would only be a very small percentage of the overall errors, making the quantized and unquantized models practically indistinguishable to a human observer. Beyond that, lower accuracy can manifest in many ways, and “do the opposite” seems unlikely to be the most common way. There might be a dozen correct ways to solve a problem. The quantized model might choose a different path that still turns out to work, it’s just not exactly the same path.
As someone else pointed out, FLAC is objectively more accurate than mp3, but how many people can really tell? Is it worth 3x the data to store/stream music in FLAC?
The quantized model would run at probably 4x the speed of the unquantized model, assuming you had enough memory to choose between them. Is speed worth nothing? If I have to wait all day for the LLM to respond, I can probably do the work faster myself without its help. Is being able to fit the model onto the hardware you have worth nothing?
In essence, quantization here is a 95% “accurate” implementation of a 67% accurate model, which yields a 300% increase in speed while using just 25% of the RAM. All numbers are approximate, even the HumanEval benchmark should be taken with a large grain of salt.
If you have a very opulent computational setup, you can enjoy the luxury of the full 67.8% accurate model, but that just feels both wasteful and like a bad user experience.