How? It's larger than 64GB.


Quantization is highly effective at reducing memory and storage requirements, and it has barely any impact on quality (unless you take it to the extreme). Approximately no one should be running the full-fat fp16 models for inference with any of these LLMs. That would be incredibly inefficient.

I run 33B parameter models on my RTX 3090 (24GB VRAM) no problem. 70B should easily fit into 64GB of RAM.
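The back-of-the-envelope math (rough Python, assuming ~2 bytes per parameter at fp16 and roughly 4.5 bits per parameter for a q4_K-style quant, ignoring KV cache and other overhead):

    # rough, illustrative numbers only
    params = 70e9

    fp16_gb = params * 2 / 1e9        # 16 bits/param   -> ~140 GB
    q4_gb = params * 4.5 / 8 / 1e9    # ~4.5 bits/param -> ~39 GB

    print(f"fp16: ~{fp16_gb:.0f} GB, 4-bit: ~{q4_gb:.0f} GB")

So the fp16 weights alone blow well past 64GB, while a 4-bit quant leaves plenty of headroom for the OS and context.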


Can I ask how many tok/s you're getting on that setup? I'm trying to decide whether to invest in a high-end NVIDIA setup or a Mac Studio with llama.cpp for the purposes of running LLMs like this one locally.


On a 33B model at q4_0 quantization, I’m seeing about 36 tokens/s on the RTX 3090 with all layers offloaded to the GPU.

Mixtral runs at about 43 tokens/s at q3_K_S with all layers offloaded. I normally avoid going below 4-bit quantization, but Mixtral doesn’t seem fazed. I’m not sure if the MoE architecture makes it more resilient to quantization, or what the deal is. If I run it at q4_0, it runs at about 24 tokens/s with 26 out of 33 layers offloaded, which is still perfectly usable, but I don’t usually see the need with Mixtral.

Ollama dynamically adjusts the number of layers offloaded based on the model and context size, so if I need to run with a larger context window, fewer layers fit on the GPU and performance takes a hit, but things generally work well.
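
If you want to poke at the offloading tradeoff yourself without Ollama, here’s a minimal sketch using llama-cpp-python; the model path and numbers are just placeholders, so adjust them to whatever GGUF file and VRAM you actually have:

    from llama_cpp import Llama

    # n_gpu_layers controls how many transformer layers go to VRAM;
    # -1 (or anything >= the model's layer count) offloads them all.
    # A bigger n_ctx grows the KV cache, which also eats VRAM and
    # means fewer layers will fit on the GPU.
    llm = Llama(
        model_path="mixtral-8x7b-instruct.Q4_0.gguf",  # placeholder path
        n_gpu_layers=-1,
        n_ctx=4096,
    )

    out = llm("Write a Python function that reverses a string.", max_tokens=256)
    print(out["choices"][0]["text"])

Dropping n_gpu_layers to something like 26 mimics what Ollama does for me automatically when the whole model plus context won’t fit in 24GB.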


What's the power consumption and fan noise like when doing that? I assume you're running the model doing inference in the background for the whole coding session, i.e. hours at a time?


I don’t use local LLMs for Copilot-like functionality, but I have toyed with the concept.

There are a few things to keep in mind: no programmer I know sits there typing code for hours at a time without stopping. There’s a lot more to being a developer than just typing, whether it’s debugging, thinking, JIRA, Slack, or whatever else. These Copilot-like tools only activate after you type something and then pause for a defined timeout period. While you’re typing, they do nothing. After they generate, they do nothing.

I would honestly be surprised if the GPU’s active time was more than 10% averaged over an hour. When actively running inference on a large LLM, the RTX 3090 draws close to 400W in my desktop. At a 10% duty cycle (active time), that averages 40W, which is 320Wh over a full 8-hour day of crazy productivity. My electric rate is about 15¢/kWh, so that works out to about 5¢ per day.

The GPU is absolutely not running at a 100% duty cycle, and it’s absurd to even do the math for that, but we can multiply by 10 and say that if you’re somehow a mythical “10x developer” it would be 50¢/day in electricity here. I think 5¢/day to 10¢/day is closer to reality. Either way, the cost is marginal next to a software developer’s salary.
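
For anyone who wants to plug in their own rates, the arithmetic is just (same assumptions as above):

    # assumptions: 400 W while active, 10% duty cycle,
    # 8-hour day, 15 cents/kWh
    watts_active = 400
    duty_cycle = 0.10
    hours = 8
    cents_per_kwh = 15

    kwh_per_day = watts_active * duty_cycle * hours / 1000   # 0.32 kWh
    cents_per_day = kwh_per_day * cents_per_kwh              # ~4.8 cents

    print(f"{kwh_per_day:.2f} kWh/day, ~{cents_per_day:.0f} cents/day")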


That sounds perfectly reasonable. I'm more worried about noise and heat than the cost though, but I guess that's not too bad either then. What's the latency like? When I've used generative image models the programs unload the model after they're done, so it takes a while to generate the next image. Is the model sitting in VRAM when it's idle?


Fan noise isn’t bad, and you can always limit the GPU’s maximum clock speeds (and/or undervolt it) to make it quieter and more efficient at the cost of a small amount of performance. The RTX 3090 still seems to be faster than the M3 Max for LLMs that fit on the 3090, so giving up a little performance for near-silent operation wouldn’t be a big loss.

Ollama caches the last used model in memory for a few minutes, then unloads it if it hasn’t been used in that time to free up VRAM. I think they’re working on making this period configurable.

Latency is very good in my experience, but I haven’t used the local code completion stuff much, just a few quick experiments on personal projects, so my experience with that aspect is limited. If I ever have a job that encourages me to use my own LLM server, I would certainly consider using it more for that.


Thanks! That is really fast for personal use.


I run LLaMA 70B and 120B (frankenmerges) locally on a 2022 Mac Studio with an M1 Ultra and 128GB of RAM. It gives ~7 tok/s for 120B and ~9.5 tok/s for 70B.

Note that the M1/M2 Ultra is quite a bit faster than the M3 Max for this, mostly due to its 800 GB/s vs 400 GB/s memory bandwidth.


Here's an example of megadolphin running on my m2 ultra setup: https://gist.github.com/nullstyle/a9b68991128fd4be84ffe8435f...


I'm aware, but is it still LLaMA 70B at that point?


It's a legit question; the model will be worse in some way. I've seen it discussed that, all things being equal, more parameters are better (meaning it's better to take a big model and quantize it to fit in memory than to use a smaller unquantized model that already fits), but a quantized model wouldn't be expected to run identically to, or as well as, the full model.


You don’t stop being andy99 just because you’re a little tired, do you? Being tired makes everyone a little less capable at most things. Sometimes, a lot less capable.

In traditional software, the same program compiled for 32-bit and 64-bit architectures won’t be able to handle all of the same inputs, because the 32-bit version is limited by the available address space. It’s still the same program.

If we’re not willing to declare that you are a completely separate person when you’re tired, or that 32-bit and 64-bit versions are completely different programs, then I don’t think it’s worth getting overly philosophical about quantization. A quantized model is still the same model.

The quality loss from using 4+ bit quantization is minimal, in my experience.

Yes, it has a small impact on accuracy, but with massive efficiency gains. I don’t really think anyone should be running the full models outside of research in the first place. If anything, the quantized models should be considered the “real” models, and the full fp16/fp32 weights treated as a research artifact. But this philosophical rabbit hole doesn’t seem to lead anywhere interesting to me.

Various papers have shown that 4-bit quantization is a great balance. One example: https://arxiv.org/pdf/2212.09720.pdf


I don't like the metaphor: when I'm tired, I will be alert again later. Quantization is lossy compression: the human equivalent would be more like a traumatic brain injury affecting recall, especially of fine details.

The question of whether I am still me after a traumatic brain injury is philosophically unclear, and likely depends on specifics about the extent of the deficits.


The impact on accuracy is somewhere in the single-digit percentages at 4-bit quantization, from what I’ve been able to gather. Very small impact. To draw the analogy out further, if the model was able to get an A on a test before quantization, it would likely still get a B at worst afterwards, given a drop in the score of less than 10%. Depending on the task, the measured impact could even be negligible.

It’s far more similar to the model being perpetually tired than it is to a TBI.

You may nitpick the analogy, but analogies are never exact. You also ignore the other piece that I pointed out, which is how we treat other software that comes in multiple slightly different forms.


Reminds me of the never-ending MP3 vs FLAC argument.

The difference can be measured empirically, but is it noticeable in real-world usage?


But we’re talking about a coding LLM here. A single digit percentage reduction in accuracy means, what, one or two times in a hundred, it writes == instead of !=?


I think that’s too simplistic. The best LLMs will still frequently make mistakes. Meta is advertising a HumanEval score of 67.8%: in a third of cases, the generated code still doesn’t satisfactorily solve the problem in that automated benchmark. The additional errors that quantization introduces would only be a very small percentage of the overall errors, making the quantized and unquantized models practically indistinguishable to a human observer. Beyond that, lower accuracy can manifest in many ways, and “do the opposite” seems unlikely to be the most common one. There might be a dozen correct ways to solve a problem; the quantized model might choose a different path that still turns out to work, just not exactly the same path.

As someone else pointed out, FLAC is objectively more accurate than mp3, but how many people can really tell? Is it worth 3x the data to store/stream music in FLAC?

The quantized model would run at probably 4x the speed of the unquantized model, assuming you had enough memory to choose between them. Is speed worth nothing? If I have to wait all day for the LLM to respond, I can probably do the work faster myself without its help. Is being able to fit the model onto the hardware you have worth nothing?

In essence, quantization here is a 95% “accurate” implementation of a 67% accurate model, which yields a ~300% increase in speed while using just 25% of the RAM. All numbers are approximate; even the HumanEval benchmark should be taken with a large grain of salt.
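
Spelled out with rough numbers (the 5% relative quantization penalty is my assumption; the real hit depends on the quant and the benchmark):

    base_pass_rate = 0.678   # Meta's advertised HumanEval score
    quant_penalty = 0.05     # assumed relative accuracy loss from ~4-bit quant

    quant_pass_rate = base_pass_rate * (1 - quant_penalty)

    print(f"unquantized: {base_pass_rate:.1%}, quantized: ~{quant_pass_rate:.1%}")

That’s a drop from ~68% to ~64% on an automated benchmark, in exchange for roughly 4x the speed and a quarter of the RAM.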

If you have a very opulent computational setup, you can enjoy the luxury of the full 67.8%-accurate model, but that just feels both wasteful and like a worse user experience.


Yes. Quantization does not reduce the number of parameters. It does not re-train the model.


Sure, quantization reduces information stored for each parameter, not the parameter count.


Quantization can take it under 30GB (with quality degradation).

For example, take a look at the GGUF file sizes here: https://huggingface.co/TheBloke/Llama-2-70B-GGUF
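
A quick sanity check on why only the smaller quants get under 30GB (rough math, ignoring file overhead):

    params = 70e9
    budget_gb = 30

    max_bits_per_weight = budget_gb * 1e9 * 8 / params
    print(f"~{max_bits_per_weight:.1f} bits/weight to stay under {budget_gb} GB")

That works out to roughly 3.4 bits per weight, which is why it takes the 2- and 3-bit quants (with the quality hit that implies) to squeeze a 70B model under 30GB.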



