Mixtral works great at 3-bit quantization. It fits onto a single RTX 3090 and runs at about 50 tokens/s. The output quality is not "ruined" at all.
For the amount of money you're talking about, you could also buy two 3090s (~$750 each on eBay) and have 48GB of VRAM to run with less quantization at full speed.
M-series Macs are surprisingly flexible platforms, but they're not "the only" consumer platform that can do Mixtral.
That was my experience as well - the 3-bit version is pretty good.
I also tried the 2-bit version, which was disappointing.
However, there is a new 2-bit approach in the works[1] (merged yesterday) which performs surprisingly well for Mixtral 8x7B Instruct with 2.10 bits per weight (12.3 GB model size).
I could only run the 2-bit (Q2) quant on my 32GB M2 Pro. I was a little disappointed, but I look forward to trying the new approach you linked. For now I just use Mistral’s own service and a third-party hosting service.
After trying the various options for running locally, I have settled on just using Ollama - really convenient and easy, and the serve APIs let me use various LLMs in several different (mostly Lisp) programming languages.
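For anyone who wants to script against it, the serve API is plain HTTP, so any language works. Here's a minimal Python sketch; the endpoint and fields are how I remember the Ollama API, so double-check against the current docs, and the model name assumes you've already pulled it:

    import requests

    # Minimal sketch: call a locally running Ollama server (default port 11434).
    # Assumes `ollama serve` is running and the model was pulled, e.g. `ollama pull mixtral`.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mixtral", "prompt": "Why is the sky blue?", "stream": False},
    )
    print(resp.json()["response"])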
With excellent resources from Hugging Face, tool providers, etc., I hope the user-facing experience for running LLMs gets simplified even further: enter your hardware specs and see only the models that will actually run on your setup. Really, we are close to being there.
Off topic: I hope I don’t sound too lazy, but I am retired (in the last 12 years before retirement I managed a deep learning team at Capital One, and worked for a while at Google and three other AI companies), and I only allocate about 2 hours a day to experimenting with LLMs, so I like to be efficient with my time.
Ollama[1] + Ollama WebUI[2] is a killer combination for offline/fully local LLMs. Takes all the pain out of getting LLMs going. Both projects are rapidly adding functionality including recent addition of multimodal support.
You should be able to run Q3 and maybe even Q4 quants with 32GB, even on the GPU, since you can raise the max RAM allocation with:
'sudo sysctl iogpu.wired_limit_mb=12345'
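For example, on a 32GB machine you could allow roughly 28GB with 'sudo sysctl iogpu.wired_limit_mb=28672' - that value is just an illustration; pick whatever leaves a few GB of headroom for the OS.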
That is a very interesting discussion. Weird to me that the quantization code wasn’t required to be in the same PR. Ika is also already talking about a slightly higher 2.31bpw quantization, apparently.
So you don't see significantly worse performance on 3-bit quantized models compared to 4-bit? Every 7/13b model I tried quantized gave much worse responses at 3-bit and below, whereas the difference from 4-bit to 6- or even 8-bit is more subtle.
Mixtral is a larger model, so maybe that makes it more tolerant of that level of quantization? I’ve been impressed with 3-bit Mixtral, but I haven’t done a ton of side by sides against 4-bit because I haven’t felt the need.
Fair enough. I did put 'ruined' in quotes for a reason - I haven't compared output between Q3 and the Q4_K_M that I use, but you do generally sacrifice output quality at more aggressive quantization levels.
And you're right, you can run it on a multi-GPU setup if you're so inclined.
You can also choose to run at 4-bit quantization, offloading ~27 out of 33 layers to the GPU, and that runs at about 25 tokens/s for me. I think that's about the same speed as you get out of an M1 Max running at 4 bits? Although I'm not sure about the newer M2 or M3 Max chips. Googling around, I didn't immediately see clear benchmarks for those.
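If you're running llama.cpp through the Python bindings, the partial offload is just one parameter. A rough sketch, assuming llama-cpp-python was built with GPU support; the model path and layer count are examples to tune for your VRAM:

    from llama_cpp import Llama

    # Rough sketch: offload ~27 of the 33 layers to the GPU, keep the rest on the CPU.
    # The model path is an example; adjust n_gpu_layers to fit your VRAM.
    llm = Llama(
        model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
        n_gpu_layers=27,
        n_ctx=4096,
    )
    out = llm("[INST] Explain what a mixture-of-experts model is. [/INST]", max_tokens=256)
    print(out["choices"][0]["text"])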
Just as another data point, a CPU-only setup with Q5_K_M would give you roughly 4 tokens per second on a Ryzen laptop (Dell Inspiron 7415 upgraded to 64 GB of RAM).
Nice - that's still pretty solid, although on a more typical 3060 or 3070 with less VRAM available, I probably wouldn't expect numbers quite that good.
My 14" M1 Max does around 30t/s on Mixtral Q4_K_M.
I don't see any reason to believe this new access point won't support standalone, so I just ordered one to replace an older access point (an Amplifi Alien that was in Bridge Mode).
The access point should do nothing except convert between wired and wireless, which is all I ask. I have a separate, wired-only router.
Just to report back on this in case anyone finds this comment in the future... standalone mode almost worked.
On the UniFi U7 Pro, standalone mode was limited to 40MHz channel width on 5GHz and 160MHz channel width on 6GHz, which was a dealbreaker for me. The hardware is capable of 160MHz on 5GHz and 320MHz on 6GHz. So, I installed the controller on a local server and used that to configure the hardware to its fullest potential.
EDIT: Attempting to converse with any Q4_K_M 7B parameter model on a 15 Pro Max... the phone just melts down. It feels like it is producing about one token per minute. MLC-Chat can handle 7B parameter models just fine even on a 14 Pro Max, which has less RAM, so I think there is an issue here.
EDIT 2: Even using StableLM, I am experiencing a total crash of the app fairly consistently if I chat in one conversation, then start a new conversation and try to chat in that. On a related note, since chat history is saved... I don't think it's necessary to have a confirmation prompt if the user clicks the "new chat" shortcut in the top right of a chat.
-----
That does seem much nicer than MLC Chat. I really like the selection of models and saving of conversations.
Microsoft recently re-licensed Phi-2 to be MIT instead of non-commercial, so I would love to see that in the list of models. Similarly, there is a Dolphin-Phi fine tune.
The topic of discussion here is Mistral-7B v0.2, which is also missing from the model list, unfortunately. There are a few Mistral fine tunes in the list, but obviously not the same thing.
I also wish I could enable performance metrics to see how many tokens/sec the model was running at after each message, and to see how much RAM is being used.
Wow, thanks so much for taking the time to test it out and share such great feedback!
Thrilled about all those developments! More model options as well as link-based GGUF downloads on the way.
On the 7b models: I’m very sorry for the poor experience. I wouldn’t recommend 7b over Q2_K at the moment, unless you’re on a 16GB iPad (or an Apple Silicon Mac!). This needs to be much clearer since, as you observed, the consequences can be severe. The larger models, and even 3b Q6_K, can be crash-prone due to memory pressure. Will work on improving handling of low-level out-of-memory errors very soon.
Will also investigate the StableLM crashes, I’m sorry about that! Hopefully Testflight recorded a trace. Just speculating, it may be a similar issue to the larger models, due to the higher-fidelity quant (Q6_K) combined with the context length eventually running out of RAM. Could you give the Q4_K_M a shot? I heard something similar from a friend yesterday, I’m curious if you have a better time with that — perhaps that’s a more sensible default.
Re: the overly-protective new chat alert, I agree, thanks for the suggestion. I’ll incorporate that into the next build. Can I credit you? Let me know how you’d like for me to refer to you, and I’d be happy to.
Finally, please feel free to email me any further feedback, and thanks again for your time and consideration!
I just checked and MLC Chat is running the 3-bit quantized version of Mistral-7B. It works fine on the 14 Pro Max (6GB RAM) without crashing, and is able to stay resident in memory on the 15 Pro Max (8GB RAM) when switching with another not-too-heavy app. 2-bit quantization just feels like a step too far, but I’ll give it a try.
Regarding credit, I definitely don’t need any. Just happy to see someone working on a better LLM app!
FYI, just submitted a new update for review with a few small but hopefully noticeable changes, thanks in no small part to your feedback:
1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.
2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load or crashing.
3. New status indicator for Metal when a model is loaded (filled bolt for enabled vs. slashed bolt for disabled).
4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB had Metal enabled.
The fallback does seem to work! Although the 4-bit 7B models only run at 1 token every several seconds.
I still wish Phi-2, Dolphin Phi-2, and TinyLlama-Chat-v1.0 were available, but I understand you have plans to make it easier to download any model in the future.
What does snappier even mean in this context? The latency from connecting to a server over most network connections isn’t really noticeable when talking about text generation. If the server with a beefy datacenter-class GPU were running the same Mistral you can run on your phone, it would be spitting out hundreds of tokens per second. Most responses would appear on your screen before you blink.
There is no expectation that phones will ever be comparable in performance for LLMs.
Mistral runs at a decent clip on phones, but we’re talking like 11 tokens per second, not hundreds of tokens per second.
Server-based models tend to be only slightly faster than Mistral on my phone because they’re usually running much larger, much more accurate/useful models. Models which currently can’t fit onto phones.
Running models locally is not motivated by performance, except if you’re in places without reliable internet.
These data center targeted GPUs can only output that many tokens per second for large batches. These tokens are shared between hundreds or even thousands of users concurrently accessing the same server.
That’s why, despite these GPUs delivering very high throughput in tokens/second, responses do not appear instantly and individual users observe non-trivial latency.
Another interesting consequence: running these ML models with batch size = 1 (as you do on end-user computers or phones) is practically guaranteed to bottleneck on memory. Compute performance and tensor cores are irrelevant for this use case; the only number that matters is memory bandwidth.
For example, I’ve tested my Mistral implementation on a desktop with an Nvidia 1080 Ti versus a laptop with the Radeon Vega 7 inside a Ryzen 5 5600U. The performance difference between them is close to 10x, and it comes down to memory: 484 GB/second for GDDR5X in the desktop versus 50 GB/second for dual-channel DDR4-3200 in the laptop. That’s despite theoretical compute performance differing only by a factor of 6.6 (10.6 versus 1.6 TFLOPS).
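You can sanity-check that with simple arithmetic. A back-of-the-envelope sketch of mine, assuming every weight is read from memory once per generated token and ignoring KV-cache traffic and cache effects:

    # Back-of-the-envelope: at batch size 1, generating a token is roughly
    # one full pass over the weights, so the ceiling is bandwidth / model size.
    def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    mistral_fp16_gb = 13  # ~7B parameters at 16 bits per weight
    print(max_tokens_per_s(484, mistral_fp16_gb))  # 1080 Ti GDDR5X: ~37 tokens/s ceiling
    print(max_tokens_per_s(50, mistral_fp16_gb))   # dual-channel DDR4-3200: ~3.8 tokens/s ceiling

The ratio of those two ceilings lines up with the roughly 10x difference above.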
> These data center targeted GPUs can only output that many tokens per second for large batches.
No… my RTX 3090 can output 130 tokens per second with Mistral on batch size 1. A more powerful GPU (with faster memory) should easily be able to crack 200 tokens per second at batch size 1 with Mistral.
At larger batch sizes, the token rate would be enormous.
Microsoft’s high performing Phi-2 model breaks 200 tokens per second on batch size 1 on my RTX 3090. TinyLlama-1.1B is 350 tokens per second, though its usefulness may be questionable.
We’re just used to datacenter GPUs being used for much larger models, which are much slower, and cannot fit on today’s phones.
I wonder are you using a quantized version of Mistral? The Nvidia 3090 has 936 GB/second memory bandwidth, so 130 tokens/second works out to about 7.2 GB per token. In the original 16-bit format, the model takes about 13GB.
Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.
> I wonder are you using a quantized version of Mistral?
Yes, we’re comparing phone performance versus datacenter GPUs. That is the discussion point I was responding to originally. That person appeared to be asking when phones are going to be faster than datacenters at running these models. Phones are not running un-quantized 7B models. I was using the 4-bit quantized models, which are close to what phones would be able to run, and a very good balance of accuracy vs speed.
> Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.
I don’t agree… batching will increase latency slightly, but it shouldn’t affect throughput for a single session much if it is done correctly. I admit it probably will have some effect, of course. The point of batching is to make use of the unused compute resources, balancing compute vs memory bandwidth better. You should still be running through the layers as fast as memory bandwidth allows, not stalling on compute by making the batch size too large. Right?
We don’t see these speeds because datacenter GPUs are running much larger models, as I have said repeatedly. Even GPT-3.5 Turbo is huge by comparison, since it is believed to be 20B parameters. It would run at about a third of the speed of Mistral. But, GPT-4 is where things get really useful, and no one knows (publicly) just how huge that is. It is definitely a lot slower than GPT-3.5, which in turn is a lot slower than Mistral.
There are other interesting graphs there; they also measured latency. They found a very strong dependency between batch size and latency, both for the first token (i.e. pre-fill) and for the time between subsequent tokens. Note how batch size = 40 delivers the best throughput in tokens/second for the server, yet the first output token takes almost 4 seconds to generate, probably too slow for an interactive chat.
BTW, I used development tools in the browser to measure latency for the free ChatGPT 3.5, and got about 900 milliseconds till the first token. OpenAI probably balanced throughput versus latency very carefully because their user base is large, and that balance directly affects their costs.
The chart you pointed out is very interesting, but it largely supports my point.
The blue line is easiest to read, so let’s look at how tokens/sec for a single user session scales as the batch size increases. It starts out at about 100 tokens/s for 5 users = 20 tokens/s/user. At the next point, it is about 19 t/s/u. Beyond this point, we start losing some ground, but even by the final data point, it is still over 11 t/s/u.
The throughput is affected by less than 2x even with the most unreasonably large batch size. (Unreasonable, because the time to first token is unacceptable for an interactive chat, as you pointed out.)
But, with a batch size that is balanced appropriately, the throughput for a single user session is effectively unchanged whether the service is batching at N=3 or N=10. (Or presumably N=1, but the chart doesn’t include that.) The time to first token is also a reasonable 1 second delay, which is similar to what OpenAI is providing in your testing.
So, with the right batching balance, batching increases the total throughput of the server, but does not affect the throughput or latency for any individual session very much. It does have some impact, of course. Model size and quantization seem to have a much larger impact than batching, from an end user standpoint.
This model does not appear to be fine-tuned for chat. I observed the same looping behavior with virtually any direct prompt.
If I prime it with a pattern of Q and A with several examples of a good question and a good answer, then a final question with a blank answer, it provides a more cohesive and less repetitious answer. To me, this indicates it just needs fine-tuning with an instruction dataset to perform more "normally".
Obviously, a model this tiny is going to have very little inherent knowledge, and it's probably not going to be great at anything other than speed.
For example, I asked Mixtral to generate 4 questions and short answers following a prompt format that I provided. Then I used that output as the prompt for LiteLlama along with a new question:
Q: What is the capital city of France?
A: The capital city of France is Paris, a global center for art, fashion, gastronomy, and culture. It is located in the north-central part of the country and has an estimated population of 12.4 million in its wider metropolitan area.
Q: Who was the first person to walk on the moon?
A: The first person to walk on the moon was Neil Armstrong, as part of the Apollo 11 mission on July 20, 1969. When he stepped onto the lunar surface, he famously said, "That's one small step for man, one giant leap for mankind."
Q: What is the chemical symbol for gold?
A: The chemical symbol for gold is Au, derived from its Latin name, aurum, which means 'shining dawn.' Gold is a transition metal that is bright, slightly reddish yellow, dense, malleable, and resistant to corrosion and tarnish.
Q: What is the tallest mountain in the world?
A: The tallest mountain in the world is Mount Everest, located on the border of Nepal and Tibet. It has a height of approximately 8,849 meters (29,032 feet) above sea level. The first recorded summit of Everest was achieved by Sir Edmund Hillary and Tenzing Norgay in 1953.
Q: What is the LHC? What has it accomplished?
A:
And this is the output I received from LiteLlama:
The Large Hadron Collider (LHC) is a particle accelerator at CERN in Geneva, Switzerland. It is a super-high-energy proton-proton collider that is designed to collide protons with protons at the LHC.
Q: [Omitted since I don't care. You would normally stop the model when it responds with a stop word like "Q:", which it did.]
Obviously that response is not going to blow anyone away, but it didn't get stuck in a loop. It knew that LHC stood for "Large Hadron Collider", and it also knew that CERN and Geneva were relevant.
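For anyone who wants to reproduce this kind of test, the stop-word handling is a one-liner in most local runtimes. A hedged sketch using llama-cpp-python; the GGUF filename is a placeholder and the few-shot prompt is abbreviated:

    from llama_cpp import Llama

    # Sketch: few-shot Q/A prompting with "Q:" as a stop sequence, so generation
    # halts as soon as the model starts writing the next question on its own.
    llm = Llama(model_path="./LiteLlama-460M-1T.Q8_0.gguf", n_ctx=2048)  # placeholder filename

    few_shot = (
        "Q: What is the capital city of France?\n"
        "A: The capital city of France is Paris.\n\n"
        "Q: What is the LHC? What has it accomplished?\n"
        "A:"
    )

    out = llm(few_shot, max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"].strip())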
Like a number of Google projects, this one lives on without any clear direction. It probably will get axed some day, but the technology in Teachable Machine today is so “old school” already that I don’t think it would be that hard for someone to recreate or improve upon it.
Not the person you replied to, but… Microsoft apparently revealed that GPT-3.5 Turbo is 20 billion parameters. Gemini Pro seems to perform only slightly better than GPT-3.5 Turbo according to some benchmarks, and worse in others. If Gemini Pro is significantly larger than 20 billion, that would be embarrassing for Google. If it is significantly smaller, that would be good for Google.
It seems reasonable to me to assume it’s somewhere in the neighborhood of 20 billion, but I agree it is worthwhile to recognize that we don’t actually know.
I don't think it would necessarily be embarrassing for Google because Gemini Pro is multimodal, while GPT-3.5 Turbo is text-only. Given this difference it wouldn't seem too unrealistic to me if Gemini Pro was bigger, but it seems like we just don't know.
Even so, Google treats the Gemini Pro Vision model as a separate model from Gemini Pro, so it could have separate parameters that are dedicated to vision (like CogVLM does), and that wouldn’t impact the size of the model as far as text-tasks are concerned.