> I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.
No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?
I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.
A useless benchmark is useless.
In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.
You have to sign into the first one to do anything at all, and you have to sign into the second one if you want access to the new, larger 405B model.
Llama 3.1 is certainly going to be available through other platforms in a matter of days. Groq supposedly offered Llama 3.1 405B yesterday, but I never once got it to respond, and now it’s just gone from their website. Llama 3.1 70B does work there, but 405B is the one that’s supposed to be comparable to GPT-4o and the like.
I also use Capture One, and I actually liked it significantly better than Lightroom when I did a side by side comparison of them a couple of years ago.
Lightroom is starting to get some HDR processing capabilities that are interesting to me, but that one feature by itself isn't currently worth paying Adobe's crazy subscription prices just to use a program that I otherwise don't enjoy.
The option to pay is still listed as coming soon, but I also see pricing information on the settings page, so maybe it really is close. I'm seeing $0.05/1M input and $0.10/1M output for Llama 3 8B, which is not exactly identical to what the previous person quoted.
Either way, I wish Groq would offer a real service to people willing to pay.
That quote is referring to the A100... the H100 used ~75% more power to deliver "up to 9x faster AI training and up to 30x faster AI inference speedups on large language models compared to the prior generation A100."[0]
Which sure makes the H100 sound both faster and more efficient (per unit of compute) than the TPU v4, given what was in your quote. I don't think your quote does anything to support the position that TPUs are noticeably better than Nvidia's offerings for this task.
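To put rough numbers on the perf-per-watt point, here is a quick back-of-the-envelope calculation using only the figures in that quote (these are Nvidia's "up to" marketing numbers, so treat the result as an upper bound, not a measurement):

```python
# Rough check using the figures quoted above (marketing numbers, not benchmarks).
a100_power = 1.0          # normalize A100 power draw to 1
h100_power = 1.75         # "~75% more power"
train_speedup = 9.0       # "up to 9x faster AI training"
infer_speedup = 30.0      # "up to 30x faster AI inference" on large language models

print(f"Training perf/watt vs A100:  ~{train_speedup / h100_power:.1f}x")
print(f"Inference perf/watt vs A100: ~{infer_speedup / h100_power:.1f}x")
# ~5.1x and ~17.1x respectively -- which is why "75% more power" alone
# says nothing about efficiency per unit of compute.
```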
Complicating this is that the TPU v5 generation has already come out, and the Nvidia B100 generation is imminent within a couple of months. (So, no, a comparison of TPUv5 to H100 isn't for a future paper... that future paper should be comparing TPUv5 to B100, not H100.)
The official Mistral-7B-v0.2 model added support for 32k context, and I think it's far better than MistralLite. Third-party finetunes are rarely amazing at the best of times.
Now, we have Mistral-7B-v0.3, which is supposedly an even better model:
Qwen1.5-0.5B supposedly supported up to 32k context as well, but I can't even get it to summarize a ~2k token input with any level of coherence.
I'm always excited to try a new model, so I'm looking forward to trying Qwen2-0.5B... but I wouldn't get your hopes up this much. These super tiny models seem far more experimental than the larger LLMs.
Phi-3-mini (3.8B) supports a 128k context, and it is actually a reasonably useful model in my tests. Gemma-1.1-2B-it is a 2B model that only supports 8k context, but it also does fairly well for summarization.
Summarization is one of the most difficult tasks for any LLM, and over a context window that large it's hard to believe a model this small could pull it off.
That context window is useful if you have a smaller data extraction task, like pulling out dates, times, place names, etc. And even then it might need to be fine-tuned for the task. These small models are a feedstock.
What tasks do you consider a 3.8B model to be useful for? Chat applications on lesser hardware, maybe, but I'm still finding it difficult to see what the real-world application would ever be. I do understand that the goal is to make the smallest, most efficient model that can one day compete with the capabilities of larger models, and you can't get there without building these. But do these kinds of models have any value for an actual product or real-world project today?
I think most of the interesting applications for these small models are in the form of developer-driven automations, not chat interfaces.
A common example that keeps popping up is a voice recorder app that can provide not just a transcription of the recording (which you don't need an LLM for), but also a summary of the transcription, including key topics, key findings, and action items that were discussed in a meeting. With speaker diarization (automatically assigning portions of the transcript to different speakers), an LLM can even put names to each speaker if they identified themselves at some point in the meeting, and from there it can work out who is supposed to be handling each action item, if that was discussed. That's just scratching the surface of what should be possible using small LLMs (or SLMs, as Microsoft likes to call them).
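As a rough sketch of what that kind of pipeline could look like (not any particular product's implementation), here is how you might hand a diarized transcript to a small local model via llama-cpp-python; the model path and prompt wording are placeholders:

```python
# Sketch only: summarize a diarized meeting transcript with a small local LLM.
# Assumes llama-cpp-python and a quantized GGUF file on disk (path is a placeholder).
from llama_cpp import Llama

llm = Llama(model_path="./phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096, verbose=False)

transcript = """[Alice] Let's ship the beta on Friday.
[Bob] I still need to finish the billing page, I can have it done Thursday.
[Alice] OK, Bob owns billing, I'll handle the release notes."""

response = llm.create_chat_completion(
    messages=[
        {"role": "system",
         "content": "You summarize meeting transcripts. List key topics, "
                    "decisions, and action items with the person responsible."},
        {"role": "user", "content": transcript},
    ],
    max_tokens=256,
    temperature=0.2,
)
print(response["choices"][0]["message"]["content"])
```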
An on-device LLM could summarize notifications if you have a lot of catching up to do, or it could create a title for a note automatically once you finish typing the note, or it could be used to automatically suggest tags/categories for notes. That LLM could be used to provide "completions", like if the user is writing a list of things in a note, the user could click a button to have that LLM generate several more items following the same theme. That LLM can be used to suggest contextually-relevant quick replies for conversations. In a tightly-integrated system, you could imagine receiving a work phone call, and that LLM could automatically summarize your recent interactions with that person (across sms, email, calendar, and slack/teams) for you on the call screen, which could remind you why they're calling you.
LLMs can also be used for data extraction, where they can be given unstructured text, and fill in a data structure with the desired values. As an example, one could imagine browsing a job posting... the browser could use an LLM to detect that the primary purpose of this webpage is a job posting, and then it could pass the text of the page through the LLM and ask the LLM to fill in common values like the job title, company name, salary range, and job requirements, and then the browser could offer a condensed interface with this information, as well as the option to save this information (along with the URL to the job posting) to your "job search" board with one click.
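A minimal sketch of that extraction step, again assuming llama-cpp-python and a placeholder model path; a real implementation would want retries or constrained decoding, since small models don't always emit valid JSON:

```python
# Sketch: extract structured fields from a job-posting page with a small local LLM.
import json
from llama_cpp import Llama

llm = Llama(model_path="./phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096, verbose=False)

page_text = "Acme Corp is hiring a Senior Data Engineer. $140k-$170k. Requires Python and SQL."

prompt = (
    "Extract the following fields from the job posting below and reply with "
    'JSON only, using exactly these keys: "job_title", "company", '
    '"salary_range", "requirements" (a list of strings).\n\n'
    f"Job posting:\n{page_text}\n\nJSON:"
)

out = llm(prompt, max_tokens=256, temperature=0.0)["choices"][0]["text"]

try:
    job = json.loads(out.strip())
except json.JSONDecodeError:
    job = None  # small models sometimes emit invalid JSON; retry or use a grammar

print(job)
```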
Now, it might be a little much to ask a browser to have special cases just for job postings, when there are so many similar things a user might want to save for later. So you could even let the user define new "boards", where they describe to a (hopefully larger) LLM the purpose of the board and the kinds of information they're looking for, and it would generate the search parameters and data extraction tasks that a smaller LLM would then run in the background as they browse, letting the browser surface that information when it's available so the user can choose whether to save it to the board. The larger LLM could still potentially be on-device, but a more powerful LLM that occupies most of the RAM and processing on your device is something you'd only want to use for a foreground task, not something eating up resources in the background.
LLMs are interesting because they make it possible to do things that traditional programming could not do in any practical sense. If something can be done without an LLM, then absolutely... do that. LLMs are very computationally intensive, and their accuracy is more like a human than a computer. There are plenty of drawbacks to LLMs, if you have another valid option.
If you are resource limited, remember that you can also play with the quantization to fit more parameters into less RAM. Phi-3-mini [1] (a 3.8B model) is 7.64GB at full (16-bit floating point) precision, but only 2.39GB when quantized to 4 bits.
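For anyone who wants to sanity-check those numbers: weight memory is roughly parameters × bits per weight. A quick sketch (weights only; the published GGUF file sizes come out a bit higher because some tensors stay at higher precision and the quantization blocks carry scale metadata):

```python
# Rough memory math for the weights only (KV cache and runtime overhead come on top).
params = 3.8e9  # Phi-3-mini

for name, bits in [("fp16", 16), ("q8_0", 8), ("q4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name:>5}: ~{gib:.2f} GiB")
# fp16 ≈ 7.08 GiB, q8_0 ≈ 3.54 GiB, q4 ≈ 1.77 GiB -- real q4 GGUF files land
# around 2.3-2.4 GB because of mixed-precision tensors and block metadata.
```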
That being said, I haven't personally tested it, but I have heard good things about CodeGemma 2B [2].
CodeGemma-2b does not come in the "-it" (instruction tuned) variant, so it can't be used in a chat context. It is just a base model designed for tab completion of code in an editor, which I agree it is pretty good at.
Are you saying that every function can only be called with a consistent set of types in the parameters? It’s not possible to call a function two times and supply parameters that have different fields on them?
Unless that is true, the end result might not be ideal. You could certainly record the types that show up at every line of code at runtime, and then take the union of the observed parameter types, which would only expose the fields that are present on every call. That would be fairly okay, I guess.
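For illustration, here is one way the runtime-recording idea could look in Python; the decorator and function names are hypothetical, and "the fields that are always there" is computed as the intersection of the field sets observed across calls:

```python
# Sketch of the runtime-recording idea: log which fields each call actually passes,
# then intersect across calls to find the fields that are always present.
import functools

observed_fields: dict[str, set[str]] = {}

def record_param_fields(fn):
    @functools.wraps(fn)
    def wrapper(param: dict, *args, **kwargs):
        seen = set(param.keys())
        if fn.__name__ in observed_fields:
            observed_fields[fn.__name__] &= seen   # keep only fields seen on every call
        else:
            observed_fields[fn.__name__] = seen
        return fn(param, *args, **kwargs)
    return wrapper

@record_param_fields
def handle(param: dict):
    return param.get("id")

handle({"id": 1, "name": "a"})
handle({"id": 2, "email": "b@example.com"})
print(observed_fields)  # {'handle': {'id'}} -- only 'id' is safe in the inferred type
```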
Have you considered enforcing a grammar on the LLM when it is generating SQL? This could ensure that it only generates syntactically valid SQL, including awareness of the valid set of field names and their types, and such.
It would not be easy, by any means, but I believe it is theoretically possible.
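For a concrete picture of what grammar-constrained generation looks like, here is a toy sketch using llama.cpp's GBNF support via llama-cpp-python; the grammar, model path, and schema are placeholders, and the point is only that the sampler can be restricted to known tables, columns, and syntax:

```python
# Sketch: constrain generation to a tiny SQL subset with a GBNF grammar.
from llama_cpp import Llama, LlamaGrammar

SQL_GBNF = r'''
root  ::= "SELECT " cols " FROM orders" cond? ";"
cols  ::= col (", " col)*
col   ::= "id" | "customer_name" | "total" | "created_at"
cond  ::= " WHERE " col " " op " " value
op    ::= "=" | ">" | "<"
value ::= [0-9]+ | "'" [a-zA-Z0-9 ]+ "'"
'''

llm = Llama(model_path="./some-7b-q4.gguf", n_ctx=2048, verbose=False)
grammar = LlamaGrammar.from_string(SQL_GBNF.strip())

out = llm(
    "Write a SQL query for: total of orders placed by 'Acme' -- ",
    grammar=grammar,
    max_tokens=64,
)
print(out["choices"][0]["text"])
# Guaranteed to match the grammar; not guaranteed to be the *right* query.
```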
In our experience building louie.ai for a continuous-learning variant of text2query (and for popular DBs beyond SQL), getting the syntax right via a symbolic lint phase is a nice speedup, but not the main correctness issue. For syntax, bigger LLMs are generally right on the first shot, and an agent loop autocorrects quickly when the DB returns a syntax error.
Much more time for us goes to things like:
* Getting the right table and column name spellings
* Disambiguating typos when users define names, and deciding whether they mean a specific name or are using a shorthand
* Disambiguating which to use when there are multiple candidates for the same thing (hint: this needs to be learned from usage, not by static schema analysis)
* Guard rails, such as on perf
* Translation from non-technical user concepts to analyst concepts
* Enterprise DB schemas are generally large and often blow out the LLM context window, or make things slow, expensive, and lossy if you rely on giant context windows (see the retrieval sketch after this list for one common mitigation)
* Learning and team modes so the model improves over time. User teaching interfaces are especially tricky once you expose them: fuzzy vs. explicit learning modes, avoiding data leakage, etc.
* A lot of power comes from being part of an agentic loop with other tools like Python and charting, which creates a 'composition' problem that requires AI optimization across any sub-AIs
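Purely as an illustration of one common mitigation for the schema-size point above (not a claim about how louie.ai handles it): retrieve only the most relevant table descriptions for a given question, and put just those into the text2query prompt. A minimal sketch with sentence-transformers and placeholder schemas:

```python
# Sketch: pick the top-k relevant tables by embedding similarity before prompting.
import numpy as np
from sentence_transformers import SentenceTransformer

table_docs = {
    "orders":    "orders(id, customer_id, total, created_at) -- one row per purchase",
    "customers": "customers(id, name, email, region)",
    "tickets":   "tickets(id, customer_id, status, opened_at) -- support tickets",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
question = "total revenue by region last quarter"

q_vec = model.encode([question])[0]
doc_vecs = model.encode(list(table_docs.values()))

scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
top = sorted(zip(table_docs, scores), key=lambda t: -t[1])[:2]

schema_snippet = "\n".join(table_docs[name] for name, _ in top)
print(schema_snippet)  # only these schemas go into the text2query prompt
```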
We have been considering open-sourcing this layer of louie.ai, but it hasn't been a priority for our customers, who are the analyst orgs using our UIs on top of it (Splunk, OpenSearch, Neo4j, Databricks, ...), and occasionally building their own internal tools on top of our API. Our focus has been building a sustainable, high-quality project, and these OSS projects seem very difficult to sustain without also solving that, which is hard enough as-is.
Nobody in the original post or this entire discussion said anything about OpenAI until your comment.
I thought it was fairly obvious that we were talking about a local LLM agent... if DataHerald is a wrapper around only OpenAI, and no other options, then that seems unfortunate.
The agent is LLM-agnostic, and you can use it with OpenAI or self-hosted LLMs. For self-hosted LLMs, we have benchmarked performance with Mixtral for tool selection and CodeLlama for code generation.
There is extremely little quality loss from dropping to 4-bit for LLMs, and that “extremely little” becomes “virtually unmeasurable” loss when going to 8-bit. No one should be running these models on local devices at fp16 outside of research, since fp16 makes them half as fast as q8_0 and requires twice as much RAM for no benefit.
If a model is inadequate for a task at 4-bit, then there's virtually no chance it's going to be adequate at fp16.
Microsoft has also been doing a lot of research into smaller models with the Phi series, and I would be surprised if Phi3 (or a hypothetical Phi4) doesn’t show up at some point under the hood.
I had already read the comment I was responding to, and they actually mentioned both.
Here's the exact quote for the 7B:
"Even running a 7B will take 14GB if it's fp16."
Since they called out a specific amount of memory that is entirely irrelevant to anyone actually running 7B models, I was responding to that.
I'm certain that no one at Microsoft is talking about running 70B models on consumer devices. 7B models are actually a practical consideration for the hardware that exists today.
> > Since they called out a specific amount of memory that is entirely irrelevant to anyone actually running 7B models, I was responding to that.
> Which is correct, fp16 takes two bytes per weight, so it will be 7 billion * 2 bytes which is exactly 14GB.
As I said, it is "entirely irrelevant", which is the exact wording I used. Nowhere did I say that the calculation was wrong for fp16. Irrelevant numbers like that can be misleading to people unfamiliar with the subject matter.
No one is deploying LLMs to end users at fp16. It would be a huge waste and provide a bad experience. This discussion is about Copilot+, which is all about managed AI experiences that "just work" for the end user. Professional-grade stuff, and I believe Microsoft has good enough engineers to know better than to deploy fp16 LLMs to end users.