“prompt” is arguably a misnomer. In other implementations, it is correctly called “initial-prompt”, and even the whisper.cpp help describes it as an initial prompt.
It only affects the first 30 second window of the transcription, as far as I’ve been able to tell. If the word in question appears in that window, then it will influence the next window, and so on… but as soon as it doesn’t exist in one 30 second window, it’s effectively gone, from what I’ve seen.
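For reference, this is roughly how it gets passed with the openai-whisper Python package (a minimal sketch; the audio file and vocabulary list here are made up, and whisper.cpp exposes the equivalent option mentioned above):

    # Sketch: the initial_prompt only conditions the first 30-second window;
    # each later window is conditioned on the previously decoded text instead.
    import whisper

    model = whisper.load_model("base")
    result = model.transcribe(
        "meeting.wav",
        initial_prompt="Kubernetes, Grafana, Jane Okonkwo",
    )
    print(result["text"])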
It’s “underused” precisely because this feature is pretty much useless if you’re transcribing anything other than quick snippets of speech.
It’s also hard to use since you have to know in advance what hard-to-transcribe words are going to be in the audio.
We need a better solution. It would be much better if there were an easy way to fine tune Whisper to learn new vocab.
> It’s “underused” precisely because this feature is pretty much useless if you’re transcribing anything other than quick snippets of speech.
I'm not sure why you're so dismissive when real-time transcription is an important use-case that falls under that bucket of "quick snippets".
> It’s also hard to use since you have to know in advance what hard-to-transcribe words are going to be in the audio.
I think it's more context-dependent than it is "hard". It's ideal for streaming meeting transcripts. In my use-cases, I use the prompt to feed in participant names, company terms/names, and other potential words. It's also much easier to just rattle off a list of potential words that you know are going to be in the transcription that are difficult or spelled differently.
> We need a better solution. It would be much better if there were an easy way to fine tune Whisper to learn new vocab.
Prompting is infinitely easier than fine-tuning in every aspect. I can reuse the same model in any context and just swap out the prompt. I don't have to spend time/money finetuning... I don't have to store multiple fine-tuned copies of whisper for different contexts... I'm not sure what better solution you envision but fine-tuning is certainly not easier than prompting.
Real time transcription is not necessarily short snippets. In my experience, initial prompt is useless beyond the first 30 seconds if the words in the initial prompt aren’t used every 30 seconds, including the first 30.
It may be easy to rattle off a list of words, but it doesn’t work nearly as well as it should, so what’s the point? I also never said fine tuning would be easier than prompting. I said it would be better. It would just need to be easier than fine tuning currently is, not easier than prompting.
Fine tuning that I’m talking about would not be limited to only a few new words. You would only need one model, like we have today. It would just be your model that knows all the specific words and spellings you prefer. By analogy to other machine learning models, I would expect a lightweight LoRA approach would also work.
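To be concrete about the LoRA idea, something like a PEFT adapter on the Hugging Face Whisper checkpoints is what I have in mind. A rough sketch only: the checkpoint name and target modules are assumptions, and you would still need audio/text pairs and a training loop on top of this:

    # Attach LoRA adapters to Whisper's attention projections so only a tiny
    # fraction of the weights gets trained on the custom vocabulary.
    from transformers import WhisperForConditionalGeneration
    from peft import LoraConfig, get_peft_model

    base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    lora = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()  # typically well under 1% of the weights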
I just haven’t seen anyone working on these solutions that would actually be scalable, unlike the initial prompt.
Initial prompt works in extremely specific scenarios, but it has been so unreliable for long transcripts in my experience that I certainly don’t bother with it anymore. Someone mentioned Alexa-style home assistants, which would have short enough audio snippets that initial prompt would actually be useful.
But it will influence the initial text generated, which influences the subsequent text as well. So it theoretically influences the whole thing, just diluted and indirectly.
The article links to their huggingface page[0], which offers both chat and non-chat models, and they appear to come with the code necessary to run them, but I have not actually tried to do so.
Except, I intentionally don’t use either form because they don’t extend nicely (and I dislike using flags when I could use another pipe segment or positional arguments when I could use standard input). I can iterate quickly by adding pipeline segments on top of the basic formula, because all of the segments have the same general shape and “calling convention”.
Finally, because I’ve built up familiarity with the shell over my career, I can come up with this formula as fast as I can type it. At this point, ChatGPT would slow me down: (1) because this sort of thing is basically muscle memory and (2) I have to actually think about the code ChatGPT produces to verify it is correct, which is almost as difficult as producing it in the first place.
I laid out the constraints, but I did not mention reservoir sampling at all. The script seems to work as expected when I run it against a dictionary file.
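For context, what it produced was essentially textbook reservoir sampling (Algorithm R), along these lines (my paraphrase, not its verbatim output):

    # Keep k lines chosen uniformly at random from stdin in one O(n) pass,
    # using only O(k) memory.
    import random
    import sys

    k = 100
    reservoir = []
    for i, line in enumerate(sys.stdin):
        if i < k:
            reservoir.append(line)
        else:
            j = random.randint(0, i)  # inclusive upper bound
            if j < k:
                reservoir[j] = line
    sys.stdout.writelines(reservoir)

Run against a dictionary file with something like: python sample.py < /usr/share/dict/words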
Not bad, but suppose the dictionary has n lines and you only want to randomly sample k=100 of them, where n is so huge that you don't want to scan over the whole file at all.
Can you use random access into the file to sample k lines in O(k) time instead of O(n) time?
That is a problematic request for multiple obvious reasons, and for those same reasons, ChatGPT resisted providing an implementation that didn't require indexing the file. By telling it "no indexing is allowed, provide a best effort solution" it relented and provided a best effort solution.
> That is a problematic request for multiple obvious reasons
I'd prefer to think it's more like a real engineering problem, and less like a simple interview question :-)
And it definitely shows the limits of GPT here: it pointed out that the ends of the file might be tricky, but ignored the very conceptually simple solution of considering the file as circular (if you go past either end you simply wrap around).
And it misses the real problem with its implementation: the probability of sampling each line is now directly proportional to the length of the line before it (because it seeks into that line first and then skips it!)
So the word after "begins" is twice as likely to come up as the word after "and".
PS in the case of dictionary words with a length limit of say 30 letters, there is still an O(k) general solution using rejection sampling.
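For the curious, that rejection-sampling idea looks roughly like this: land on a uniformly random byte, take the line containing it (which favours long lines), then accept it with probability 1/len(line) to cancel the bias. A sketch under the stated max-length assumption; it samples with replacement:

    import os
    import random

    def sample_lines(path, k, max_len=31):  # 30 chars + newline
        size = os.path.getsize(path)
        out = []
        with open(path, "rb") as f:
            while len(out) < k:
                pos = random.randrange(size)
                # Back up far enough that the chunk surely contains the start
                # of the line holding byte `pos`, then cut that line out.
                start = max(0, pos - max_len)
                f.seek(start)
                chunk = f.read(2 * max_len + 1)
                idx = pos - start
                line_start = chunk.rfind(b"\n", 0, idx) + 1
                nl = chunk.find(b"\n", idx)
                line_end = nl if nl != -1 else len(chunk)
                line = chunk[line_start:line_end + 1]  # trailing newline included
                # P(landing in a line) is proportional to its length, so
                # accepting with probability 1/len(line) makes every line
                # equally likely. With a ~30-byte cap the acceptance rate is
                # at least about 1 in 31, hence O(k) expected time overall.
                if random.random() < 1.0 / len(line):
                    out.append(chunk[line_start:line_end].decode())
        return out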
"Remember, this is a probabilistic approach and works well if the lines in your file are roughly the same length. If the line lengths vary significantly, some lines will have a higher or lower chance of being selected."
It had already addressed "the real problem with its implementation" that you pointed out.
> PS in the case of dictionary words with a length limit of say 30 letters, there is still an O(k) general solution using rejection sampling.
Again, what ChatGPT wrote:
"In a typical scenario where lines can have variable lengths, true O(k) random sampling isn't feasible without some prior knowledge about the file."
Knowing that the limit is 30 characters without question counts as "some prior knowledge".
As an interviewer, it sounds like you're not hearing what the candidate is saying.
> And it definitely shows the limits of GPT here
I don't think anyone here is claiming that ChatGPT is limitless. The topic is "a coder considers the waning days of the craft", not "a coder considers the bygone days of the craft." ChatGPT is capable of solving many real world problems already. If it continues improving, some people are concerned about what that could mean, especially for less experienced developers.
How many people have you interviewed with that brainteaser that have actually provided the complete solution you're looking for? Vanishingly few, I would imagine, unless you were dropping some serious hints. It's not a real world problem. Most brainteasers have solutions that are "conceptually simple" once you already know the solution.
> I'd prefer to think it's more like a real engineering problem, and less like a simple interview question
It's absolutely not, though. It's exactly like the infamous trick questions that many tech interviews are known for, which have nothing to do with real engineering that you would encounter on the job.
You might as well have someone invert a binary tree for all the value that it provides.
For 128k context (even with a 7B model), I don't think 8GB is nearly enough. I've heard there might be tricks to get it under 24GB... but I haven't personally seen it under about 28GB, IIRC. Definitely not heard of it being possible with 8GB.
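Back-of-envelope for why, assuming a Llama/Mistral-style 7B (32 layers, 128-dim heads, 8 KV heads thanks to GQA, fp16 cache; exact shapes vary by model):

    # KV cache = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes/element
    layers, kv_heads, head_dim, tokens, fp16_bytes = 32, 8, 128, 128_000, 2
    kv_cache_gb = 2 * layers * kv_heads * head_dim * tokens * fp16_bytes / 1e9
    print(kv_cache_gb)  # ~16.8 GB for the cache alone

Add roughly 14GB for fp16 weights (or around 4GB at 4-bit) and you're right in that 20-30GB range; without grouped-query attention it would be about 4x worse.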
No... TBW scales with the size of the disk. The 870 EVO is not uniformly rated for 2400TBW. Different technologies will have different TBW per GB of storage, but for a line of SSDs with the same technology, it almost always scales linearly by the amount of storage.
The 250GB model of the 870 EVO is warrantied for 150TBW. Many of the laptops Apple sells with 8GB of RAM only come with a 256GB SSD. Surely nobody is buying an Apple laptop with 8GB of RAM and a 4TB SSD.
Is 5 months a good enough lifespan for a computer?
(At the claimed 1TB written per day, 150TBW lasts only about 150 days.)
I am personally skeptical that most people would be seeing 1TBW per day, but I firmly believe that 8GB is unjustifiably low. Apple offers 24GB as an option, so they could (and should) offer 12GB as the base spec if they're unwilling to make 16GB the base spec.
"Warrantied TBW for 870 EVO: 150 TBW for 250 GB model, 300 TBW for 500 GB model, 600 TBW for 1 TB model, 1,200 TBW for 2 TB model and 2,400 TBW for 4 TB model"[0]
Apple weirdly limits SSD size to config. You can't buy a 4TB 8GB M3 MBP. It caps out at 2TB. If you want 4TB you need a M3 Pro or Max and 8TB is also only available with a Max for example.
OpenAI offering 128k context is very appealing, however.
I tried some Mistral variants with larger context windows, and had very poor results… the model would often offer either an empty completion or a nonsensical completion, even though the content fit comfortably within the context window, and I was placing a direct question either at the beginning or end, and either with or without an explanation of the task and the content. Large contexts just felt broken. There are so many ways that we are more than “two weeks” from the open source solutions matching what OpenAI offers.
And that’s to say nothing of how far behind these smaller models are in terms of accuracy or instruction following.
For now, 6-12 months behind also isn’t good enough. In the uncertain case that this stays true, then a year from now the open models could be perfectly adequate for many use cases… but it’s very hard to predict the progression of these technologies.
It's very compelling and opens up a lot of use cases, so I've been keeping an eye out for advancements. However, to get a reasonable token rate out of YaRN's 128K version of Mistral, you'd be targeting something like 4xA100s today.
The person I replied to had decided to compare Mistral to what was launched, so I went along with their comparison and showed how I have been unsatisfied with it. But, these open models can certainly be fun to play with.
Regardless, where did you find 1.8T for GPT-4 Turbo? The Turbo model is the one with the 128K context size, and the Turbo models tend to have a much lower parameter count from what people can tell. Nobody outside of OpenAI even knows how many parameters regular GPT-4 has. 1.8T is one of several guesses I have seen people make, but the guesses vary significantly.
I’m also not convinced that parameter counts are everything, as your comment clearly implies, or that chinchilla scaling is fully understood. More research seems required to find the right balance: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...
Nah, it's training quality and context saturation.
Grab an 8K context model, tweak some internals and try to pass 32K context into it - it's still an 8K model and will go glitchy beyond 8K unless it's trained at higher context lengths.
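One common version of that tweak is stretching the RoPE positions at load time, e.g. something like this with transformers (illustrative only; the checkpoint name is a placeholder). Without further training at the longer length, the model still degrades past the context it was trained on:

    # Load a Llama-style model with its rotary position embeddings linearly
    # stretched 4x. The weights never saw these positions during training, so
    # quality tends to fall apart well before the nominal extended limit.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        rope_scaling={"type": "linear", "factor": 4.0},
    )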
Anthropic, for example, talk about the model's ability to spot words when the entire Great Gatsby novel is loaded into context. It's a hint at how the model is trained.
Parameter count is just one unified metric; what seems to be important is embedding dimensionality to transfer information through the layers, and the layers themselves to both store and process the nuance of information.
Let's just agree it's 100x-300x more parameters, and let's assume the OpenAI folks are pretty smart and have a sense for the optimal number of tokens to train on.
This definitely. Andrej Karpathy himself mentions tuned weight initialisation in one of his lectures. The TinyGPT code he wrote goes through it.
He also walks through the raw mathematics of log likelihoods and the ballpark loss values you should expect.
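(For example, the ballpark at initialisation: if the softmax is roughly uniform over the vocabulary, cross-entropy should start near -ln(1/vocab_size).)

    # Expected starting loss with a uniform prediction over the vocabulary.
    # For GPT-2's 50257-token vocab that's about 10.8; a first-step loss far
    # from this is a hint that the initialisation is off.
    import math

    vocab_size = 50257
    print(-math.log(1 / vocab_size))  # ≈ 10.82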
Interesting low-level stuff. These researchers are the best of the best working for the company that can afford them working on the best models available.
That's my take-away from limited attempts to get Code Llama 2 Instruct to implement a moderately complex spec as well, whether using the special INST and SYS tokens or just pasting some spec text into a 12k context, even though Code Llama 2 can supposedly honor up to 100k tokens. And I don't even know how to combine code infilling with an elaborate spec text that exceeds the volume of what normally goes into code comments. Is ChatGPT 4 really any better?
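For reference, the prompt shape those tokens imply, as documented for Llama 2 chat models and Code Llama Instruct (the system text and spec here are placeholders):

    # Hand-built Llama 2 / Code Llama Instruct chat template: [INST]...[/INST]
    # wraps the user turn, and <<SYS>>...<</SYS>> wraps the system prompt
    # inside the first instruction.
    system = "You are a careful senior engineer. Follow the spec exactly."
    spec = "..."  # placeholder for the actual spec text
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{spec} [/INST]"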