notsylver's comments

opt-in but afaik they still show up in places unless you disable them


This looks a lot more impressive than a lot of GitHub Copilot alternatives I've seen. I wonder how hard it would be to port this to vscode - using remote models for inline completion always seemed wrong to me, especially with server latency and network issues


Based on the blogpost, this appears to be hosted remotely on baseten. The model just happens to be released openly, so you can also download it, but the blogpost doesn't talk about any intention to help you run it locally within the editor. (I agree that would be cool, I'm just commenting on what I see in the article.)

On the other hand, network latency itself isn't really that big of a deal... a more powerful GPU server in the cloud can typically run so much faster that it can make up for the added network latency and then some. Running locally is really about privacy and offline use cases, not performance, in my opinion.

If you want to try local tab completions, the Continue plugin for VSCode is a good way to try that, but the Zeta model is the first open model that I'm aware of that is more advanced than just FIM.


I'm stuck using somewhat unreliable starlink to a datacenter ~90ms away, but I can run 7b models fine locally. I agree though, cloud completions aren't unusably slow/unreliable for me, it's mostly about privacy and it being really fun.

I tried continue a few times, I could never get consistent results, the models were just too dumb. That's why I'm excited about this model, it seems like a better approach to inline completion and might be the first okay enough™ model for me. Either way, I don't think I can replace copilot until a model can automatically fine tune itself in the background on the code I've written


> Either way, I don't think I can replace copilot until a model can automatically fine tune itself in the background on the code I've written

I don't think Copilot does this... it's really just a matter of the editor plug-in being smart enough to grab all of the relevant context and provide that to the model making the completions; a form of RAG. I believe organizations can pay to fine-tune Copilot, but it sounds more involved than something that happens automatically.

Depending on when you tried Continue last, one would hope that their RAG pipeline has improved over time. I tried it a few months ago and I thought codegemma-2b (base) acting as a code completion model was fine... certainly not as good as what I've experienced with Cursor. I haven't tried GitHub Copilot in over a year... I really should try it again and see how it is these days.


I think it would be more interesting to do this with smaller models (33B-70B) and see if you could get 5-10 tokens/sec on a budget. I've desperately wanted something local that's around the same level as 4o, but I'm not in a hurry to spend $3k on an overpriced GPU or $2k on this


Your best bet for 33B is already having a computer and buying a used RTX 3090 for <$1k. I don't think there are currently any cheap options for 70B that would give you >5 tokens/sec. High memory bandwidth is just too expensive. Strix Halo might give you >5 once it comes out, but will probably be significantly more than $1k for 64 GB RAM.


With used GPUs do you have to be concerned that they're close to EOL due to high utilization in a Bitcoin or AI rig?


I guess it will be a bigger issue the longer it's been since they stopped making them, but most people I've heard from (including me) haven't had any issues. Crypto rigs don't necessarily break GPUs faster because they care about power consumption and run the cards at a pretty even temperature. What probably breaks first is the fans. You might also have to open the card up and repaste/repad it to keep the cooling under control.


awesome thanks!


GPUs were last used for Bitcoin mining in 2013, so you shouldn't be concerned unless you are buying a GTX 780.


M4 Mac with unified GPU RAM

Not very cheap though! But you get a quite usable personal computer with it...


Any that can run 70B at >5 t/s are >$2k as far as I know.


How does inference happen on a GPU with such limited memory compared with the full requirements of the model? This is something I’ve been wondering for a while


You can run a quantized version of the model to reduce the memory requirements, and you can do partial offload, where some of the model is on GPU and some is on CPU. If you are running a 70B Q4, that’s 40-ish GB including some context cache, and you can offload at least half onto a 3090, which will run its portion of the load very fast. It makes a huge difference even if you can’t fit every layer on the GPU.
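
For a concrete feel of partial offload, here's a hedged sketch using the llama-cpp-python bindings (the model filename and layer count are made up; you'd tune n_gpu_layers to whatever fits in VRAM):

  # Sketch only: split a quantized 70B between a 24 GB GPU and system RAM.
  from llama_cpp import Llama

  llm = Llama(
      model_path="llama-70b-q4_k_m.gguf",  # hypothetical ~40 GB Q4 quant
      n_gpu_layers=45,  # as many layers as fit on the 3090; the rest stay on CPU
      n_ctx=4096,       # the context cache needs VRAM too, so leave headroom
  )

  out = llm("Explain partial offload in one sentence.", max_tokens=64)
  print(out["choices"][0]["text"])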


So the more GPUs we have the faster it will be, and we don't have to run the model solely on CPU or GPU -- it can be split between them. Very cool. I think that's how it's running now with my single 4090.


Umm, two 3090s? Additional cards scale as long as you have enough PCIe lanes.


I arbitrarily chose $1k as the "cheap" cut-off. Two 3090 is definitely the most bang for the buck if you can fit them.


Apple M chips with their unified GPU memory are not terrible. I have one of the first M1 Max laptops with 64G and it can run up to 70B models at very useful speeds. Newer M series are going to be faster and they offer more RAM now.

Are there any other laptops around other than the larger M series Macs that can run 30-70B LLMs at usable speeds that also have useful battery life and don’t sound like a jet taxiing to the runway?

For non-portables I bet a huge desktop or server CPU with fast RAM beats the Mac Mini and Studio for price performance, but I’d be curious to see benchmarks comparing fast many core CPU performance to a large M series GPU with unified RAM.


As a data point: you can get an RTX 3090 for ~$1.2k and it runs deepseek-r1:32b perfectly fine via Ollama + open webui at ~35 tok/s in an OpenAI-like web app and basically as fast as 4o.


You mean Qwen 32B fine-tuned on DeepSeek R1 outputs :)

There is only one full DeepSeek R1 model (671B); all the others are fine-tunes of other base models


> you can get an RTX 3090 for ~$1.2k

If you're paying that much you're being ripped off. They're $800-900 on eBay and IMO are still overpriced.


It will be slower for a 70b model since Deepseek is an MoE that only activates 37b at a time. That's what makes CPU inference remotely feasible here.


Would it be something like this?

> OpenAI's nightmare: DeepSeek R1 on a Raspberry Pi

https://x.com/geerlingguy/status/1884994878477623485

I haven't tried it myself or verified the claims, but it seems exciting at least


That's 1.2 t/s for the 14B Qwen finetune, not the real R1. Unless you go with the GPU with the extra cost, but hardly anyone but Jeff Geerling is going to run a dedicated GPU on a Pi.


it's using a Raspberry Pi with a.... USD$1k GPU, which kinda defeats the purpose of using the RPi in the first place imo.

or well, I guess you save a bit on power usage.


I suppose it makes sense, for extremely GPU centric applications, that the pi be used essentially as a controller for the 3090.


Oh, I was naive to think that the Pi was capable of some kind of magic (sweaty smile emoji goes here)


I put together a $350 build with a 3060 12GB and it's still my favorite build. I run Llama 3.2 11B Q4 on it and it's a really efficient way to get started, and the tps is great.


You can run smaller models on a MacBook Pro with ollama at those speeds. Even with several $3k GPUs it won't come close to 4o level.


Could you get the speed and location of the plane and estimate how far it should move across two images to determine if it's the right target?


That would work if inference on the Pi was faster. Right now it takes about 2.5s per image. The planes are in view for maybe 3s. By the time the next frame is fetched, the plane's already out of view.
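
For reference, the displacement estimate itself is trivial; the latency is the problem. A rough sketch with made-up numbers:

  # Hypothetical: how far should the plane move between two frames?
  def expected_shift_px(ground_speed_mps, frame_interval_s, metres_per_pixel):
      return ground_speed_mps * frame_interval_s / metres_per_pixel

  # e.g. 120 m/s ground speed, ~2.5 s between frames, ~1.5 m per pixel at that range
  print(expected_shift_px(120, 2.5, 1.5))  # -> 200 px, i.e. probably already out of frame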


I've had the opposite experience - I tried continue.dev and for me it doesn't come close to Copilot. Especially with Copilot chat having o1-preview and Sonnet 3.5 for so cheap that I might single-handedly bankrupt Microsoft (we can hope), but I tried it before that was available and the inline completions were laughably bad in comparison.

I used the recommended models and couldn't figure it out, I assume I did something wrong but I followed the docs and triple checked everything. It'd be nice to use the GPU I have locally for faster completions/privacy, I just haven't found a way to do that.


The last couple times I tried "continue" it felt like "Step 1" in someone's business plan; bulky and seconds away from converting into a paid subscription model.

Additionally, I've tried a bunch of these (even the same models, etc) and they've all sucked compared to Copilot. And believe me, I want that local-hosted sweetness. Not sure what I'm doing wrong when others are so excited by it.


I just tried Continue and it was death by 1000 paper cuts. And by that I mean 1000 accept/reject blocks.

And at some point I asked to change a pretty large file in some way. It started processing, very very slowly and I couldn't figure out a way to stop it. Had to restart VS Code as it still kept changing the file 10 minutes later.

Copilot was also very slow when I tried it yesterday but at least there was a clear way to stop it.


I doubt it, but it would be interesting if they recorded Stadia sessions and trained on that data (... somehow removing the hud?), seems like it would be the easiest way for them to get the data for this.


Seems somewhat likely to me. They probably even trained a model to do both frame generation and upscaling to allow the hardware to work more efficiently while being able to predict the future based on user input (to reduce perceived latency). Seems like Genie is just that but extrapolated much further.


i've been using it for... years? and it still feels like magic when i use it. v4 fixes my only real issue with it which was the config file feeling so detached and rough compared to everything else.


I've been doing a lot of OCR recently, mostly digitising text from family photos. Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close. It still had enough failures and hallucinations to make it faster to write it in by hand. Annoying considering how close it feels to working.

This seems worse. Sometimes it replies with just the text, sometimes it replies with a full "The image is a scanned document with handwritten text...". I was hoping for some fine tuning or something for it to beat Gemini Flash, it would save me a lot of time. :(


Have you tried downscaling the images? I started getting better results with lower resolution images. I was using scans made with mobile phone cameras for this.

convert -density 76 input.pdf output-%d.png

https://github.com/philips/paper-bidsheets


That's interesting. I downscaled the images to something like 800px, but that was mostly to try to improve upload times. I wonder if downscaling further with a better algorithm would help... I remember using CLIP and finding that different scaling algorithms helped text readability. Maybe the text is just being butchered when it's rescaled.

Though I also tried the high detail setting, which I think would deal with most issues that come from that, and it didn't seem to help much.


>Normal OCR models are terrible at it, LLMs do far better. Gemini Flash came out on top from the models I tested and it wasn't even close.

For Normal models, the state of Open Source OCR is pretty terrible. Unfortunately, the closed options from Microsoft, Google etc are much better. Did you try those ?

Interesting about Flash, what LLMs did you test ?


I tried open source and closed source OCR models, all were pretty bad. Google vision was probably the best of the "OCR" models, but it liked adding spaces between characters and had other issues I've forgotten. It was bad enough that I wondered if I was using it wrong. By the time I was trying to pass the text to an LLM with the image so it could do "touchups" and fix the mistakes, I gave up and decided to try LLMs for the whole task.

I don't remember the exact models, I more or less just went through the OpenRouter vision model list and tried them all. Gemini Flash performed the best, somehow better than Gemini Pro. GPT-4o/mini was terrible and expensive enough that it would have had to be near perfect to consider it. Pixtral did terribly. That's all I remember, but I tried more than just those. I think Llama 3.2 is the only one I haven't properly tried, but I don't have high hopes for it.

I think even if OCR models were perfect, they couldn't have done some of the things I was using LLMs for, like extracting structured information at the same time as the plain text - extracting any dates listed in the text into a standard ISO format was nice, as well as grabbing people's names. Being able to say "Only look at the hand-written text, ignore printed text" and have it work was incredible.
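
To give an idea of what that looked like, here's a hedged sketch of the kind of request I mean, going through OpenRouter's OpenAI-compatible API (the model id, prompt wording, and filenames are just illustrative assumptions):

  import base64
  from openai import OpenAI

  client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

  with open("photo_back.jpg", "rb") as f:
      image_b64 = base64.b64encode(f.read()).decode()

  resp = client.chat.completions.create(
      model="google/gemini-flash-1.5",  # assumed OpenRouter model id
      messages=[{
          "role": "user",
          "content": [
              {"type": "text", "text": (
                  "Transcribe only the hand-written text, ignoring printed text. "
                  "Return the transcription, any dates as ISO 8601, and any people's names."
              )},
              {"type": "image_url",
               "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
          ],
      }],
  )
  print(resp.choices[0].message.content)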


WordNinja is pretty good as a post-processing step on wrongly split/concatenated words:

[0]: https://github.com/keredson/wordninja
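
Typical usage (example string made up):

  import wordninja

  print(wordninja.split("thequickbrownfoxjumps"))
  # -> ['the', 'quick', 'brown', 'fox', 'jumps']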


The OCR in OneNote is incredible IME. But, I've not tested in a wide range of fonts -- only that I have abysmal handwriting and it will find words that are almost unrecognisable.


I've had really good luck recently running OCR over a corpus of images using gpt-4o. The most important thing I realized was that non-fancy data prep is still important, even with fancy LLMs. Cropping my images to just the text (excluding any borders) and increasing the contrast of the image helped enormously. (I wrote about this in 2015 and this post still holds up well with GPT: https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-a...).
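
For what it's worth, the prep step can be as simple as a few lines of Pillow (the crop box and cutoff below are placeholders, not my actual values):

  from PIL import Image, ImageOps

  img = Image.open("page.jpg").convert("L")    # greyscale
  img = img.crop((120, 80, 1600, 2200))        # crop to just the text block
  img = ImageOps.autocontrast(img, cutoff=2)   # boost contrast before sending to GPT
  img.save("page_prepped.png")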

I also found that giving GPT at most a few paragraphs at a time worked better than giving it whole pages. Shorter text = less chance to hallucinate.


Have you tried doing a verification pass: so giving gpt-4o the output of the first pass, and the image, and asking if they can correct the text (or if they match, or...)?

Just curious whether repetition increases accuracy or if it just increases the opportunities for hallucinations?


I have not, but that's a great idea!


That's a bummer. I'm trying to do the exact same thing right now, digitize family photos. Some of mine have German on the back. The last OCR to hit headlines was terrible, was hoping this would be better. ChatGPT 4o has been good though, when I paste individual images into the chat. I haven't tried with the API yet, not sure how much that would cost me to process 6500 photos, many of which are blank but I don't have an easy way to filter them either.


I found 4o to be one of the worst, but I was using the API. I didn't test it, but sometimes it feels like images uploaded through ChatGPT work better than ones sent through the API. I was using Gemini Flash in the end; it seemed better than 4o, and the images are so cheap that I have a hard time believing Google is making any money, even after bandwidth costs

I also tried preprocessing images before sending them through. I tried cropping them to just the text to see if it helped. Then I tried filtering on top of that to try to brighten the text; somehow that all made it worse. The most success I had was just holding the image in my hand and taking a photo of it, the busy background seemed to help but I have absolutely no idea why.

The main problem was that it would work well for a few dozen images, you'd start to trust it, and then it'd hallucinate or not understand a crossed out word with a correction or wouldn't see text that had faded. I've pretty much given up on the idea. My new plan is to repurpose the website I made for verifying the results into one where you enter the text manually, as well as date/location/favourite status.


Use a local rubbish model to extract text. If it doesn't find any on the back, don't send it to ChatGPT?

Terrascan comes to mind


"Terrascan" is a vision model? The only hits I'm getting are for a static code analyzer.


sorry, i meant "Tesseract"
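
for example, a minimal sketch of that filtering idea with pytesseract (paths and the threshold are made up):

  from PIL import Image
  import pytesseract

  def has_text(path, min_chars=5):
      text = pytesseract.image_to_string(Image.open(path))
      return len(text.strip()) >= min_chars

  # only the backs that appear to contain text get sent to the paid model
  if has_text("photo_back_0001.jpg"):
      print("send to the LLM")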


Have you tried Claude?

It's not good at returning the locations of text (yet), but it's insane at OCR as far as I have tested.


As soon as I saw this was part of llamafile I was hoping that it would be used to limit LLM output to always be "valid" code as soon as it saw the backticks, but I suppose most LLMs don't have problems with that anyway. And I'm not sure you'd want something automatically forcing valid code in the first place


llama.cpp does support something like this -- you can give it a grammar which restricts the set of available next tokens that are sampled over

so in theory you could notice "```python" or whatever and then start restricting to valid python code. (at least in theory; not sure how feasible/possible it would be in practice w/ their grammar format.)

for code i'm not sure how useful it would be since likely any model that is giving you working code wouldn't be struggling w/ syntax errors anyway?

but i have had success experimentally using the feature to drive fiction content for a game from a smaller llm to be in a very specific format.
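
as a toy example of the grammar feature (via the llama-cpp-python bindings; the model path is a placeholder and the grammar just forces a bare yes/no answer):

  from llama_cpp import Llama, LlamaGrammar

  grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')
  llm = Llama(model_path="some-small-model.gguf")

  out = llm("Is 7 a prime number? Answer yes or no:", grammar=grammar, max_tokens=4)
  print(out["choices"][0]["text"])  # sampling is constrained to "yes" or "no"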


yeah, i've used llama.cpp grammars before, which is why i was thinking about it. i just think it'd be cool for llamafile to do basically that, but with included defaults so you could, e.g., require JSON output. it could be cool for prototyping or something. but i don't think that would be too useful anyway; most of the time i think you would want to restrict it to a specific schema, so i can only see it being useful for something like a tiny local LLM for code completion, but that would just encourage valid-looking but incorrect code.

i think i just like the idea of restricting LLM output, it has a lot of interesting use cases


gotchya. i do think that is a cool idea actually -- LLMs tiny enough to do useful things with formally structured output but not big enough to nail the structure ~100% is probably not an empty set.


Solved, just not by Tesla, who have been promising it every year since... 2016?

