
That paper is from over a year ago, and it compared against codex-davinci... which was basically GPT-3, from what I understand. Saying >100B makes it sound a lot more impressive than it is in today's context... 100B models today are a lot more capable. The researchers also compared against a couple of other ancient (and now irrelevant) small models that don't give me much insight.

FLAME seems like a fun little model, and 60M is truly tiny compared to other LLMs, but I have no idea how good it is in today's context, and it doesn't seem like they ever released it.


I would like to disagree that it's irrelevant. If anything, the 100B models are the irrelevant ones in this context and should be seen as a "fun inclusion" rather than a serious baseline worth comparing against. Its outperforming a 100B model at the time makes for a fun bragging point, but that's not the core value of the method or the paper.

Running a prompt against every single cell of a 10k-row document was never going to happen with a large model. Even using a transformer architecture in the first place could be seen as ludicrous overkill, though it is feasible on modern machines.

So I'd say the paper is very relevant, and the top commenter in this very thread demonstrated their own homegrown version with a very nice use case (sorting paper abstracts and titles to put together a summary paper).


> Running a prompt against every single cell of a 10k row document was never gonna happen with a large model

That isn't the main point of FLAME, as I understood it. The main point was to help you when you're editing a particular cell. codex-davinci was used for real-time Copilot tab completions for a long time, I believe, and editing a single formula in a spreadsheet is far less demanding than editing code in a large document.

After I posted my original comment, I realized I should have pointed out that I'm fairly sure we have 8B models that handily outperform codex-davinci these days… further driving home how irrelevant the claim of ">100B" was here (not talking about the paper). Plus, an off-the-shelf model like Qwen2.5-0.5B (a 494M model) could probably be fine-tuned to compete with (or dominate) FLAME if you had access to the FLAME training data — there is probably no need to train a model from scratch, and a 0.5B model can easily run on any computer that can run the current version of Excel.

You may disagree, but my point was that claiming a 60M model outperforms a 100B model just means something entirely different today. Putting that in the original comment higher in the thread creates confusion, not clarity, since the models in question are very bad compared to what exists now. No one had clarified that the paper was over a year old until I commented… and FLAME was being tested against models that seemed to be over a year old even when the paper was published. I don’t understand why the researchers were testing against such old models even back then.


Someone mentioned generating millions of (very short) stories with an LLM a few weeks ago: https://news.ycombinator.com/item?id=42577644

They linked to an interactive explorer that nicely shows the diversity of the dataset, and the HF repo links to the GitHub repo that has the code that generated the stories: https://github.com/lennart-finke/simple_stories_generate

So, it seems there are ways to get varied stories.


I was wondering where the traffic came from, thanks for mentioning it!


Maybe function calling using JSON blobs isn't even the optimal approach... I saw some stuff recently about having LLMs write Python code to execute what they want, and LLMs tend to be a lot better at Python without any additional function-calling training. Some of the functions exposed to the LLM can be calls into your own logic.

Some relevant links:

This shows that Python-calling performance is supposedly better than JSON-calling performance for a range of existing models: https://huggingface.co/blog/andthattoo/dpab-a#initial-result...

A little post about the concept: https://huggingface.co/blog/andthattoo/dria-agent-a

Huggingface has their own "smolagents" library that includes "CodeAgent", which operates by the same principle of generating and executing Python code for the purposes of function calling: https://huggingface.co/docs/smolagents/en/guided_tour

smolagents can either use a local LLM or a remote LLM, and it can either run the code locally, or run the code on a remote code execution environment, so it seems fairly flexible.
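
To make the idea concrete, here's a minimal sketch roughly following the smolagents guided tour linked above. Treat the specifics as assumptions on my part: the class names (CodeAgent, HfApiModel, the @tool decorator) are what the docs showed when I looked, and lookup_order is just an invented stand-in for a call into your own logic.

    from smolagents import CodeAgent, HfApiModel, tool

    @tool
    def lookup_order(order_id: str) -> str:
        """Look up an order in your own backend (stand-in implementation).

        Args:
            order_id: The ID of the order to look up.
        """
        return f"Order {order_id}: shipped"  # replace with a call into your own logic

    # The agent writes and executes Python that calls lookup_order() directly,
    # instead of emitting a JSON function-call blob for you to dispatch.
    agent = CodeAgent(tools=[lookup_order], model=HfApiModel())
    print(agent.run("What is the status of order 1234?"))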


I noticed some surprising load times on Codeberg’s Forgejo instance. For example:

- The first page of releases (out of only 63 releases total) takes 3–5.5 seconds to load: https://codeberg.org/forgejo/forgejo/releases

- The issues page is faster at ~1.5 seconds, but still a bit slow: https://codeberg.org/forgejo/forgejo/issues

- The commits page for the main branch (22k commits) is much slower, taking over 11 seconds: https://codeberg.org/forgejo/forgejo/commits/branch/forgejo

These load times were surprising to me given the relatively small amount of data being loaded (just the first page of results)... It feels like there could be an inefficient query at play here? The HTML responses aren’t huge (~400kB), and my ping to Codeberg (~125ms, US<-->Berlin?) shouldn’t be a major factor when just loading a single HTML document without factoring in other resources. I also have gigabit internet, and while there could be bottlenecks between here and Europe, they surely wouldn’t be responsible for slowing things down to this degree.

For comparison, I’ve run Gitea on a local server, and it’s been lightning fast, even with larger datasets. For example, on a test with the Linux kernel repo (1.3M commits), Gitea rendered the first page of commits in under 500ms. That’s a stark contrast to the 11 seconds it took on Codeberg’s Forgejo instance for just 22k commits.

I wonder if this slowness is more of a Codeberg hosting issue or something inherent to Forgejo, but I haven't tried Forgejo locally.


If you want an explanation for the slowness, look here: https://forgeperf.org/

I believe "Tree (worst case)" is the benchmark that reflects Codeberg's slowness when viewing the commits page.


It agrees with what I was seeing, but it doesn't really seem to explain much. I still don't know whether the problem is Codeberg-specific or Forgejo-specific, or why either of them would be slow for this task when a Gitea instance local to me is much faster (even accounting for ping latency).


I thought it was an interesting post, so I tried to add Railway's blog to my RSS reader... but it didn't work. I tried searching the page source for RSS and found nothing there either. Eventually, I noticed the RSS icon in the top right, but it's some kind of special button that I can't right-click and copy the link from, and Safari prevents me from seeing what the URL is... so I had to open the page in Firefox to find it.

Could be worth adding a <link rel="alternate" type="application/rss+xml"> tag to the <head> so that RSS readers can autodiscover the feed. A random link I found on Google: https://www.petefreitag.com/blog/rss-autodiscovery/


> My conversation quickly began to approach the context window for the LLM and some RAG engineering is very necessary to keep the LLM informed about the key parts of your history

Assuming we're talking about GPT-4o, that 128k context window theoretically corresponds to somewhere around 73,000 words. People talk at around 100 words per minute in conversation, so that would be about 730 minutes of context, or about 12 hours. The Gemini models can do up to 2 million tokens of context... which we could extrapolate to 11,400 minutes of context (190 hours), which might be enough?

I would say GPT-4o was only good up to about 64k tokens the last time I really tested large context stuff, so let's call that 6 hours of context. In my experience, Gemini's massive context windows are actually able to retain a lot of information... it's not like there's only 64k usable or something. Google has some kind of secret sauce there.

One could imagine architecting the app to use Gemini's Context Caching[0] to keep response times low, since it wouldn't need to re-process the entire session for every response. The application would just spin up a new context cache in the background every 10 minutes or so and delete the old one, reducing the amount of recent conversation that would have to be re-processed each time to generate a response.
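
A rough sketch of what that might look like with the google-generativeai Python SDK, going off the caching docs at [0]; treat the model name and exact API surface as assumptions on my part, since they may have changed:

    import datetime
    import google.generativeai as genai
    from google.generativeai import caching

    transcript_so_far = "...everything said up to the last cache refresh..."
    latest_player_input = "We sneak past the guards and head for the vault."

    # Cache the bulk of the session so each response only re-processes recent turns.
    # (Caching has a minimum content size, so this only pays off for long transcripts.)
    cache = caching.CachedContent.create(
        model="models/gemini-1.5-flash-001",
        system_instruction="You are the DM for this D&D session.",
        contents=[transcript_so_far],
        ttl=datetime.timedelta(minutes=15),
    )

    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    response = model.generate_content(latest_player_input)
    print(response.text)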

I've just never seen RAG work particularly well... and fitting everything into the context is very nice by comparison.

But, one alternative to RAG would be a form of context compression... you could give the LLM several tools/functions for managing the context. The LLM would be instructed to use these tools to record (and update) the names and details of the characters, places, and items that the players encounter, the important events that have occurred during the game, and information about who the current players are and what items and abilities they have. The LLM would then be provided with this "memory" in the context in place of a complete conversational record, plus (for example) only the most recent 15 or 30 minutes of conversation.
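
Here's a purely hypothetical sketch of what those memory tools might look like; every name is invented, and in practice each function would be exposed to the LLM through whatever function-calling mechanism you're already using:

    # In-memory "campaign state" that replaces the full transcript in the prompt.
    memory = {"characters": {}, "events": [], "party": {}}

    def remember_character(name: str, notes: str) -> None:
        """Create or update the record for a character, place, or item."""
        memory["characters"][name] = notes

    def log_event(summary: str) -> None:
        """Append an important event to the campaign history."""
        memory["events"].append(summary)

    def update_player(player: str, items_and_abilities: str) -> None:
        """Track who the current players are and what they can do."""
        memory["party"][player] = items_and_abilities

    def build_prompt(recent_transcript: str) -> str:
        """What actually gets sent to the model: compressed memory plus recent talk."""
        return f"CAMPAIGN MEMORY:\n{memory}\n\nLAST 15-30 MINUTES:\n{recent_transcript}"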

> I found the LLM to be too pliable as a DM.

I haven't tried using an LLM as a DM, but in my experience, GPT-4o is happy to hold its ground on things. This isn't like the GPT-3.5 days where it was a total pushover for anything and everything. I believe the big Gemini models are also stronger than the old models used to be in this regard. Maybe you just need a stricter prompt for the LLM that tells it how to behave?

I also think the new trend of "reasoning" models could be very interesting for use cases like this. The model could try to (privately) develop a more cohesive picture of the situation before responding to new developments. You could already do this to some extent by making multiple calls to the LLM, one for the LLM to "think", and then another for the LLM to provide a response that would actually go to the players.
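
A hedged sketch of that two-call pattern with the OpenAI Python client (the prompts, the dm_turn name, and the "hidden notes" wiring are all just illustrative choices, not a prescribed recipe):

    from openai import OpenAI

    client = OpenAI()

    def dm_turn(history: list[dict], player_input: str) -> str:
        # Pass 1: private "thinking" that never reaches the players.
        plan = client.chat.completions.create(
            model="gpt-4o",
            messages=history + [
                {"role": "user", "content": player_input},
                {"role": "system", "content": "Privately reason about consequences, "
                 "consistency, and plot. Do not address the players."},
            ],
        ).choices[0].message.content

        # Pass 2: the in-character reply, informed by the hidden plan.
        return client.chat.completions.create(
            model="gpt-4o",
            messages=history + [
                {"role": "system", "content": f"Hidden DM notes: {plan}"},
                {"role": "user", "content": player_input},
            ],
        ).choices[0].message.content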

One could also imagine giving the LLM access to other functions that it could call, such as the ability to play music and sound effects from a pre-defined library of sounds, or to roll the dice using an external random number generator.

> 4. Most importantly, I found that I most enjoy the human connection that I get through DnD and an LLM with a voice doesn't really satisfy that.

Sure, maybe it's not something people actually want... who knows. But, I think it looks pretty fun.[1]

One of the harder things with this would be helping the LLM learn when to speak and when to just let the players talk amongst themselves. A simple solution could just be to have a button that the players can press when they want, which will then trigger the LLM to respond to what's been recently said, but it would be cool to just have a natural flow.

[0]: https://ai.google.dev/gemini-api/docs/caching

[1]: https://www.youtube.com/watch?v=9oBdLUEayGI


The usual bottleneck for self-hosted LLMs is memory bandwidth. It doesn't really matter whether there are integrated graphics or not... the models will run at the same (very slow) speed on the CPU either way. Macs are only decent for LLMs because Apple has given Apple Silicon unusually high memory bandwidth, but they're still nowhere near as fast as a high-end GPU with extremely fast VRAM.

For extremely tiny models like the ones you would use for tab completion, even an old AMD CPU is probably going to do okay.
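
To put rough numbers on that (back-of-the-envelope only; the bandwidth figures are ballpark, and real decode speed is lower due to overhead), the usual approximation is that generating each token requires streaming all of the model's weights from memory once:

    # Rough upper bound on decode speed: memory bandwidth / bytes of weights read per token.
    def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float) -> float:
        return bandwidth_gb_s / model_size_gb

    print(max_tokens_per_sec(8, 50))    # ~6 tok/s: 8GB of weights on a dual-channel DDR4 CPU
    print(max_tokens_per_sec(8, 400))   # ~50 tok/s: same model with Apple Silicon "Max"-class bandwidth
    print(max_tokens_per_sec(8, 1000))  # ~125 tok/s: same model in high-end GPU VRAM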


Good to know. It also looks like you can host TabbyML as an on-premise server with docker and serve requests over a private network. Interesting to think that a self-hosted GPU server might become a thing.


gpt-4o-mini might not be the best point of reference for what good LLMs can do with code: https://aider.chat/docs/leaderboards/#aider-polyglot-benchma...

A teeny tiny model such as a 1.5B model is really dumb and not good at interactively generating code in a conversational way, but models at 3B or below can do a good job of suggesting tab completions.

There are larger "open" models (in the 32B - 70B range) that you can run locally that should be much, much better than gpt-4o-mini at just about everything, including writing code. For a few examples, llama3.3-70b-instruct and qwen2.5-coder-32b-instruct are pretty good. If you're really pressed for RAM, qwen2.5-coder-7b-instruct or codegemma-7b-it might be okay for some simple things.

> medium specced macbook pro

"Medium specced" doesn't mean much. How much RAM do you have? Each "B" (billion) of parameters is going to require about 1GB of RAM, as a rule of thumb (500MB for really heavily quantized models, 2GB for un-quantized models... but 8-bit quants use 1GB, and that's usually fine).
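
That rule of thumb is just parameters times bytes per weight, ignoring context/KV-cache overhead; a quick sketch:

    # Approximate RAM needed for the weights alone, ignoring KV cache and other overhead.
    def approx_weight_ram_gb(params_billions: float, bits_per_weight: int = 8) -> float:
        return params_billions * bits_per_weight / 8

    approx_weight_ram_gb(7)        # ~7 GB:   7B model at 8-bit
    approx_weight_ram_gb(32, 4)    # ~16 GB:  32B model at 4-bit
    approx_weight_ram_gb(70, 16)   # ~140 GB: un-quantized (fp16) 70B model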


Also, context size significantly impacts RAM/VRAM usage, and in programming, those chats get big quickly.


Thanks for your explanation! Very helpful!


Google's research blog does not seem to provide this, but many blogs include Open Graph metadata[0] indicating when the article was published or modified:

    article:published_time - datetime - When the article was first published.
    article:modified_time - datetime - When the article was last changed.
For example, I pulled up a random article on another website, and found these <meta> tags in the <head>:

    <meta property="article:published_time" content="2025-01-11T13:00:00.000Z">
    <meta property="article:modified_time" content="2025-01-11T13:00:00.000Z">
For pages that contain this metadata, it would be a cheaper/faster implementation than using an LLM, but using an LLM as a fallback could easily provide you with the publication date of this Google article.
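
A minimal sketch of that metadata-first approach (assuming requests and BeautifulSoup; the function name is mine, and you'd fall back to the LLM whenever the tag is missing):

    import requests
    from bs4 import BeautifulSoup

    def published_time(url: str) -> str | None:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        tag = soup.find("meta", property="article:published_time")
        return tag["content"] if tag else None  # None -> fall back to the LLM approach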

[0]: https://ogp.me/


Your original solution of binding to 127.0.0.1 generally seems fine. Also, if you're spinning up a web app and its supporting services all in Docker, and you're really just running this on a single $3/mo instance... my unpopular opinion is that docker compose might actually be a fine choice here. Docker compose makes it easy for these services to talk to each other without exposing any of them to the outside network unless you intentionally set up a port binding for those services in the compose file.


You should try Swarm. It solves a lot of challenges that you would otherwise have while running production services with Compose. I built rove.dev to trivialize setup and deployments over SSH.


What does swarm actually do better for a single-node, single-instance deployment? (I have no experience with swarm, but on googling it, it looks like it is targeted at cluster deployments. Compose seems like the simpler choice here.)


Swarm works just as well in a single-host environment. It is very similar to Compose in semantics, but it also does basic orchestration that you would otherwise have to hack into Compose, like running multiple instances of a service and blue/green deployments. And then if you need to grow later, it can of course run services on multiple hosts. The main footgun is that the Swarm management port does not have any security on it, so it needs to be locked down either with Rove or with manual ufw config.


Interesting; in my mind, Swarm was more or less dead, and the next step after docker+compose or podman+quadlet was k3s. I will check out Rove, thanks!


That was rumored for a while, but Swarm is still maintained! I wouldn't count on it getting the latest and greatest Compose format support, though.

