Hacker News | NitpickLawyer's comments

> Dunno about you but to me it reads as a failure.

???

This is a wild take. Goog is incredibly well positioned to make the best of this AI push, whatever the future holds.

If it goes to the moon, they are up there, with their own hardware, tons of data, and lots of innovations (huge usable context, research towards continuous learning w/ titans and the other one, true multimodal stuff, etc).

If it plateaus, they are already integrating into lots of products, and some of them will stick (office, personal, notebooklm, coding-ish, etc.) Again, they are "self sustainable" on both hardware and data, so they'll be fine even if this thing plateaus (I don't think it will, but anyway).

To see this year as a failure for Google is ... a wild take. No idea what you're on about. They've been tearing it up for the past six months, and Gemini 3 is an insane pair of models (Flash is at or above GPT-5 at a third of the price). And it seems that -flash is a separate architecture in its own right, so no cheeky distillation here. Again, innovations all over the place.


A good rule of thumb is that PP (Prompt Processing) is compute-bound while TG (Token Generation) is bound by (V)RAM bandwidth.
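A rough back-of-envelope sketch of why (all numbers are assumptions for a hypothetical setup, not measurements):

    # Hypothetical 7B model, 4-bit quantized (~4 GB of weights), on a GPU with
    # ~1000 GB/s of memory bandwidth and ~100 TFLOPS of usable compute.
    weight_bytes = 4e9          # ~4 GB of quantized weights
    mem_bandwidth = 1000e9      # bytes per second
    compute = 100e12            # FLOPs per second
    flops_per_token = 2 * 7e9   # rough rule: ~2 * params FLOPs per token

    # TG: each generated token streams the full weights from (V)RAM once,
    # so the ceiling is set by memory bandwidth.
    tg_ceiling = mem_bandwidth / weight_bytes      # ~250 tok/s

    # PP: prompt tokens are processed in large batches, so the weights are
    # reused across many tokens and the ceiling is set by compute.
    pp_ceiling = compute / flops_per_token         # ~7000 tok/s

    print(f"TG ceiling: ~{tg_ceiling:.0f} tok/s (bandwidth-bound)")
    print(f"PP ceiling: ~{pp_ceiling:.0f} tok/s (compute-bound)")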

> (The results also show that I need to examine the apparent timing mismatch between the First and Second Editions.)

Something something, naming things, cache invalidation, timestamp mismatches and off-by-1 errors :)


It could just be as simple as being a later "final" copy of V1, made when it was "done".

> Shows how much more work there is still to be done in this space.

This is why I roll my eyes every time I read doomer content that mentions an AI bubble followed by an AI winter. Even if (and objectively there's 0 chance of this happening anytime soon) everyone stops developing models tomorrow, we'll still have 5+ years of finding out how to extract every bit of value from the current models.


One thing though: if the slowdown is too abrupt, it might make it financially impossible for OpenAI, Anthropic, etc. to keep running the datacenters we all use.

The idea that this technology isn't useful is as ignorant as thinking that there is no "AI" bubble.

Of course there is a bubble. We can see it whenever these companies tell us this tech is going to cure diseases, end world hunger, and bring global prosperity; whenever they tell us it's "thinking", can "learn skills", or is "intelligent", for that matter. Valuations will absolutely drop and the market will crash once the public stops buying the snake oil it's being sold.

But at the same time, a probabilistic pattern recognition and generation model can indeed be very useful in many industries. Many of our problems can be approached by framing them in terms of statistics, and throwing data and compute at them.

So now that we've established that, and we're reaching diminishing returns of scaling up, the only logical path forward is to do some classical engineering work, which has been neglected for the past 5+ years. This is why we're seeing the bulk of gains from things like MCP and, now, "agents".


> This is why we're seeing the bulk of gains from things like MCP and, now, "agents".

This is objectively not true. The models have improved a ton (with data from "tools" and "agentic loops", sure, but it's still the models themselves that have become more capable).

Check out [1], a ~100 LoC "LLM in a loop with just terminal access"; it now scores above last year's heavily harnessed SotA.

> Gemini 3 Pro reaches 74% on SWE-bench verified with mini-swe-agent!

[1] - https://github.com/SWE-agent/mini-swe-agent
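For a sense of how little scaffolding that is, the core of such a setup is roughly the following (a minimal sketch, not the actual mini-swe-agent code; the client, model name, prompts, and stop condition are all assumptions):

    # "LLM in a loop with terminal access" sketch. Not mini-swe-agent itself;
    # model name, prompts, and termination logic are illustrative assumptions.
    import subprocess
    from openai import OpenAI

    client = OpenAI()
    messages = [
        {"role": "system",
         "content": "Solve the task. Reply with exactly one shell command "
                    "per turn, or the single word DONE when finished."},
        {"role": "user",
         "content": "Make `pytest` pass in the repo at ./repo."},
    ]

    for _ in range(50):  # hard cap on turns
        reply = client.chat.completions.create(model="gpt-5", messages=messages)
        command = reply.choices[0].message.content.strip()
        messages.append({"role": "assistant", "content": command})
        if command == "DONE":
            break
        # Run the proposed command and feed the output back as the observation.
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=120)
        messages.append({"role": "user",
                         "content": (result.stdout + result.stderr)[-4000:]})

The point being: the harness here contributes almost nothing, so the benchmark jump has to come from the model.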


I don't understand. You're highlighting a project that implements an "agent" as a counterargument to my claim that the bulk of improvements are from "agents"?

Sure, the models themselves have improved, but not by the same margins from a couple of years ago. E.g. the jump from GPT-3 to GPT-4 was far greater than the jump from GPT-4 to GPT-5. Currently we're seeing moderate improvements between each release, with "agents" taking up center stage. Only corporations like Google are still able to squeeze value out of hyperscale, while everyone else is more focused on engineering.


They're pointing out that the "agent" is just 100 lines of code with a single tool. That means the model itself has improved, since such a bare bones agent is little more than invoking the model in a loop.

That doesn't make sense, considering that the idea of an "agentic workflow" is essentially to invoke the model in a loop. It could probably be done in much less than 100 lines.

This doesn't refute the fact that this simple idea can be very useful. Especially since the utility doesn't come from invoking the model in a loop, but from integrating it with external tools and APIs, all of which requires much more code.

We've known for a long time that feeding the model with high quality contextual data can improve its performance. This is essentially what "reasoning" is. So it's no surprise that doing that repeatedly from external and accurate sources would do the same thing.

In order to back up GP's claim, they should compare models from a few years ago with modern non-reasoning models in a non-agentic workflow. Which, again, is not to say they haven't improved, only that the improvements have been much more marginal than before. It's surprising how many discussions derail because the person chose to argue against a point that wasn't being made.


The original point was that the previous SotA was a "heavily harnessed" agent, which I took to mean it had more tools at its disposal and perhaps some code to manage context and so on. The fact that the model can do it now with just 100 LoC and a terminal tool implies the model itself has improved. At minimum it's gotten better at standard terminal commands, and it possibly has a bigger context window or uses the data in its context window more effectively.

Those are improvements to the model, albeit in service of agentic workflows. I consider that distinct from improvements to agents themselves which are things like MCP, context management, etc.


I think the point here is that it's not about adding agents on top; the improvements in the models are what allow the agentic flow.

But that’s not true, and the linked agentic design is not a counterargument to the poster above. The LLM is a small part of the agentic system.

LLMs have absolutely gotten better at longer-horizon tasks.

Useful technology can still create a bubble. The internet is useful but the dotcom bubble still occurred. There’s expectations around how much the invested capital will see a return and growing opportunity cost if it doesn’t, and that’s what creates concerns about a bubble. If a bubble bursts, the capital will go elsewhere, and then you’ll have an “AI winter” once again

As with many other things (em dashes, emojis, bullet lists, it's-not-x-it's-y constructs, triple adjectives, etc) seeing any one of them isn't a tell. Seeing all of them, or many of them in a single piece of content, is probably the tell.

When you use these tools you get a knack for what they do in "vanilla" situations. If you're doing a quick prompt, no guidance, no context and no specifics, you'll get a type of answer that checks many of the "smells" above. Getting the same over and over again gets you to a point where you can "spot" this pretty effectively.


The author did not do this. The author thought it was wonderful, read the entire thing, then on a lark (they "twigged" it) checked out the edit history. They took the lack of it as instant confirmation ("So it’s definitely AI.")

The rest of the blog is just random subjective morality wank with implications of larger implications, constructed by borrowing the central points of a series of popular articles in their entirety and adding recently popular clichés ("why should I bother reading it if you couldn't bother to write it?")

No other explanations about why this was a bad document, or this particular event at all, but lots of self-debate about how we should detect, deal with, and feel about bad documents. All documents written by LLM are assumed to be bad, and no discussion is attempted about degrees of LLM assistance.

If I used AI to write some long detailed plan, I'd end up going back and forth with it and having it remove, rewrite, rethink, and refactor multiple times. It would have an edit history, because I'd have to hold on to old drafts in case my suggested improvements turned out not to be improvements.

The weirdest thing about the article is that it's about the burden of "verification," but it thinks that what people should be verifying is that LLMs had no part in what they've received. The discussion I've had about "verification" when it comes to LLMs is the verification that the content is not buggy garbage filled with inhuman mistakes. I don't care if it's LLM-created or assisted, other than a lot of people aren't reading and debugging their LLM code, and LLMs are dumb. I'm not hunting for em-dashes.

-----

edit: my 2¢; if you use LLMs to write something, you basically found it. If you send it to me, I want to read your review of it i.e. where you think it might have problems and why you think it would help me. I also want to hear about your process for determining those things.

People are confusing problems with low-effort contributors with problems with LLMs. The problem with low-effort contributors is that what they did with the LLM was low-effort and isn't saving you any work. You can also spend 5 minutes with the LLM. If you get some good LLM output that you think is worth showing to me, and you think it would take significant effort for me to get it myself, give me the prompts. That's the work you did, and there's nothing wrong with being proud of it.


You may be missing the point. The author's feelings about the plan he was sent were predicated on an assumption that he thought was safe: that his co-worker had written the document he claimed to have "put together."

If you order a meal at a restaurant and later discover that the chicken you ate was recycled from another diner’s table (waste not want not!) you would likely be outraged. It doesn’t matter if it tasted good.

As soon as you tell me you used AI to produce something, you force me to review it carefully, unless your reputation for rigorously reviewing your own work is well established. Which it probably isn't, because you are the kind of guy who uses AI to do his work.


It's not just that. There's a lot of (maybe useful) info that's lost without the entire session. And even if you include a jsonl of the entire session, just seeing that is not enough. It would be nice to be able to "click" at some point and add notes / edit / re-run from there w/ changes, etc.

Basically we're at a point where the agents kinda caught up to our tooling, and we need better / different UX or paradigms of sharing sessions (including context, choices, etc)


> Beowulf mentions are all referencing the Old English epic poem

Knowing the HN crowd, it could also be a reference to Beowulf clusters.


This isn't slashdot :)

A third alternative is to use the best of both worlds: have the model respond in free form, then use that response + a structured-output API to ask it for JSON. More expensive, but better overall results. (And you can cross-check your heuristic parsing against the structured output, and retry / alert on mismatches.)
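A minimal sketch of that two-pass flow, assuming an OpenAI-compatible endpoint that supports json_schema structured outputs (the model name, schema, and cross-check are illustrative, not from this thread):

    import json
    from openai import OpenAI

    client = OpenAI()
    review = "Slow start, but the last act is fantastic. Easily a 4 out of 5."

    # Pass 1: free-form answer.
    free_form = client.chat.completions.create(
        model="gpt-5-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Summarize this review and its rating:\n{review}"}],
    ).choices[0].message.content

    # Pass 2: ask for strict JSON, grounded in the free-form answer.
    schema = {
        "type": "object",
        "properties": {"rating": {"type": "integer"},
                       "verdict": {"type": "string"}},
        "required": ["rating", "verdict"],
        "additionalProperties": False,
    }
    structured = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user",
                   "content": f"Extract the fields as JSON:\n\n{free_form}"}],
        response_format={"type": "json_schema",
                         "json_schema": {"name": "extraction", "strict": True,
                                         "schema": schema}},
    ).choices[0].message.content

    data = json.loads(structured)
    # Cross-check pass 2 against pass 1; retry or alert on a mismatch.
    if str(data["rating"]) not in free_form:
        print("mismatch: retry or alert")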

I am doing this with good success parsing receipts with ministral3:14b. The first prompt describes the data being sought, and asks for it to be put at the end of the response. The format tends to vary between json, bulleted lists, and name: value pairs. I was never able to find a good way to get just JSON.

The second pass is configured for structured output via guided decoding, and is asked to just put the field values from the analyzer's response into JSON fitting a specified schema.

I have processed several hundred receipts this way with very high accuracy; 99.7% of extracted fields are correct. Unfortunately it still needs human review because I can't seem to get a VLM to see the errors in the very few examples that have errors. But this setup does save a lot of time.
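For illustration, the schema handed to the second pass in a setup like this might look something like the following (field names are assumptions; the actual schema isn't shown in the comment):

    # Hypothetical receipt schema for the guided-decoding / structured-output
    # layer of the second pass.
    from pydantic import BaseModel

    class ReceiptFields(BaseModel):
        vendor: str
        date: str
        total: float
        currency: str

    # JSON schema used to constrain the second pass on the serving side.
    schema = ReceiptFields.model_json_schema()
    print(schema)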


> rightfully criticised because it steals from artists. Generative AI for source code learns from developers

The double standard here is too much. Notice how one is stealing while the other is learning from? How are diffusion models not "learning from all the previous art"? It's literally the same concept. The art generated is not a 1-1 copy in any way.


IMO, this is key to the issue: learning != stealing. I think it should be acceptable for AI to learn and produce, but not to learn and copy. If the end assets infringe on copyright, that should be dealt with the same way whether they're human- or AI-produced. The quality of the results is another issue.

> I think it should be acceptable for AI to learn and produce, but not to learn and copy.

Ok but that's just a training issue then. Have model A be trained on human input. Have model A generate synthetic training data for model B. Ensure the prompts used to train B are not part of A's training data. Voila, model B has learned to produce rather than copy.

Many state of the art LLMs are trained in such a two-step way since they are very sensitive to low-quality training data.
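A minimal sketch of that two-step setup (the client, model name, prompts, and file path are all placeholders):

    # Model A (trained on human data) answers prompts that were held out of
    # its own training set; the pairs become synthetic training data for
    # model B. Names and paths are illustrative.
    import json
    from openai import OpenAI

    client = OpenAI()
    held_out_prompts = [
        "Write a short poem about a lighthouse.",
        "Explain binary search to a beginner.",
    ]

    with open("synthetic_train.jsonl", "w") as f:
        for prompt in held_out_prompts:
            answer = client.chat.completions.create(
                model="model-a",  # placeholder for the teacher model
                messages=[{"role": "user", "content": prompt}],
            ).choices[0].message.content
            # Chat-style fine-tuning record for model B (the student).
            f.write(json.dumps({"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]}) + "\n")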


> The art generated is not a 1-1 copy in any way.

Yeah right. AI art models can be, and have been, used to copy essentially any artist's style, in ways that make the original artist's hard work and the effort spent honing their craft irrelevant.

Who profits? Some tech company.

Who loses? The artists who now have to compete with an impossibly cheap copy of their own work.

This is theft at a massive scale. We are forcing countless artists whose work was stolen from them to compete with a model trained on their art without their consent and are paying them NOTHING for it. Just because it is impressive doesn’t make it ok.

Shame on any tech person who is okay with this.


Copying a style isn’t theft, full stop. You can’t copyright style. As an individual, you wouldn’t be liable for producing a work of art that is similar in style to someone else’s, and there is an enormous number of artists today whose livelihood would be in jeopardy if that was the case.

Concerns about the livelihood of artists or the accumulation of wealth by large tech megacorporations are valid but aren’t rooted in AI. They are rooted in capitalism. Fighting against AI as a technology is foolish. It won’t work, and even if you had a magic wand to make it disappear, the underlying problem remains.


It's almost like some of these people have never seen artists work before. Taping up photos and cutouts of things that inspire them before starting on a project. This is especially true of concept artists who are trying to do unique things while sticking to a particular theme. It's like going to Etsy for ideas for projects you want to work on. It's not cheating. It's inspiration.

It's a double standard because it's apples and oranges.

Code is an abstract way of soldering cables correctly so the machine does a thing.

Art eludes definition while asking questions about what it means to be human.


I love that in these discussions every piece of art is always high art and some comment on the human condition, never just grunt-work filler, or some crappy display ad.

Code can be artisanal and beautiful, or it can be plumbing. The same is true for art assets.


Exactly! Europa Universalis is a work of art, and I couldn't care less if the horse that you can get as one of your rulers is AI-generated or not. The art is in the fact that you can get a horse as your ruler.

In this case it's this amazing texture of newspapers on a pole: https://rl.bloat.cat/preview/pre/bn8bzvzd80ye1.jpeg?width=16... Definitely some high art there.

I agree, computer graphics and art were sloppified, copied, and corporatized way before AI, so pulling a Casablanca ("I'm shocked, shocked to find that AI is going on in here!") is just hypocritical and quite annoying.

Yeah this was probably for like a stone texture or something. It "eludes definition while asking questions about what it means to be human".

That's a fun framing. Let me try using it to define art.

Art is an abstract way of manipulating aesthetics so that the person feels or thinks a thing.

Doesn't sound very elusive nor wrong to me, while remaining remarkably similar to your coding definition.

> while asking questions about what it means to be human

I'd argue that's more Philosophy's territory. Art only really goes there to the extent coding does with creativity, which is to say

> the machine does a thing

to the extent a programmer has to first invent this thing. It's a bit like saying my body is a machine that exists to consume water and expel piss. It's not wrong, just you know, proportions and timing.

This isn't to say I classify coding and art as the same thing either. I think one can even say that it is because art speaks to the person while code speaks to the machine, that people are so much more uppity about it. Doesn't really hit the same as the way you framed this though, does it?


Are you telling me that, for example, rock texture used in a wall is "asking questions about what it means to be human"?

If some creator with intentionality uses an AI-generated rock texture in a scene where dialogue, events, characters, and camera angles interact to tell a story, does the work no longer ask questions about what it means to be human because the rock texture wasn't made by him?

And in the same vein, is all code just soldering cables so the machine does a thing? The intentionality of game mechanics represented in code, the technical bits that adhere to or work around technical constraints, none of it matters?

Your argument was so bad that it made me reflexively defend Gen AI, a technology that for multiple reasons I think is extremely damaging. Bad rationale is still bad rationale though.


> Art eludes definition while asking questions about what it means to be human.

All art? Those CDs full of clip art from the 90s? The stock assets in Unity? The icons on your computer screen? The designs on your wrapping paper? Some art surely does "[elude] definition while asking questions about what it means to be human", and some is the same uninspired filler that humans have been producing ever since the first teenagers realized they could draw penis graffiti. And everything else is somewhere in between.


The images Clair Obscur generated hardly "elude definition while asking questions about what it means to be human."

The game is art according to that definition while the individual assets in it are not.


You're just someone who can't see the beauty of an elegant algorithm.

Speak for yourself.

I consider some code I write art.


The obfuscated C competition is definitely art

It would be trivial to create something like this, but there are a few major problems with running such a platform that I think make it not worthwhile for anyone (maybe some providers will try it, but it's still tough).

- you will be getting a TON of spam. Just look at all the MCP folks, and how they're spamming everywhere with their claude-vibed mcp implementation over something trivial.

- the security implications are enormous. You'd need a way to vet stuff, moderate, keep track of things and so on. This only compounds with more traffic, so it'd probably be untenable really fast.

- there's probably 0 money in this. So you'd have to put a lot of work in maintaining a platform that attracts a lot of abuse/spam/prompt kiddies, while getting nothing in return. This might make sense to do for some companies that can justify this cost, but at that point, you'd be wondering what's in it for them. And what control do they exert on moderation/curation, etc.

I think the best we'll get in this space is from "trusted" entities (i.e. recognised coders / personalities / etc), from companies themselves (having skills in repos for known frameworks might be a thing, like it is with agents.md), and maybe from the token providers themselves.

