
I'm curious about what kinds of situations you're seeing where the model consistently does the opposite of your intention even though the instructions were not complex. Do you have any examples?




Mostly Gemini 3 Pro. When I ask it to investigate a bug and propose fixing options (I do this mostly so I can see whether the model loaded the right context for large tasks), Gemini immediately starts fixing things instead, and I just can't trust it.

Codex and Claude give a nice report, and if I see they're not considering this or that, I can tell them.


FYI, that happened to me with Codex.

But why is it a big issue? If it does something bad, just reset the worktree and try again with a different model/agent. They're dirt cheap at $20/month, and I have 4 subscriptions (Claude, Codex, Cursor, Zed).
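
A minimal sketch of that loop, assuming a git repo; try_agent and agent_cmd are hypothetical names I'm using here, and the actual agent CLI invocation varies by tool and version:

    import subprocess

    # Hypothetical helper: run one agent attempt in a throwaway git worktree,
    # so a bad run never touches the main checkout.
    def try_agent(repo, agent_cmd):
        wt = repo + "-attempt"
        # Fresh detached worktree at HEAD.
        subprocess.run(["git", "-C", repo, "worktree", "add", "--detach", wt],
                       check=True)
        try:
            return subprocess.run(agent_cmd, cwd=wt).returncode == 0
        finally:
            # Discard the whole attempt; retrying with another model
            # just means creating a fresh worktree.
            subprocess.run(["git", "-C", repo, "worktree", "remove", "--force", wt],
                           check=True)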


Same, I have multiple subscriptions and layer them. I use Haiku to plan and send a queue of tasks to Codex and Gemini, whose command lines can be scripted.
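
Roughly like this, assuming the non-interactive modes of those CLIs (codex exec, gemini -p); double-check the flags against your installed versions, and the task list is obviously illustrative:

    import subprocess

    # Illustrative queue; in practice the planning model would generate this.
    tasks = [
        "Add input validation to the signup form",
        "Write unit tests for the billing module",
    ]

    # Non-interactive CLI invocations; verify flags for your versions.
    runners = [
        lambda t: ["codex", "exec", t],  # Codex CLI
        lambda t: ["gemini", "-p", t],   # Gemini CLI
    ]

    # Round-robin the queue across agents so both subscriptions stay busy.
    for i, task in enumerate(tasks):
        subprocess.run(runners[i % len(runners)](task), check=False)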

The issue for me is that I have no idea what the code looks like, so I need a reliable first-layer model that can summarize the current codebase state, letting me decide whether the next mutation moves the project forward or reduces technical debt. I can delegate much more that way, while Gemini's "do first" approach tends to result in many dead ends that I have to unravel.
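
That first layer can be as simple as piping the latest diff through a cheap model before deciding; a sketch assuming Claude Code's non-interactive -p flag, with an illustrative prompt:

    import subprocess

    # Summarize what the agent just changed before accepting it.
    diff = subprocess.run(["git", "diff", "HEAD~1"],
                          capture_output=True, text=True).stdout
    summary = subprocess.run(
        ["claude", "-p",
         "Summarize what this diff changes and any tech debt it adds:\n" + diff],
        capture_output=True, text=True,
    ).stdout
    print(summary)  # Human decides: keep the mutation or reset the worktree.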


The issue is that if it's struggling at times with basic instruction following, it's likely to be making insidious mistakes in large, complex tasks that you might not have the wherewithal or time to review.

The thing about good abstractions is that you should be able to trust them in a composable way. The simpler or more low-level the building blocks, the more reliable you should expect them to be. With LLMs you can't really make that assumption.


I'm not sure you can make that assumption even when a human wrote the code. LLMs are competing with humans, not with some abstraction.

> The issue is that if it's struggling at times with basic instruction following, it's likely to be making insidious mistakes in large, complex tasks that you might not have the wherewithal or time to review.

Yes, that's why we review all code even when written by humans.



