> ... I’ll keep pulling PRs locally, adding more git hooks to enforce code quality, and zooming through coding tasks—only to realize ChatGPT and Claude hallucinated library features and I now have to rip out Clerk and implement GitHub OAuth from scratch.
I don't get this: how many git hooks do you need to identify that Claude hallucinated a library feature? Wouldn't a single hook running your tests catch that?
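Something in `.git/hooks/pre-push` is all it should take. A rough sketch, assuming a pytest suite; swap in whatever your project's test command actually is:

```python
#!/usr/bin/env python3
# .git/hooks/pre-push  (make it executable: chmod +x .git/hooks/pre-push)
# Rough sketch: refuse the push if the test suite fails.
# Assumes a pytest suite; replace the command with your own test runner.
import subprocess
import sys

result = subprocess.run([sys.executable, "-m", "pytest", "-q"])
if result.returncode != 0:
    print("pre-push: tests failed, refusing to push", file=sys.stderr)
    sys.exit(1)
```

A hallucinated library function blows up the first time a test imports and calls it, so one hook like this would surface it before the code ever leaves your machine.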
I don't have a ton of tests. From what I've seen, Claude will often just update the tests into no-ops, so tests passing isn't trustworthy.
My workflow is often to plan with ChatGPT, and what I was getting at here is that ChatGPT can often hallucinate features of third-party libraries. I usually dump the plan from ChatGPT straight into Claude Code and only look at the details when I'm testing.
That said, I've become more careful about auditing the plans so I don't run into issues like this.
Tell Claude to use a code review sub agent after every significant change set, have the sub agent run the tests and evaluate the change set, don't tell it that Claude wrote the code, and give it strict review instructions. Works like a charm.
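The sub agent itself is basically just a prompt file. Roughly what mine looks like, assuming Claude Code's `.claude/agents/` markdown-with-frontmatter format (field names and tool names may vary by version, so treat this as a sketch, not gospel):

```markdown
---
name: code-reviewer
description: Independently reviews the most recent change set. Use after every significant change.
tools: Read, Grep, Glob, Bash
---
<!-- .claude/agents/code-reviewer.md; illustrative, adapt to your setup -->
You are a strict, independent code reviewer. You did not write this code
and you have no stake in it being approved.

For the change set you are given:
1. Run the test suite and report any failures verbatim.
2. Verify that every third-party API call actually exists in the installed
   version of the library; check the lockfile or docs, don't assume.
3. Flag tests that were weakened, skipped, or turned into no-ops.
4. List concrete defects with file and line references; do not summarize
   or praise the change.
```

The "you did not write this code" framing is the point of not telling it Claude wrote it: the reviewer persona stays adversarial instead of defending its own work.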
Yes. Go on ChatGPT, explain what you're doing (Claude Code, trying to get it to be more rigorous with itself and reduce defects), then click deep research and tell it you'd like it to look up code review best practices, AI code review, smells/patterns to look out for in AI code, etc. Then have it take the result of that and generate an XML-structured document with a flowchart of the code review best practices it discovered, cribbing from an established schema for element names/attributes when possible, and put it in fenced xml blocks in your subagent. You can also tell Claude Code to do deep research; you just have to be a little specific about what it should go after.
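For a sense of the shape, the output ends up being something like this (purely illustrative; the element names here are invented, not from any real schema, and yours will depend on what the research turns up):

```xml
<!-- purely illustrative review flowchart; element names are made up -->
<review-flow>
  <step id="1" name="build-and-test">
    <check>Run the full test suite; record failures verbatim.</check>
    <on-fail goto="report"/>
  </step>
  <step id="2" name="verify-dependencies">
    <check>Confirm every library API used exists in the pinned version.</check>
    <smell>Hallucinated or renamed functions from third-party libraries.</smell>
  </step>
  <step id="3" name="test-integrity">
    <smell>Tests rewritten to no-ops, broad exception swallowing, weakened assertions.</smell>
  </step>
  <step id="report" name="report">
    <check>List concrete defects with file and line references.</check>
  </step>
</review-flow>
```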
cool, can you think of any differences between a human engineer, who is presumably employed by an employer and subject to review and evaluation by a manager and inherently assumed to be capable of receiving feedback and reliably applying it on a go-forward basis to their future work, and an LLM, when they each make this same kind of mistake?
the difference between an arbitrary LLM and a human engineer is completely described by the salary you would pay to the human engineer? in all other dimensions they are indistinguishable? nice, super cool