
I started like this. Then I came around and can’t imagine going back.

It’s kinda like having a really smart new grad who works instantly and has memorized all the docs. Yes, I have to code review and guide it. That’s an easy trade-off to make for something that types 1000 tokens/s, never loses focus, and double-checks every detail in real time.

First: it really does save a ton of time for tedious tasks. My best example is test cases. I can write a method in 3 minutes, but Sonnet will write the 8 best test cases in 4 seconds, which would have taken me 10 mins of switching back and forth, looking at branches/errors, and mocking. I can code review and run these in 30s. Often it finds a bug. It’s definitely more patient than me in writing detailed tests.
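
To make “8 best test cases” concrete, here’s a made-up example of the kind of small method and edge-case tests I mean (hypothetical names, not actual model output):

    # Hypothetical example: a small helper plus the edge-case tests the model
    # churns out in seconds (empty input, exact multiple, remainder, oversized
    # chunk, invalid size).
    def chunk(items: list, size: int) -> list[list]:
        if size <= 0:
            raise ValueError("size must be positive")
        return [items[i:i + size] for i in range(0, len(items), size)]

    def test_empty_list():
        assert chunk([], 3) == []

    def test_exact_multiple():
        assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]

    def test_remainder_goes_in_last_chunk():
        assert chunk([1, 2, 3], 2) == [[1, 2], [3]]

    def test_size_larger_than_list():
        assert chunk([1], 5) == [[1]]

    def test_non_positive_size_raises():
        import pytest
        with pytest.raises(ValueError):
            chunk([1], 0)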

Instant and pretty great code review: it can understand what you are trying to do, find issues, and fix them quickly. Just ask it to review and fix issues.

Writing new code: it’s actually pretty great at this. I needed a util class for config that had fallbacks to config files, env vars, and defaults. And I wanted type checking to work on the accessors. Nothing hard, but it would have taken time to look at docs for YAML parsing, how to find the home directory, which env var API returns null vs. errors on blank, typing, etc. All easy, but it takes time. Instead I described it in about 20 seconds and it wrote it (with tests) in a few seconds.
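
For a sense of shape, a minimal sketch of what such a class can look like (the names, env-var scheme, and fallback order here are my own placeholders, not the generated code):

    # Sketch of a typed config accessor with fallbacks (placeholder names).
    # Assumed order: value from ~/.myapp/config.yaml -> MYAPP_* env var -> default.
    import os
    from pathlib import Path
    from typing import Any, Optional, Type, TypeVar

    import yaml  # PyYAML

    T = TypeVar("T")

    class Config:
        def __init__(self, path: Optional[Path] = None) -> None:
            path = path or Path.home() / ".myapp" / "config.yaml"
            self._data: dict[str, Any] = {}
            if path.exists():
                self._data = yaml.safe_load(path.read_text()) or {}

        def get(self, key: str, type_: Type[T], default: Optional[T] = None) -> Optional[T]:
            if key in self._data:
                return type_(self._data[key])
            env = os.environ.get(f"MYAPP_{key.upper()}")
            if env:  # treat blank env vars as unset
                return type_(env)
            return default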

It’s moved well past the “it can answer Stack Overflow questions” stage. If it has been a while (a while = 6 months in ML), try again with the new Sonnet 3.5.



> My best example is test cases. I can write a method in 3 minutes, but Sonnet will write the 8 best test cases in 4 seconds

For me it doesn't work. The generated tests either fail to run or fail when they do.

I work in large C# codebases and in each file I have lots of injected dependencies. I have one public method which can call lots of private methods in the same class.

The AI either doesn't properly mock the dependencies or ignores what happens in the private methods.

If I take a lot of time guiding it where to look, it can generate unit tests that pass. But it takes longer than if I write the unit tests myself.


For me it's the same. It's usually just some hallucinated garbage. All of these LLMs don't have the full picture of my project.

When I can give them isolated tasks like “convert X to Y” or “create a foo that does bar,” it's excellent, but for unit testing? Not even going to try anymore. I can write 5 unit tests manually that work in the time it takes to write 5 prompts that give me useless stuff I then have to fix by hand.

Why can't we have a LLM cache for a project just like I have a build cache? Analyze one particular commit on the main branch very expensively, then only calculate the differences from that point. Pretty much like git works, just for your model.
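
To make the build-cache analogy concrete, a toy sketch of what I have in mind (summarize_with_llm is a placeholder for whatever expensive per-file analysis you'd run; all names are hypothetical):

    # Toy sketch: cache expensive per-file analysis keyed by content hash, so only
    # files that changed since the last indexed commit get re-analyzed.
    import hashlib
    import json
    from pathlib import Path

    CACHE_FILE = Path(".llm_cache.json")

    def summarize_with_llm(text: str) -> str:
        raise NotImplementedError  # the expensive part

    def refresh_index(files: list[Path]) -> dict[str, str]:
        cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
        index = {}
        for f in files:
            text = f.read_text()
            digest = hashlib.sha256(text.encode()).hexdigest()
            entry = cache.get(str(f))
            if entry and entry["hash"] == digest:
                index[str(f)] = entry["summary"]    # unchanged file: reuse
            else:
                summary = summarize_with_llm(text)  # changed file: pay the cost once
                index[str(f)] = summary
                cache[str(f)] = {"hash": digest, "summary": summary}
        CACHE_FILE.write_text(json.dumps(cache))
        return index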


"It's usually just some hallucinated garbage. All of these LLM's don't have the full picture of my project."

Cursor can have the whole project in the context, or you can specify the specific files you want.


> Cursor can have whole project in the context

Depends on the size of the project. You can’t shove all of Google’s monorepo into an LLM’s context (yet).


I’m looking at 150000 lines of Swift divided over some local packages and the main app, excluding external dependencies


Do you have 150000 lines of Swift in YOUR context window?


I know how to find the context I need, being aided by the IDE and compiler. So yes, my context window contains all of the code in my project, even if it's not instantaneous.

It's not that hard to have an idea of what code is defined where in a project, since compilers have been doing that for over half a century. If I'm injecting protocols and mocks into a unit test, it shouldn't really be hard for a computer to figure out their definitions, unless they don't exist yet and I wasn't clear that they should be created, which would mean I'm giving the AI the wrong prompt and the error is on my side.


> Why can't we have a LLM cache for a project just like I have a build cache? Analyze one particular commit on the main branch very expensively

It's not just very expensive - it's prohibitively expensive, I think.


With Cursor you can specify which files it reads before starting. Usually I have to attach one or two to get an ideal one-shot result.

But yeah, I use it for unit testing, not integration testing.


Ask Cursor to write usage and mocking documentation for the most important injected dependencies, then include that documentation in your context. I’ve got a large tree of such documentation in my docs folder specifically for guiding AI. Cursor’s Notebook feature can bundle together contexts.

I use Cursor to work on a Rust Qt app that uses the main branch of cxx-qt so it’s definitely not in the training data, but Claude figures out how to write correct Rust code based on the included documentation no problem, including the dependency injection I do through QmlEngine.


Sounds interesting, what are you working on?

(Fellow Qt developer)


Same thing: https://news.ycombinator.com/item?id=40740017 :)

Just saw you published your block editor blog post. Look forward to reading it!


Haha, hi again!

Awesome! Would love to hear your thoughts. Any progress on your AI client? I'm intrigued by how many bindings to Qt there are. Recently, I got excited about a Mojo binding[1].

[1] https://github.com/rectalogic/mojo-qt


I’ve found it better at writing tests because it tests the code you’ve actually written vs. what you intended. I’ve caught logic bugs because it wrote tests with an assertion for a conditional that was backwards. The readable name of the test clearly pointed out that I was doing the wrong thing (and the test passed?).


Interesting. I’ve had the opposite experience (I invert or miss a condition, it catches it).

It probably comes down to model, naming, and context. Until Sonnet 3.5 my experience was similar to yours; after, it mostly “just works”.


That sounds more like a footgun than a desirable thing to be honest!


Maybe a TL;DR of all the issues I'm reading in this thread:

- It's gotten way better in the last 6 months, both models (Sonnet 3.5 and the new October Sonnet 3.5) and tooling (Cursor). If you last tried Copilot, you should probably give it another look. It's also going to keep getting better. [1]

- It can make errors, so expect to do some code review and guiding. However, the error rates are going way, way down [1]. I'd say it's already below humans for a lot of tasks. I'm often doing 2-3 iterations before applying a diff, but a quick comment like "close, keep the test cases, but use the test fixture at the top of the file to reduce repeated code" and 5 seconds is all it takes to get a full refactor. Compared to code-review turnaround with a team, it's magic.

- You need to learn how to use it. Setting the right prompts, adding files to the context, etc. I'd say it's already worth learning.

- It just knows the docs, and that's pretty invaluable. I know 10ish languages, which also means I don't remember the system call to get an env var in any of them. It does, and it can insert it a lot faster than I can google it. Again, you'll need to code review, but more and more it's nailing idiomatic error checking in each language.

- You don't need libraries for boilerplate tasks. zero_pad is the extreme/joke example, but a lot more of my code now just uses the standard libraries.

- It can do things other tools can't. Tell it to take the visual style of one blog post and port it to another. Tell it to use a test file I wrote as a style reference and update 12 other files to follow that style. Read the README and tests, then write pydocs for a library. Write a GitHub Action to build docs and deploy to GitHub Pages (including suggesting libraries, deploy actions, and offering alternatives). Again: you don't blindly trust anything, you code review, and tests are critical.

[1] https://www.anthropic.com/news/3-5-models-and-computer-use


Yes, it works for new code and simple cases. If you have large code bases, it doesn't have the context and you have to baby it, telling it which files and functions it should look into before attempting to write something. That takes a lot of time.

Yes, it can do simple tasks, like you said, writing a call to get the environment variables.

But imagine you work on a basket calculation service: you have base item prices, you have to apply discounts based on complicated rules, add various kinds of taxes for countries around the world, and use a different number of decimals for each country. Each of your classes calls 5 to 6 other classes, all with a lot of business logic behind them. Besides that, you also make lots of API calls to other services.

What will the AI do for you? Nothing, it will just help you write one liners to parse or split strings. For everything else it lacks context.


Are you suggesting you would inline all that logic if you hand-rolled the method? Probably not, right? You would have a high-level algorithm of easily understood parts. Why wouldn't the AI be able to 1) write that high-level algorithm and then 2) subsequently write the individual parts?


What's the logic here? "I haven't seen it so it doesn't exist?"

There are hundreds of available examples of it processing large numbers of files, and making correct changes across them. There are benchmarks with open datasets already linked in the thread [1]. It's trivial to find examples of it making much more complex changes than "one liners to parse or split strings".

[1] https://huggingface.co/datasets/princeton-nlp/SWE-bench


> Instant and pretty great code review: it can understand what you are trying to do, find issues, and fix them quickly. Just ask it to review and fix issues.

Cursor’s code review is surprisingly good. It’s caught many bugs for me that would have taken a while to debug, like off-by-one errors or improperly refactored code (like changing is_alive to is_dead and forgetting to negate conditionals).


> changing is_alive to is_dead and forgetting to negate conditionals

No test broke?


Tests don’t care what you name the variable
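
One way that slips past tests (a made-up sketch): if the careless rename flips only the name and leaves the expression and the branch alone, behavior is identical, so existing tests keep passing and only the misleading name gives it away:

    # Before the refactor:
    hp = 10
    is_alive = hp > 0
    if is_alive:
        print("still in the game")

    # After the careless rename (the expression and the branch should have been
    # flipped too):
    is_dead = hp > 0           # value still means "alive"; the name now lies
    if is_dead:                # reads as "if dead", still behaves as "if alive"
        print("still in the game")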


This “really smart new grad” take is completely insane to me, especially if you know how LLMs work. Look at this SQL snippet Claude (the new Sonnet) generated recently.

    -- Get recipient's push token and sender's username
    SELECT expo_push_token, p.username 
    INTO recipient_push_token, sender_username
    FROM profiles p
    WHERE p.id = NEW.recipient_id;

Seems like the world has truly gone insane and engineers are tuned into some alternate reality a la Fox News. Well… it’ll be a sobering day when the other shoe drops.


> it can understand

It can't understand. That's not what LLMs do.


This is a prompt I gave to o1-mini a while ago: My instructions follow now. The scripts which I provided you work perfectly fine. I want you to perform a change though. The image_data.pkl and faiss_index.bin are two databases consisting of rows, one for each image, in the end, right? My problem is that there are many duplicates: images with different names but the same content. I want you to write a script which for each row, i.e. each image, opens the image in python and computes the average expected color and the average variation of color, for each of the colors red, green and blue, and over "random" over all the pixels. Make sure that this procedure is normalized with respect to the resolution. Then once this list of "defining features" is obtained, we can compute the pairwise difference. If two images have less than 1% variation in both expectation and variation, then we consider them to be identical. in this case, delete those rows/images, except for one of course, from the .pkl and the .bin I mentioned in the beginning. Write a log file at the end which lists the filenames of identical images.

It wrote the script, I ran it, and it worked. I had it write another script that displays the duplicate groups it found, so I could see at a glance that the script had indeed worked. And for you this does not constitute any understanding? Yes, it is assembling pieces of code and algorithmic procedures it has memorized. But in this way it creates a script tailored to my wishes. The key is that it has to understand my intent.
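
For reference, roughly the kind of script that prompt asks for (my own sketch with made-up names; dropping the matching rows from the .pkl and the FAISS index is omitted here):

    # Per-image mean and standard deviation of R, G, B, normalized to [0, 1] so the
    # features are resolution-independent; pairs within 1% on all six numbers are
    # treated as duplicates.
    import numpy as np
    from PIL import Image

    def features(path: str) -> np.ndarray:
        px = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64) / 255.0
        mean = px.reshape(-1, 3).mean(axis=0)  # average color per channel
        std = px.reshape(-1, 3).std(axis=0)    # average variation per channel
        return np.concatenate([mean, std])

    def find_duplicates(paths: list[str], tol: float = 0.01) -> list[tuple[str, str]]:
        feats = {p: features(p) for p in paths}
        dupes = []
        for i, a in enumerate(paths):
            for b in paths[i + 1:]:
                if np.all(np.abs(feats[a] - feats[b]) < tol):
                    dupes.append((a, b))  # candidates to log and prune
        return dupes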


Does "it understands" just mean "it gave me what I wanted?" If so, I think it's clear that that just isn't understanding.

Understanding is something a being has or does. And understanding isn't always correct. I'm capable of understanding. My calculator isn't. When my calculator returns a correct answer, we don't say it understood me -- or that it understands anything. And when we say I'm wrong, we mean something different from what we mean when we say a calculator is wrong.

When I say LLMs can't understand, I'm saying they're no different, in this respect, from a calculator, WinZip when it unzips an archive, or a binary search algorithm when you invoke a binary-search function. The LLM, the device, the program, and the function boil down (or can) to the same primitives and the same instruction set. So if LLMs have understanding, then necessarily so do a calculator, WinZip, and a binary-search algorithm. But they don't. Or rather we have no reason to suppose they do.

If "it understands" is just shorthand for "the statistical model and program were designed and tuned in such a way that my input produced the desired output," then "understand" is, again, just unarguably the wrong word, even as shorthand. And this kind of shorthand is dangerous, because over and over I see that it stops being shorthand and becomes literal.

LLMs are basically autocorrect on steroids. We have no reason to think they understand you or your intent any more than your cell phone keyboard does when it guesses the next character or word.


When I look at an image of a dog on my computer screen, I don't think that there's an actual dog anywhere in my computer. Saying that these models "understand" because we like their output is, to me, no different from saying that there is, in fact, a real, actual dog.

"It looks like understanding" just isn't sufficient for us to conclude "it understands."


I think the problem is our traditional notions of "understanding" and "intelligence" fail us. I don't think we understand what we mean by "understanding". Whatever the LLM is doing inside, it's far removed from what a human would do. But on the face of it, from an external perspective, it has many of the same useful properties as if done by a human. And the LLM's outputs seem to be converging closer and closer to what a human would do, even though there is still a large gap. I suggest the focus here shouldn't be so much on what the LLM can't do but on the speed at which it is getting better.


I think there is only one thing we should focus on: measurable capability on tasks. Understanding, memorization, reasoning, etc. are all just shorthands we use to quickly convey an idea of capability on a kind of task. One can also attempt to describe mechanistically how the model works, but that is very difficult; that is where you would try to describe your sense of "understanding" rigorously. To keep it simple: I think when you say the LLM does not understand, what you must really mean is that you reckon its performance will quickly decay as the task gets more difficult along various dimensions (depth/complexity, verifiability of the result, length/duration/context size), to a degree where it is still far from being able to act as a labor-delivering agent.


Brains can’t understand either; that’s not what neurons do.


We experience our own minds and we have every reason to think that our minds are a direct product of our brains.

We don't have any reason to think that these models produce being, awareness, intention, or experience.


What is the best workflow to code with an AI?

Copy and paste the code to the Claude website? Or use an extension? Or something else?


Cursor. Mostly chat mode. Usually adding 1-2 extra files to the context before invoking, and selecting the relevant section for extra focus.


I personally use Copilot, which is integrated into my IDE; it's almost identical to this Cursor example.


Copilot is about as far away from Cursor with Claude as the Wright Brothers' glider is to the Saturn V.


Not based on the link; I didn't see anything in that text that I can't do with Copilot, or that looked better to me than what Copilot outputs.


Does Copilot do multi-file edits now?


Copilot Edits is a beta feature that can perform multi-file edits.


Another fun example from yesterday: I pasted a blog post written in Markdown into an HTML comment. Selected it and told Sonnet to convert it to HTML using another blog post as a style reference.

Done in 5 seconds.


And how do you trust that it didn't just alter or omit some sentences from your blog post?

I just use Pandoc for that purpose and it takes 30 seconds, including the time to install pandoc. For code generation where you'll review everything, AI makes sense; but for such conversion tasks, it doesn't because you won't review the generated HTML.


> it takes 30 seconds, including the time to install pandoc

On some speedrunning competition, maybe? Just tested on my work machine: `sudo apt-get install pandoc` took 11 seconds to complete, and it was this fast only because I already had all the dependencies installed.

Also I don't think you'll be able to fulfill the "using another blog post as a style reference" part of GP's requirements - unless, again, you're some grand-master Pandoc speedrunner.

Sure, AI will make mistakes with such conversion tasks. It's not worth it if you're going to review everything carefully anyway. In code, fortunately, you don't have to - the compiler is doing 90% of the grunt work for you. In writing, depends on context. Some text you can eyeball quickly. Sometimes you can get help from your tool.

Literally yesterday I back-ported a CV from English to Polish via Word's Translation feature. I could've done it by hand, but Word did 90% of it correctly, and fixing the remaining issues was a breeze.

Ultimately, what makes LLMs a good tool for random conversions like these is that it's just one tool. Sure, Pandoc can do GP's case better (if inputs are well-defined), but it can't do any of the 10 other ad-hoc conversions they may have needed that day.


Installing pandoc is basically a one-time cost that is amortized over its uses, so... why worry about it?

Relying on the compiler to catch every mistake is a pretty limited strategy.


> Installing pandoc is basically a one-time cost that is amortized over its uses, so... why worry about it?

Because the space of problems that today's LLMs solve well with trivial prompts is vast, far greater than what any single classical tool covers. If you're comparing solutions to 100 random problems, you have to count those one-time costs, because you'll need some 50-100 different tools to get through them all.

> Relying on the compiler to catch every mistake is a pretty limited strategy.

No, you're relying on the compiler to catch every mistake that can be caught mechanically - exactly the kind of thing humans suck at. It's kind of the entire point of errors and warnings in compilers, or static typing for that matter.


No, if you are having an LLM generate code that you are not reviewing, you are relying on the compiler 100%. (Or the runtime, if it isn't a compiled language.)


Who said I'm not reviewing? Who isn't reviewing LLM code?


Re: trust. It just works using Sonnet 3.5; it's gained my trust. I do read it after (again, I'm more in a code-reviewer role). People make mistakes too, and I think its error rate for repetitive tasks is below most people's. I also learned how to prompt it. I'd tell it to just add formatting without changing content in the first pass, then in a separate pass ask it to fix spelling/grammar issues. The diffs are easy to read.

Re: Pandoc. Sure, if that were the only task I used it for. But I use it for 10 different ones per day (write a JSON schema for this JSON file, write a Pydantic validator that does X, write a GitHub workflow doing Y, add syntax highlighting to this JSON, etc.). Re: this specific case - I prefer real HTML using my preferred tools (DaisyUI + Tailwind) so I can edit it after. I find myself using a lot fewer boilerplate-saving libraries and knowing a few tools more deeply.


Why are you comparing its error rate for repetitive tasks with people's? For such mechanical tasks we already have fully deterministic algorithms, and the error rate of those traditional algorithms is zero. You wouldn't usually ask a junior assistant to do such conversions by hand, so it doesn't make sense to compare its error rate with humans'.

Normalizing this kind of computer errors when there should be none makes the world a worse place, bit by bit. The kind of productivity increase you get from here does not seem worthwhile.


The OP said they had it use another HTML page as a style reference. Pandoc couldn't do that. Just like millions of other specific tasks.


That's just a matter of copying over some CSS. It takes the same effort as copying the output of the AI, so it doesn't even take extra time.


Applying the style of B to A is not deterministic, nor are there prior tools that could do it.


You also didn't factor in the time to learn Pandoc (and to relearn it if you haven't used it lately). This is also just one of many daily use cases for these tools. The time it takes to learn how to use a dozen tools like this adds up when an LLM can just do them all.


This is actually how I would use AI: if I forgot how to do a conversion task, I would ask the AI to tell me the command so that I can run it without having to jog my memory first. The pandoc command is literally one line with a few flags; it's easily reviewable. Then I run pandoc myself. Same thing with the multitude of other rarely used but extremely useful tools such as jq.

In other words, I want AI to help me with invoking other tools to do a job rather than doing the job itself. This nicely sidesteps all the trust issues I have.


I do that constantly. jq's syntax is especially opaque to me. "I've got some JSON formatted like <this>. Give me a jq command that does <that>."

Google, but better.


This


> And how do you trust that it didn't just alter or omit some sentences from your blog post?

How do you trust a human in the same situation? You don't, you verify.


What? Is this a joke? Have you actually worked with human office assistants? The whole point of human assistants is that you don't need to verify their work. You hire them with a good wage and you trust that they are working in good faith.

It's disorienting for me to hear that some people are so blinded by AI assistants that they no longer know how human assistants behave.


It appears the OP has had a different experience. Each human assistant is different.



