More

frabonacci · 2025-12-16T16:00:49 1765900849

We spent the last few months trying to understand why computer-use agents (Claude Computer-Use, OpenAI CUA, Gemini 2.5 Computer-Use) fail so inconsistently.

The pattern we kept seeing: same agent, same task, different OS theme = notably different results.

Claude Sonnet 4 scores 31.9% on OSWorld and Windows Agent Arena (2 of the most relevant benchmarks for computer-use agents) — but with massive variance. An agent trained on Windows 11 light mode fails on dark mode. Works on macOS Ventura, breaks on Monterey. Works on Win11, collapses on Vista.

The root cause: training data lacks visual diversity. Current benchmarks (OSWorld, Windows Agent Arena) rely on static VM snapshots with fixed configurations. They don't capture the reality of diverse OS themes, window layouts, resolution differences, or desktop clutter.

We built cua-bench — HTML-based simulated environments that render across 10+ OS themes (macOS, Win11, WinXP, Win98, Vista, iOS, Android). Define a task once, generate thousands of visual variations.

This enables: - Oracle trajectory generation via a Playwright-like API (verified ground truth for training) - Trajectory replotting: record 1 demo → re-render across 10 OS themes = 10 training trajectories

The technical report covers our approach to trajectory generation, Android/iOS environments, cross-platform HTML snapshots, and a comparison with existing benchmarks.

We’re currently working with research labs on training data generation and benchmarks, but we’d really value input from the HN community: - What tasks or OS environments should be standardized to actually stress computer-use agents? - Legacy OSes? Weird resolutions? Broken themes? Cluttered desktops? Modal hell?

Curious what people here think are the real failure modes we should be benchmarking.

someguy101010 · 2025-12-16T16:31:03 1765902663

as an infrastructure engineer the idea of being able to train computer use agents without provisioning infrastructure sounds amazing!

a common use case i run into is i want to be able to configure corporate vpn software on windows machines. is there a link for a getting started guide i could try this out with?

frabonacci · 2025-12-16T16:37:05 1765903025

Yes, in a simulated environment you can do this today using plain JS and connecting to a real VPN, while driving the desktop UI. No infra provisioning needed.

If you need a real Windows OS + corporate VPN, we also support binding agents to actual Windows sandboxes. This example shows automating a Windows app behind a VPN: https://cua.ai/docs/example-usecases/windows-app-behind-vpn

you'll need to define a new task in the cua-bench registry first though - just sign up on the website for early access!

frabonacci · 2025-12-14T20:02:25 1765742545

The author's conclusion feels even more relevant today: AI automation doesn’t really remove human difficulty—it just moves it around, often making it harder to notice and more risky. And even after a human steps in, there’s usually a lot of follow-up and adjustment work left to do. Thanks for surfacing these uncomfortable but relevant insights

bitwize · 2025-12-14T20:08:23 1765742903

Sanchez's Law of Abstraction comes to mind: https://news.ycombinator.com/item?id=22601623

frabonacci · 2025-12-12T07:30:55 1765524655

Same here. I even blamed it on switching between Italian and Spanish all the time and thought my brain was short-circuiting. But when you see the right key light up and a different letter shows up, something’s clearly off. Also: with battery saver on it’s basically unusable - the lag makes typing way worse. The video was oddly comforting. Turns out I’m not losing it.

frabonacci · 2025-09-26T18:53:17 1758912797

Yeah, a lot of these corporate hackathons are basically just lead gen in disguise. "Use our SaaS product, maybe we’ll give you a t-shirt." They're more about getting conversions than actually teaching anything useful to the students.

frabonacci · 2025-09-25T20:44:15 1758833055

This is a nice first step - web search makes sense, and it’s easy to imagine other tools being added next: filesystem, browser, maybe even full desktop control. Could turn Ollama into more than just a model runner. Curious if they’ll open up a broader tool API for third-party stuff too

frabonacci · 2025-09-25T00:14:49 1758759289

Duplicate https://news.ycombinator.com/item?id=45366942

frabonacci · 2025-09-24T23:56:16 1758758176

Also this comes just a couple of days after a similar incident affected all of Spain

yomismoaqui · 2025-09-25T00:34:16 1758760456

Are you refering to the blocking of Cloudflare when La Liga matches are played? That affects sites that use Cloudflare, but it's not the fault of Dockerhub.

frabonacci · 2025-09-23T14:40:22 1758638422

> Our vision is simple: we want to create a factory that can produce a gigawatt of new AI infrastructure every week.

The moat will be how efficiently you convert electricity into useful behavior. Whoever industrializes evaluation and feedback loops wins the next decade.

frabonacci · 2025-09-16T21:56:45 1758059805

Microsoft has a dozen vertical Copilots to build, so picking the model with the best capability today makes sense. If Claude Code is stronger for dev productivity, using it in VS Code just raises the bar for everything else they ship

frabonacci · 2025-09-15T20:59:36 1757969976

Really thoughtful piece. It reminds me of how Angular once dominated by default, until its complexity and inertia gave space for React. The same dynamic could be repeating now - React’s network effects create stability, but also risk suffocating innovation