Hacker News | sly010's comments

There is a 3rd party Android app that uses the accessibility APIs to (supposedly) track and limit my short video use. However, it's broken, so I can't watch short videos at all :)


We get it. You are a manager.


I've seen this image generated by Meta AI. The prompt was something like: think of a room, make it look like anything you like, but do not under any circumstances put a clown in it. Guess what...

I think Jason has a "do not think of an elephant" problem.


Sorry for the snark, but we couldn't even do this for humans, yet let's do it for poor, poor LLMs? It's kind of ironic that NOW is the time we worry about usability. What happened to RTFM?


Genuine question: How does this work? How does an LLM do object detection? Or more generally, how does an LLM do anything that is not text? I always thought tasks like this were usually just handed off to another (i.e. vision) model, but the post talks about it as if it's the _same_ model doing both text generation and vision. It doesn't make sense to me why Gemini 2 and 2.5 would have different vision capabilities; shouldn't they both have access to the same purpose-trained, state-of-the-art vision model?


You tokenize the image and then pass it through a vision encoder that is generally trained separately from large-scale pretraining (using, say, contrastive captioning) and then added to the model during RLHF. I wouldn't be surprised if the vision encoder is used in pretraining now too; that would be a different objective than next-token prediction, of course (unless they use something like next-token prediction for images, which I don't think is the case).

Different models have different encoders; they are not shared, since the datasets vary across models and even model sizes. So performance between models will vary.

What you seem to be thinking is that text models were simply calling an API to a vision model, similar to tool use. That is not what's happening; it is much more built in: the forward pass goes through the vision architecture into the language architecture. Robotics research has been doing this for a while.
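
To make this concrete, here is a minimal, made-up sketch (toy modules and tiny dimensions, not any real model's architecture) of how projected image tokens and text tokens end up in the same forward pass:

  # Toy VLM: a vision encoder feeds image tokens into the same
  # sequence the language blocks process. All sizes are arbitrary.
  import torch
  import torch.nn as nn

  class TinyVLM(nn.Module):
      def __init__(self, d_model=256, vocab=1000, patch_dim=3 * 16 * 16):
          super().__init__()
          # Stand-in for a ViT-style encoder (often trained separately,
          # e.g. contrastively) that turns image patches into features.
          self.vision_encoder = nn.Sequential(
              nn.Linear(patch_dim, d_model),
              nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
          )
          # Projection into the language model's embedding space.
          self.projector = nn.Linear(d_model, d_model)
          self.tok_embed = nn.Embedding(vocab, d_model)
          self.lm_block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
          self.lm_head = nn.Linear(d_model, vocab)

      def forward(self, image_patches, text_ids):
          img_tokens = self.projector(self.vision_encoder(image_patches))
          txt_tokens = self.tok_embed(text_ids)
          # One forward pass over the concatenated sequence: the language
          # blocks attend to image tokens exactly like text tokens.
          fused = torch.cat([img_tokens, txt_tokens], dim=1)
          return self.lm_head(self.lm_block(fused))

  patches = torch.randn(1, 64, 3 * 16 * 16)   # 64 flattened 16x16 RGB patches
  prompt = torch.randint(0, 1000, (1, 10))    # 10 text tokens
  print(TinyVLM()(patches, prompt).shape)     # torch.Size([1, 74, 1000])

The only point of the toy is the torch.cat: once the image features are projected, the language side treats them as just more tokens.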


They might use YouTube; there's next-frame prediction and multimodal grounding via subtitles and audio available.

IIUC they got the native voice2voice models trained on YT-sourced audio. Skipping any intermediate text form is really helpful for fuzzy speech, such as people slurring or mumbling words. Also, having access to a full world model during voice deciphering obviously helps with very context-heavy situations, such as (spoken/Kana/phonetic) Japanese (which relies on human understanding of context to parse homophones, and on non-phonetic Han (Kanji) in writing to make up for the inability to interject clarification).


> I always thought tasks like this were usually just handed off to another (i.e. vision) model, but the post talks about it as if it's the _same_ model doing both text generation and vision.

Most vision LLMs don't actually use a separate vision model. https://huggingface.co/blog/vlms is a decent explanation of what's going on.

Most of the big LLMs these days are vision LLMs - the Claude models, the OpenAI models, Grok and most of the Gemini models all accept images in addition to text. To my knowledge none of them are using tool calling to a separate vision model for this.

Some of the local models can do this too - Mistral Small and Gemma 3 are two examples. You can tell they're not tool calling to anything because they run directly out of a single model weights file.
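
To illustrate the "no separate vision model" point: with a typical chat-completions-style API, the image goes into the same request as the text and the same model answers. A minimal sketch (the model name, file path, and prompt are just placeholders):

  # Minimal sketch: image + text in one request to one multimodal model.
  # Requires the openai Python package; names below are placeholders.
  import base64
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  image_b64 = base64.b64encode(open("photo.jpg", "rb").read()).decode()

  response = client.chat.completions.create(
      model="gpt-4o",  # any vision-capable model
      messages=[{
          "role": "user",
          "content": [
              {"type": "text", "text": "List the objects in this image."},
              {"type": "image_url",
               "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
          ],
      }],
  )
  print(response.choices[0].message.content)

There is no tool call in there; the image tokens are consumed by the same forward pass that produces the answer.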


Not a contradiction to anything you said, but O3 will sometimes whip up a Python script to analyse the pictures I give it.

For instance, I asked it to compute the symmetry group of a pattern I found on a wallpaper in a Lebanese restaurant this weekend. It realised it was unsure of the symmetries and used a Python script to rotate and mirror the pattern and compare it to the original to check the symmetries it suspected. Pretty awesome!
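
For the curious, the check it scripted boils down to something like this (file name and tolerance are made up, and it assumes a square, centred crop of the pattern):

  # Rough sketch: test candidate symmetries by transforming the image
  # and comparing it to the original. Tolerance is arbitrary.
  import numpy as np
  from PIL import Image

  orig = np.asarray(Image.open("wallpaper_pattern.png").convert("L"), dtype=float)

  def matches(candidate, tol=10.0):
      # Mean absolute pixel difference against the original.
      return np.mean(np.abs(candidate - orig)) < tol

  for k in (1, 2, 3):
      print(f"rotation by {90 * k} degrees:", matches(np.rot90(orig, k)))
  print("horizontal mirror:", matches(np.fliplr(orig)))
  print("vertical mirror:", matches(np.flipud(orig)))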


It used to be done that way, but newer multimodal LLMs train on a mix of image and text tokens, so they don’t need a separate image encoder. There is just one model that handles everything.


If you have 20 minutes, this is a very good video

https://www.youtube.com/watch?v=EzDsrEvdgNQ


tokens are tokens


Nah. You misunderstood. "They" don't make money on human time wasted. They make money on ads served. They don't particularly care whether the ads were served to humans or agents; they get paid either way. Bot traffic is actually good for tech companies because it inflates numbers. Captchas are not there to waste our time; they are there to improve their credibility ("We are certain those ad clicks were real humans because the captcha said so").


Plenty of apps that don't have ads nevertheless chase "engagement" and will do everything possible to thwart automated/efficient usage.


First off, I always thought the kind of things described (tracking mouse movements, keypress jitter, etc.) are already done by reCAPTCHA to decide when to present the user with a captcha (a toy sketch of the idea is below). I am surprised they are not already doing this.

Second, I am surprised AI agents are this naive. I thought they would emulate human behavior better.

In fact, just based on this article, very little effort has been put into this race on either side.

So I wonder if it has to do with the fact that if companies like Google reliably filtered out bot traffic, they would lose 90% of their ad revenue. This way they have plausible deniability.
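
To illustrate the mouse-movement point above, here is a toy version of that kind of behavioural signal (the sample events and threshold are invented): scripted pointer movement tends to have near-zero speed variation, while real movement is jittery.

  # Toy behavioural-bot-detection signal: coefficient of variation of
  # pointer speed over one movement. The data below is synthetic.
  import numpy as np

  def jitter_score(events):
      """events: list of (t, x, y) pointer samples."""
      t, x, y = (np.array(c, dtype=float) for c in zip(*events))
      dt = np.maximum(np.diff(t), 1e-6)
      speed = np.hypot(np.diff(x), np.diff(y)) / dt
      return np.std(speed) / (np.mean(speed) + 1e-6)

  human_like = [(i * 0.016, 3 * i + np.random.randn(), 100 + 2 * np.random.randn())
                for i in range(50)]
  bot_like = [(i * 0.016, 3.0 * i, 100.0) for i in range(50)]

  print("human-ish jitter:", round(jitter_score(human_like), 3))  # clearly > 0
  print("scripted jitter: ", round(jitter_score(bot_like), 3))    # ~0

A real system would combine many such features (timing, curvature, keypress cadence) rather than threshold one number.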


They were very proud of this mouse-movement stuff when desktop was 70% of traffic. It's not worth as much investment as it's been given, since there's no group limiting people to one HID method and removing accessibility from the world.


Math has a PR problem. The weight being non-uniform makes this a little unsurprising to a non-mathematician; it's a bit like a wire "sphere" with a weight attached on one side, but a low-poly version. Giving it a "skin" would make this look more impressive.


It appears unsurprising because it is unsurprising.


And just like optimizing compilers, LLMs also emit code that is difficult to verify and that no one really understands, so when the shit hits the fan you have no idea what's going on.


Is it though? Most code that an LLM emits is easier to understand than equivalent code by humans, in my experience, helped by the overt amount of comments added at every single step.

That's not to say the output is correct; there are usually bugs and unnecessary stuff if the logic generated isn't trivial, but reading it isn't the biggest hurdle.

I think you are referring to the situation where people just don't read the generated code at all; in that case it's not really the LLM's fault.


> Most code that an LLM emits is easier to understand than equivalent code by humans, in my experience

Even if this were true, which I strongly disagree with, it actually doesn't matter whether the code is easier to understand.

> I think you are referring to the situation where people just don't read the generated code at all; in that case it's not really the LLM's fault

It may not be the LLM's "fault", but the LLM has enabled this behavior, and therefore the LLM is the root cause of the problem.


Very cool, but by CSS-skewing (skewY(-6deg)) the canvas at the last moment, you introduced aliasing on the border between the canvas and the rest of the page, which kills the vibe. The browser can't automatically blend the canvas with the rest of the page. It's noticeable even on a brand-new Retina display. Maybe you could keep your canvas square and introduce the skew in the shader.


The funny thing is, as far as I know, skewY is a virtual draw command in the WebKit family of rendering engines.

It's "in the shader" already. For whatever reason, your browser's compositor is failing to anti-alias the rendering bounds of the canvas.

I don't know why, though. I don't see the issue in Safari on my system.


As a workaround, you can add a transparent border (border: 2px solid transparent) around the skewed element to get antialiasing (at least on Chrome).


Guess it depends on the browser, as it looks sharp and free of aliasing for me, including when zooming in (Opera on Android).


  - Safari: decent but still obviously present
  - Chrome: quite bad looking
  - Firefox: something in between
(tested on macOS)

