Has anyone built something like this using accessibility APIs instead of (or in addition to) OCR? It seems like a waste to OCR everything when you could just get the text directly from the accessibility APIs. Also seems like potentially a good way to connect LLMs to UIs, and something like this would be the way to collect the training data.
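On macOS at least, the AX* APIs in ApplicationServices get you a long way toward exactly that. A rough sketch of walking an app's accessibility tree and pulling out the visible text (just a sketch: the function names are mine, the attribute strings are the standard AX ones, and it assumes the Accessibility permission has already been granted):

    import ApplicationServices

    // Sketch: walk an app's accessibility tree and collect its visible text.
    // Assumes AXIsProcessTrusted() is true (Accessibility permission granted)
    // and that `pid` is the target app's process id.
    func collectText(from element: AXUIElement, into out: inout [String]) {
        func attribute(_ name: String) -> CFTypeRef? {
            var value: CFTypeRef?
            let err = AXUIElementCopyAttributeValue(element, name as CFString, &value)
            return err == .success ? value : nil
        }

        // Standard AX attribute names; title/value cover most labels,
        // buttons, and text fields.
        for name in ["AXTitle", "AXValue", "AXDescription"] {
            if let text = attribute(name) as? String, !text.isEmpty {
                out.append(text)
            }
        }

        // Recurse into children; apps with poor accessibility support often
        // return an empty or incomplete child list here.
        if let children = attribute("AXChildren") as? [AXUIElement] {
            for child in children {
                collectText(from: child, into: &out)
            }
        }
    }

    func textForApp(pid: pid_t) -> [String] {
        var out: [String] = []
        collectText(from: AXUIElementCreateApplication(pid), into: &out)
        return out
    }

Windows has an analogous tree via UI Automation, but as others note in this thread, the two don't map onto each other cleanly.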
Dragon NaturallySpeaking supports voice commands like "click OK" and responds accordingly. Its solution to the problem of Microsoft Office doing its own custom widget rendering was to OCR the text on widgets and buttons to determine their labels. You need something like this far, far more often than you think you do. Developers will flummox you; they will NOT use the provided APIs.
We've done a bit of both for our searchable, Loom-like screen recorder. The problem is that the accessibility APIs differ greatly between Mac and Windows if you want to be OS agnostic, and even on Windows apps all tend to do things a little differently, which makes it hard to say what you actually "saw"; some apps are missing key data or implement it incorrectly. OCR ends up being easier a lot of the time, despite our expectation that accessibility would be.
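For the OCR side on macOS specifically, Apple's Vision framework does most of the work. A minimal sketch, assuming you already have a CGImage of the captured frame (e.g. from ScreenCaptureKit); something like this is what you fall back to when an app's accessibility tree comes back empty or obviously wrong:

    import CoreGraphics
    import Vision

    // Sketch of the OCR path, assuming `image` is a CGImage of a captured frame.
    func recognizeText(in image: CGImage) -> [String] {
        let request = VNRecognizeTextRequest()
        request.recognitionLevel = .accurate  // slower, but better for small UI text

        let handler = VNImageRequestHandler(cgImage: image, options: [:])
        guard (try? handler.perform([request])) != nil else { return [] }

        // Keep the top candidate string for each detected text region.
        return (request.results ?? []).compactMap { $0.topCandidates(1).first?.string }
    }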
For sure, we made a privacy tradeoff to do it server-side (given some screen-change delta) because of this. Accessibility is a good "in addition to," but there are just so many apps that don't handle it well.
I've built a workflow recorder with a screen history (MVP).
I concluded that if this turns out to be a viable approach, Microsoft or Apple will build it into their OS natively, as part of a copilot that remembers everything and assists the user with that knowledge.
My screen history was not as advanced as the app mentioned here though. And I didn't use it myself.
This is what I added to my macOS app recently: foreground app metadata. It's displayed on the timeline; you can see it in the pictures on my website (screenmemory.app). For my use case it was night and day in UX.
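On macOS the basic metadata comes from NSWorkspace; roughly this shape (the names here are illustrative, not the app's actual model, and pulling the focused window title on top of this needs the Accessibility permission):

    import AppKit

    // Sketch of capturing foreground-app metadata next to each frame.
    struct FrameMetadata {
        let appName: String?
        let bundleID: String?
        let capturedAt: Date
    }

    func frontmostAppMetadata() -> FrameMetadata {
        let app = NSWorkspace.shared.frontmostApplication
        return FrameMetadata(appName: app?.localizedName,
                             bundleID: app?.bundleIdentifier,
                             capturedAt: Date())
    }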