Lots of OCR models and LLMs (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML: tables with multiple headers and merged cells get mixed up, multiple columns of tick boxes get mixed up, and multi-page tables are not understood correctly. LlamaIndex also fails miserably on these.
I can only get this table parsed correctly by first transcribing the table headers into HTML by hand and supplying them as example output. Even then, it still mixes up the tick boxes. Full table examples:
https://www.easa.europa.eu/en/icao-compliance-checklist
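Roughly what I mean by priming with the headers, as a sketch against an OpenAI-compatible vision API (the model name, file name, prompt, and header markup are all illustrative, not my exact setup):

```python
# Hypothetical sketch of the header-priming trick: hand the model the
# hand-written header HTML so it only has to fill in the body rows.
import base64
from openai import OpenAI

client = OpenAI()

with open("checklist_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Hand-transcribed header, including the merged cells models tend to garble.
HEADER_HTML = """<table>
  <thead>
    <tr><th rowspan="2">Ref.</th><th rowspan="2">Standard</th><th colspan="2">Compliance</th></tr>
    <tr><th>Yes</th><th>No</th></tr>
  </thead>"""

resp = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Convert the table in this image to HTML. The header is "
                "already transcribed below; continue from it and output only "
                "the <tbody> rows. Render each tick box cell as the literal "
                "word 'checked' or 'unchecked'.\n\n" + HEADER_HTML
            )},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```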
> Lots of OCR models and LLMs (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML:
But that's something else; that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "can take letters in images and turn them into digital text" to "can replicate anything seen on a screen", the problem space gets too big.
For those images you have, I'd use something like Magistral + Structured Outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data.
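Something along these lines, as a rough sketch against an OpenAI-compatible endpoint; the model name, prompts, and schema are illustrative, not a tested pipeline:

```python
# Two-pass extraction sketch: pass 1 discovers the table's structure,
# pass 2 extracts the data constrained to that structure.
import json
from openai import OpenAI

client = OpenAI()
IMAGE = {"type": "image_url", "image_url": {"url": "https://example.com/table.png"}}  # placeholder

# Pass 1: ask only for the structure -- the column names, nothing else.
structure = client.chat.completions.create(
    model="gpt-4o",  # stand-in for whichever vision model you use
    messages=[{"role": "user", "content": [
        {"type": "text", "text": 'Return a JSON object {"columns": [...]} '
                                 "listing this table's column names, in order."},
        IMAGE,
    ]}],
    response_format={"type": "json_object"},
)
cols = json.loads(structure.choices[0].message.content)["columns"]

# Pass 2: extract every row, constrained to the schema discovered in pass 1.
schema = {
    "type": "object",
    "properties": {"rows": {"type": "array", "items": {
        "type": "object",
        "properties": {c: {"type": "string"} for c in cols},
        "required": cols,
        "additionalProperties": False,
    }}},
    "required": ["rows"],
    "additionalProperties": False,
}
data = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Extract every row of this table."},
        IMAGE,
    ]}],
    response_format={"type": "json_schema",
                     "json_schema": {"name": "table", "strict": True, "schema": schema}},
)
print(json.loads(data.choices[0].message.content)["rows"])
```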
> But that's something else; that's no longer just OCR ("Optical Character Recognition").
Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.
It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents where wider context may be important. But saying it's "not OCR" doesn't seem meaningful from a technical perspective. It's an extension of the same goal to convert images of documents into the most accurate and useful digitized form with the least manual intervention.
Personally I think there's a meaningful distinction between "can extract text" vs. "can extract text and structure". It's true that some OCR systems try to replicate the structure, but even today I think that's the exception, not the norm.
Not to mention it's helpful to separate the two because there is such a big difference in the difficulty of the tasks.
I threw the first image/table into Gemini 2.5 Pro, letting it choose the output format, and it looks like it extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I didn't specify a preference.
This particular one may not work on M chips, but the model itself does. I just tested a different-sized version of the same model in LM Studio on a MacBook Pro (64 GB M2 Max, 12 cores), just to see.
Prompt: Create a solar system simulation in a single self-contained HTML file.
qwen3-next-80b (MLX format, 4-bit, 44.86 GB): 42.56 tok/sec, 2523 tokens, 12.79s to first token
- note: looked like ass, simulation broken, didn't work at all.
Then as a comparison for a model with a similar size, I tried GLM.
GLM-4-32B-0414 (MLX format, 8-bit, 36.66 GB): 9.31 tok/sec, 2936 tokens, 4.77s to first token
- note: looked fantastic for a first try, everything worked as expected.
Not a fair comparison, 4-bit vs 8-bit, but it's some data. The tok/sec on a Mac is pretty good depending on the model you use.
I haven't tested on Apple machines yet, but gpt-oss and qwen3-next should work, I assume. The Llama3 versions use CUDA-specific loading logic for a speed boost, so those won't work for sure.
Nice, but can somebody tell me whether this performs better than my simple Postgres MCP run via npx? My current setup has the LLM search through my local Postgres in multiple steps. I guess pgmcp does those steps in the background and returns only the final result to the LLM calling the MCP tool?
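For context, the "simple" baseline I mean is roughly this, sketched with the official MCP Python SDK (FastMCP); the DSN is a placeholder, and it's deliberately a single read-only query tool with the LLM doing all the exploration:

```python
# Minimal single-tool Postgres MCP server: the LLM drives the multi-step
# exploration itself by issuing one query at a time.
import psycopg2
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("postgres")
DSN = "postgresql://localhost/mydb"  # placeholder connection string

@mcp.tool()
def query(sql: str) -> str:
    """Run a read-only SQL query and return the rows as text."""
    with psycopg2.connect(DSN) as conn:
        conn.set_session(readonly=True)
        with conn.cursor() as cur:
            cur.execute(sql)
            return "\n".join(str(row) for row in cur.fetchall())

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```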
OpenAI is requiring a "search" and "fetch" tool in their specification. Requiring specific tools seems counter to the spirit of MCP. Imagine if every major player had their own interop tool specification.
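To illustrate the point: a server has to expose these two specifically named tools regardless of its domain. A hypothetical sketch with the MCP Python SDK; the corpus and return shapes are made up:

```python
# Hypothetical "search"/"fetch" pair as the spec demands, even for a server
# whose natural interface might look nothing like document retrieval.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs")
DOCS = {"1": {"title": "Intro", "text": "Hello world."}}  # stand-in corpus

@mcp.tool()
def search(query: str) -> list[dict]:
    """Return ids and titles of documents matching the query."""
    return [{"id": i, "title": d["title"]}
            for i, d in DOCS.items() if query.lower() in d["text"].lower()]

@mcp.tool()
def fetch(id: str) -> dict:
    """Return the full document for a previously returned id."""
    return DOCS[id]

if __name__ == "__main__":
    mcp.run()
```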
I was also looking for a video. The concept sounds good, but it feels like I'd need to learn a lot of new commands, or keep a cheat sheet next to me, to be able to use the framework.
Wondering how average users can benefit from this platform with Claude Code, and how it relates to Vending-Bench, which tracks how much money LLMs can make.