
Looks nice!

Lots of OCR/LLM tools (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML: tables with multiple headers and merged cells get mixed up, multiple columns with tick boxes get mixed up, and multi-page tables are not understood correctly. LlamaIndex also fails miserably on those things.

Curious to hear which OCR/LLM excels at these specific issues. Example complex table: https://cdn.aviation.bot/complex-tables.zip

I can only parse this table correctly by first parsing the table headers manually into HTML as example output. However, it still mixes up tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist


> Lots of OCR/LLM tools (even Gemini 2.5 Pro) still struggle to convert complex tables to Markdown or HTML:

But that's something else, that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "can take letters in images and turn them into digital text" to "can replicate anything seen on a screen", the problem space gets too big.

For those images, I'd use something like Magistral + Structured Outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data.
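In code, the two-pass idea looks roughly like this (a sketch assuming an OpenAI-compatible client with structured-output support; the model name and schemas are placeholders, and Magistral would be swapped in via its own endpoint):

```
# Two-pass sketch: pass 1 infers the table's structure, pass 2 extracts the data
# into that structure. Model names and schemas are placeholders.
import base64
from openai import OpenAI
from pydantic import BaseModel

class TableLayout(BaseModel):
    column_names: list[str]
    has_merged_header_rows: bool

class Row(BaseModel):
    values: list[str]  # one cell per column; "[x]" / "[ ]" for tick boxes

class Table(BaseModel):
    column_names: list[str]
    rows: list[Row]

client = OpenAI()
image_b64 = base64.b64encode(open("complex-table.png", "rb").read()).decode()
image_part = {"type": "image_url",
              "image_url": {"url": f"data:image/png;base64,{image_b64}"}}

# Pass 1: figure out the right structure to parse into.
layout = client.beta.chat.completions.parse(
    model="gpt-4o",  # placeholder; swap in whichever vision model you use
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe this table's column layout."}, image_part]}],
    response_format=TableLayout,
).choices[0].message.parsed

# Pass 2: extract the data, using the inferred structure as guidance.
table = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "text",
         "text": f"Extract every row using exactly these columns: {layout.column_names}"},
        image_part]}],
    response_format=Table,
).choices[0].message.parsed
```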


> But that's something else, that's no longer just OCR ("Optical Character Recognition").

Lines often blur for technologies under such rapid evolution. Not sure it's helpful to nitpick the verbal semantics.

It is a fair question whether the OCR-inspired approach is the correct approach for more complex structured documents where wider context may be important. But saying it's "not OCR" doesn't seem meaningful from a technical perspective. It's an extension of the same goal to convert images of documents into the most accurate and useful digitized form with the least manual intervention.


Personally I think there's a meaningful distinction between "can extract text" vs. "can extract text and structure". It is true that some OCR systems can try to replicate the structure, but even today I think that's the exception, not the norm.

Not to mention it's helpful to separate the two because there is such a big difference in the difficulty of the tasks.


I threw the first image/table into Gemini 2.5 Pro, letting it choose the output format, and it looks like it extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I didn't specify preferences.


Nice. Seems like I cannot run this on my Apple silicon M chips, right?


If you have 64 GB of RAM you should be able to run the 4-bit quantized mlx models, which are specifically for the Apple silicon M chips. https://huggingface.co/collections/mlx-community/qwen3-next-...


Got 32 GB, so I was hoping I could use ollm to offload it to my SSD. Slower, but it makes it possible to run bigger models (in emergencies).


I can host it on my M3 laptop at somewhere around 30-40 tokens per second using mlx_lm's server command:

mlx_lm.server --model mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit --trust-remote-code --port 4444

I'm not sure if there is support for Qwen3-Next in any releases yet, but when I set up the python environment I had to install mlx_lm from source.
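Once it's running you can talk to it like any local OpenAI-compatible endpoint; a minimal sketch, assuming mlx_lm.server exposes /v1/chat/completions (adjust the port and model name to your setup):

```
# Minimal client sketch against the local mlx_lm.server instance started above.
# Assumes an OpenAI-compatible /v1/chat/completions endpoint on port 4444.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4444/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Summarize MCP in one sentence."}],
)
print(resp.choices[0].message.content)
```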


This particular one may not work on M chips, but the model itself does. I just tested a different sized version of the same model in LM Studio on a Macbook Pro, 64GB M2 Max with 12 cores, just to see.

Prompt: Create a solar system simulation in a single self-contained HTML file.

qwen3-next-80b (MLX format, 44.86 GB), 4-bit: 42.56 tok/sec, 2523 tokens, 12.79s to first token

- note: looked like ass, simulation broken, didn't work at all.

Then as a comparison for a model with a similar size, I tried GLM.

GLM-4-32B-0414-8bit (MLX format, 36.66 GB), 9.31 tok/sec, 2936 tokens, 4.77s to first token

- note: looked fantastic for a first try, everything worked as expected.

Not a fair comparison (4-bit vs 8-bit), but it's some data. The tok/sec on Mac is pretty good depending on the models you use.


Depends how much RAM yours has. Get a 4-bit quant and it'll fit in ~40-50 GB depending on context window.

And it'll run at like 40t/s depending on which one you have
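Back-of-the-envelope for that ~40-50 GB figure (weights only, ignoring KV cache and runtime overhead):

```
# Rough memory estimate for an 80B-parameter model at 4-bit quantization.
params = 80e9
bytes_per_param = 4 / 8              # 4 bits = 0.5 bytes
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB for weights alone")  # ~40 GB; context/KV cache comes on top
```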


I haven't tested on Apple machines yet, but gpt-oss and qwen3-next should work, I assume. The Llama 3 versions use CUDA-specific loading logic for a speed boost, so those won't work for sure.


Nice, but can somebody tell me if this performs better than my simple Postgres MCP setup using npx? My current setup uses the LLM to search through my local Postgres in multiple steps. I guess this Pgmcp does multiple steps in the background and returns the final result to the LLM calling the MCP tool?

Codex:

```
[mcp_servers.postgresMCP]
command = "npx"
args = ["-y", "@modelcontextprotocol/server-postgres", "postgresql://user:password@localhost:5432/db"]
```

Cursor:

```
"postgresMCP": {
  "command": "npx",
  "args": [
    "-y",
    "@modelcontextprotocol/server-postgres",
    "postgresql://user:password@localhost:5432/db"
  ]
},
```

With my setup I can easily switch between LLMs.


Nice! Is there a way for the agent to know about its own queries / resource usage?

E.g. could the agent actively monitor memory/CPU/time usage of a query and cancel it if it's taking too long?
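For the time/cancellation part, Postgres itself can do most of the heavy lifting; a rough watchdog sketch (assumes direct Postgres access via psycopg2; the connection string and time budget are made up):

```
# Sketch: cancel queries that run longer than a threshold, via pg_stat_activity.
import psycopg2

MAX_SECONDS = 30  # hypothetical per-query time budget

conn = psycopg2.connect("postgresql://user:password@localhost:5432/db")
conn.autocommit = True

with conn.cursor() as cur:
    # Find active queries that have exceeded the budget (excluding ourselves).
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND pid <> pg_backend_pid()
          AND now() - query_start > %s * interval '1 second'
    """, (MAX_SECONDS,))
    for pid, runtime, query in cur.fetchall():
        print(f"cancelling pid {pid} after {runtime}: {query[:80]}")
        cur.execute("SELECT pg_cancel_backend(%s)", (pid,))
```

A simpler option is just setting statement_timeout on the connection the agent uses.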


Are there any existing scripts/tools to use these evolutionary algorithms at home with e.g. Codex/GPT-5/Claude Code?


The DSPy GEPA approach seems rather similar to that: https://dspy.ai/tutorials/gepa_ai_program/
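From memory, the linked tutorial boils down to something like this (a sketch only; argument names such as auto and reflection_lm may differ between DSPy versions, and the dataset/metric are placeholders):

```
# Rough sketch of a DSPy GEPA run, following the linked tutorial.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

program = dspy.ChainOfThought("question -> answer")

trainset = [dspy.Example(question="2+2?", answer="4").with_inputs("question")]

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # GEPA can also use textual feedback; a plain score keeps the sketch short.
    return float(gold.answer.lower() in pred.answer.lower())

optimizer = dspy.GEPA(metric=metric, auto="light",
                      reflection_lm=dspy.LM("openai/gpt-4o"))
optimized = optimizer.compile(program, trainset=trainset, valset=trainset)
```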


Is it me or is the link only opening as JSON?

{ "@context": [ "https://www.w3.org/ns/activitystreams", { "Hashtag": "as:Hashtag", "sensitive": "as:sensitive", "dcterms": "http://purl.org/dc/terms/" } ], "id": "https://neurofrontiers.blog/?p=11319", "type": "Note", "attachment": [ { "type": "Image", "url": "https://i0.wp.com/neurofrontiers.blog/wp-content/uploads/202...", "mediaType": "image/jpeg", "name": "A cartoon depicting a traffic jam of identical black cars, each driven by a brain. Thick clouds of exhaust rise and gather above the stalled vehicles." } ], "attributedTo": "https://neurofrontiers.blog/author/neuronerdb/", "audience": "https://neurofrontiers.blog/?author=0", "content": "<h2>How does air pollution


Same on first visit, but loaded fine after reload.


I tried adding the Context7 Documentation MCP and got this:

URL: https://mcp.context7.com/mcp
Safety Scan: Passed

This MCP server can't be used by ChatGPT to search information because it doesn't implement our specification: search action not found https://platform.openai.com/docs/mcp#create-an-mcp-server


OpenAI is requiring a "search" and "fetch" tool in their specification. Requiring specific tools seems counter to the spirit of MCP. Imagine if every major player had their own interop tool specification.
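For anyone wanting to expose those two tools, a minimal sketch with the Python MCP SDK's FastMCP helper looks roughly like this (the tool result shapes are illustrative; check the linked OpenAI docs for the exact schema they expect):

```
# Sketch of a minimal MCP server exposing "search" and "fetch" tools.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-search")

# Hypothetical in-memory corpus standing in for a real document store.
DOCS = {"doc-1": "Qwen3-Next release notes ...", "doc-2": "MCP spec overview ..."}

@mcp.tool()
def search(query: str) -> list[dict]:
    """Return ids/titles of documents matching the query."""
    return [{"id": doc_id, "title": text[:40]}
            for doc_id, text in DOCS.items() if query.lower() in text.lower()]

@mcp.tool()
def fetch(id: str) -> dict:
    """Return the full text of a previously searched document."""
    return {"id": id, "text": DOCS.get(id, "")}

if __name__ == "__main__":
    mcp.run()  # defaults to stdio; ChatGPT connectors need an HTTP transport
```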


ref-tools-mcp is similar and does support OpenAI's deep research spec.


I was also looking for a video. The concept sounds good, but it feels like I need to learn a lot of new commands, or have a cheat sheet next to me, to be able to use the framework.


Cheatsheet is available via /pm:help

With that being said, a video will be coming very soon.


Wondering how average users can benefit from this platform with Claude Code, and how it relates to Vending-Bench, which tracks how much money LLMs can make.

https://andonlabs.com/evals/vending-bench


Nice! Missing a cost calculator with input and output fields.


Can add that in the future.

