
Bytedance is publishing pretty aggressively.

Recently, my favorite from them was lumine: https://arxiv.org/abs/2511.08892

Here's their official page: https://seed.bytedance.com/en/research


A previous paper from DeepSeek mentioned Anna’s Archive.

> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions.

From the DeepSeek-VL paper: https://arxiv.org/abs/2403.05525


After maintaining my own agents library for a while, I’ve recently switched over to Pydantic AI. I have some minor nits, but overall it's been working great for me. I’ve especially liked combining it with Langfuse.

Towards coding agents: I wonder if there are any good / efficient ways to measure how well different implementations actually perform on coding? SWE-bench seems good, but it's expensive to run. Effectively I’m curious about things like: given tool definition X vs Y (eg. diff vs full-file edit), the prompt for tool X vs Y (how it’s described, whether it uses examples), model choice (eg. MCP with Claude, but inline python-exec with GPT-5), sub-agents, todo lists, etc., how much does each ablation actually matter? And measuring not just success, but cost to success too (efficiency).

Overall, it seems like in the phase space of options, everything “kinda works” but I’m very curious if there are any major lifts, big gotchas, etc.

I ask because it feels like the Claude Code CLI always does a little bit better, subjectively, for me, but I haven’t seen an LMArena-style or clear A-vs-B comparison or measure.
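
To make the ablation idea concrete, here's roughly the harness I'm imagining (run_agent, the config fields, and the cost numbers are hypothetical placeholders, not any existing benchmark's API):

    import statistics

    def run_ablation(tasks, configs, run_agent):
        """run_agent(task, config) -> {'success': bool, 'cost_usd': float} -- hypothetical."""
        results = {}
        for name, config in configs.items():
            runs = [run_agent(task, config) for task in tasks]
            successes = sum(r["success"] for r in runs)
            results[name] = {
                "success_rate": successes / len(runs),
                "mean_cost_usd": statistics.mean(r["cost_usd"] for r in runs),
                # the "efficiency" number: dollars spent per solved task
                "cost_per_success": sum(r["cost_usd"] for r in runs) / max(1, successes),
            }
        return results

    # e.g. ablate the edit tool and the model independently
    configs = {
        "diff-edit / claude": {"edit_tool": "diff", "model": "claude"},
        "full-file / gpt-5": {"edit_tool": "full_file", "model": "gpt-5"},
    }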


The article also refers to this benchmark: https://www.tbench.ai/


The first time I got off at Komagome and heard its tune, I mistakenly thought it was some Halloween special, because it was late October at the time and the song felt so distinct and unique.


It's a rendition of Sakura Sakura, one of Japan's most famous (if not the most famous) folk songs :)

https://m.youtube.com/watch?v=jqpFjsMtCb0


Interestingly, this one seems to be from before 高輪ゲートウェイ (Takanawa Gateway) station, which opened in 2020, but the numbering already shows the gap (JY 25 -> JY 27). That led me to look it up, and it turns out they introduced the numbering in 2016, already pre-planned with the gap ready [1].

[1] https://www.jreast.co.jp/press/2016/20160402.pdf


On the street where I grew up, they had to renumber most of the houses one year because a row of new buildings went up. Everyone further down the street than the new houses had their numbers increased, so that the new houses could be given numbers in keeping with where along the street they were built.

I wonder if that sort of renumbering is common or not, and if Japan is better at planning that sort of thing also.

I was too young at the time to know if this led to any mail delivery issues, and I imagine the postal service was made aware of the change. But even with notification, I would think it would sometimes happen that if your house used to be, say, number 53 and is now 73, mail intended for you ends up in the mailbox of the house that used to be 33 and is now 53.

Even if not at first, then at least like 3 years later, when some random company still has your old address on file and most other mail for everyone on the street is addressed to the updated numbers.


In Japan house numbers are based on construction date rather than position along the street.


And streets don't automatically have names, and with a few exceptions addresses are always based on city blocks instead. (See https://en.wikipedia.org/wiki/Japanese_addressing_system for a more complete explanation)


I'd assume most countries don't bother renumbering when it comes to street numbers?

France has a suffix system, so if buildings are added between 24 and 25 you'll get 24 bis, 24 ter, etc.

Japan doesn't care about the ordering in the first place, so a building added between 24, 25, and 26 will just be 32 without any issue.


Not getting around it, just benefiting from the parallel compute / huge FLOPS of GPUs. Fundamentally, prefill compute is itself highly parallel, and HBM is just that much faster than LPDDR. Effectively, H100s and B100s can chew through prefill in under a second at ~50k-token lengths, so the TTFT (time to first token) can feel amazingly fast.
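
Back-of-the-envelope, just to show the shape of the math (all numbers here are illustrative assumptions, not measurements of any particular deployment):

    # Illustrative assumptions: active params, GPU count, and MFU are made up.
    active_params  = 20e9      # active parameters per token
    prompt_tokens  = 50_000
    num_gpus       = 8
    peak_flops     = 1e15      # ~1 PFLOP/s BF16 dense per H100-class GPU
    mfu            = 0.4       # realistic fraction of peak during prefill

    # Forward pass costs roughly 2 FLOPs per active parameter per token
    # (ignoring attention's quadratic term, which grows at long context).
    prefill_flops   = 2 * active_params * prompt_tokens
    prefill_seconds = prefill_flops / (num_gpus * peak_flops * mfu)
    print(f"{prefill_seconds:.2f} s")   # ~0.6 s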


I was able to get gpt-oss:20b wired up to Claude Code locally via a thin proxy and ollama.

It's fun that it works, but the prefill time makes it feel unusable (2-3 minutes per tool use / completion). That means a ~10-20 tool-use interaction could take 30-60 minutes.

(This was editing a single server.py file of ~1000 lines; the tool definitions + Claude context were around 30k input tokens, and after the file read the input was around ~50k tokens. Definitely could be optimized. Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help.)
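
For the curious, a stripped-down sketch of the shape of that proxy (not my exact code; tool-use translation and streaming, which Claude Code relies on heavily, are omitted, and you'd point Claude Code at it with something like ANTHROPIC_BASE_URL=http://localhost:8000):

    # Sketch only: translate Anthropic-style /v1/messages calls from Claude Code
    # into OpenAI-style chat completions that ollama understands, then translate
    # the answer back. Tool-use blocks and streaming are NOT handled here.
    import httpx
    from fastapi import FastAPI, Request

    app = FastAPI()
    OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # ollama's OpenAI-compatible endpoint

    def _flatten(content):
        # Anthropic content can be a string or a list of typed blocks
        if isinstance(content, str):
            return content
        return " ".join(b.get("text", "") for b in content if b.get("type") == "text")

    @app.post("/v1/messages")
    async def messages(request: Request):
        body = await request.json()
        msgs = []
        if body.get("system"):
            msgs.append({"role": "system", "content": _flatten(body["system"])})
        for m in body.get("messages", []):
            msgs.append({"role": m["role"], "content": _flatten(m["content"])})

        async with httpx.AsyncClient(timeout=None) as client:
            r = await client.post(OLLAMA_URL, json={
                "model": "gpt-oss:20b",
                "messages": msgs,
                "max_tokens": body.get("max_tokens", 1024),
            })
        text = r.json()["choices"][0]["message"]["content"]

        # Return an Anthropic-shaped (non-streaming) message so Claude Code can parse it
        return {
            "id": "msg_proxy",
            "type": "message",
            "role": "assistant",
            "model": body.get("model", "gpt-oss:20b"),
            "content": [{"type": "text", "text": text}],
            "stop_reason": "end_turn",
            "usage": {"input_tokens": 0, "output_tokens": 0},
        }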


> Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help)

Not sure about ollama, but llama-server does have a transparent kv cache.

You can run it with

    llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja --reasoning-format none

Web UI at http://localhost:8080 (also OpenAI compatible API)
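
For example, hitting it from Python with the standard OpenAI client (the model name is mostly cosmetic since it serves whatever was loaded, and the API key only matters if you started the server with one):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
    resp = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": "Summarize what a transparent KV cache buys you."}],
    )
    print(resp.choices[0].message.content)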


I've been working on something very similar as a tool for my own AI research -- though I don't have the success they claim; mine often plateaus on the optimization metric. I think there's secret sauce in the meta-prompting and meta-heuristics -- the paper's comments about them are quite vague, but it makes sense: it changes the dynamics of the search space and helps the LLM get out of ruts. I'm now going to try integrating some ideas based on my interpretation of their work and see how it goes.

If it goes well, I could open source it.

What are the things you would want to optimize with such a framework? (So far I've been focusing on optimizing ML training and architecture search itself.) Hearing other ideas would help motivate me to open source it if there's real demand for something like this.
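
For reference, the core of mine is roughly this shape (heavily simplified; score and propose_variant stand in for the task metric and the LLM call, and the meta-prompt switching is my guess at the kind of thing their meta-heuristics do, not their actual algorithm):

    import random

    def evolve(seed_program, score, propose_variant, meta_prompts,
               generations=100, pop=8):
        # population holds (score, program) pairs; higher score is better
        population = [(score(seed_program), seed_program)]
        best_scores = []
        for _ in range(generations):
            best_score, best_prog = max(population)
            best_scores.append(best_score)
            # crude plateau detection: if the best score hasn't moved in a while,
            # switch meta-prompts so the LLM mutates candidates differently
            stuck = len(best_scores) > 10 and best_scores[-1] <= best_scores[-10]
            meta = random.choice(meta_prompts) if stuck else meta_prompts[0]
            children = [propose_variant(best_prog, meta) for _ in range(pop)]
            population = sorted(population + [(score(c), c) for c in children],
                                reverse=True)[:pop]
        return max(population)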


This does seem similar to what has been done in the neural architecture search domain, doesn't it?

In my case, I'd mainly be interested in mathematics: I'd provide a mathematical problem and a baseline algorithm for it and would want an open source framework to be able to improve on that.


Also definitely interested in open-source ML search: there are so many new approaches (I follow this channel for innovations, and it is overwhelming: https://www.youtube.com/@code4AI), and it would be great to be able to define a use case and have a search come up with the best approaches.


I work in the field of medical image processing. I haven't thought particularly hard about it, but I'm sure I could find a ton of use cases if I wanted to.


I’ve been using Whisky to play Elden Ring on my M4 MBP and it’s been great! I love that the Game Porting Toolkit and Wine all work so well. I did have to pin Steam to an older version to keep it working recently, though. I guess I’ll move over to CrossOver soon.


Curious, what are the specs of your laptop and what frame rates were you getting? I've been considering getting rid of my gaming PC since I exclusively play Elden Ring.


I tried to do this myself about ~1.5 years ago, but ran into issues with capturing state for sockets and open files (which started to show up when using some data science packages, jupyter widgets, etc.)

What are some of the edge cases where ForeverVM does and doesn't work? I don't see anything in the documentation about installing new packages -- do you pre-bake what is available, and how can you see which libraries are available?

I do like that it seems the ForeverVM REPL also captures the state of the local drive (eg. can open a file, write to it, and then read from it).

For context on what I've tried: I used CRIU [1] to make dumps of the process state and then reload them. It worked for basic things, but ran into the issues stated above, so I abandoned the project. (I was trying to create a stack / undo context for REPLs that LLMs could use, since they often put themselves into bad states, and reverting to previous states seemed useful.) If I remember correctly, I also ran into issues because capturing the various outputs (IPython's capture_output concepts) proved difficult outside of a Jupyter environment, and Jupyter environments themselves were even harder to snapshot.

In the end I settled for ephemeral but still real-server Jupyter kernels, where a wrapper managed locals() and globals() as a cache and would re-execute commands in order to rebuild state after the server restarts / crashes. This also let me pip install new packages, so it proved more useful than statically building my image/environment. But I did lose the "serialization" property of the machine state, which was something I wanted.
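
Conceptually, the replay wrapper was something like this (heavily simplified; make_kernel and .execute() stand in for the actual Jupyter kernel wiring):

    class ReplayingRepl:
        def __init__(self, make_kernel):
            self.make_kernel = make_kernel   # factory that returns a fresh kernel client
            self.kernel = make_kernel()
            self.history = []                # every cell that has been executed

        def run(self, code):
            try:
                out = self.kernel.execute(code)
            except ConnectionError:
                # kernel died or server restarted: start fresh and replay history
                # in order to rebuild locals()/globals(), then retry this cell
                self.kernel = self.make_kernel()
                for cell in self.history:
                    self.kernel.execute(cell)
                out = self.kernel.execute(code)
            self.history.append(code)
            return out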

That said, even though I personally abandoned the project, I still hold onto the dream of a full tree/graph of VMs (where each edge is code that was executed), where each VM state can be analyzed (files, memory, etc.). Love what ForeverVM is doing and the early promise here.

[1] https://criu.org/Main_Page


Good insight! We also initially tried to use Jupyter as a base, but found that it had too much complexity (like the widgets you mention) for what we were trying to do, and settled on something closer to a vanilla Python REPL. This really simplified a lot.

We've generally prioritized edge case handling based on patterns we see come up in LLM-generated code. A nice thing we've found is that LLM-generated code doesn't usually try to hold network connections or file handles across invocations of the code interpreter, so even though we don't (currently) handle those it tends not to matter. We haven't provided an official list of libraries yet because we are actively working on arbitrary pypi imports which will make our pre-selected list obsolete.

> Love what ForeverVM is doing and the early promise here.

Thank you! Always means a lot from someone who has built in the same area.


> I was trying to create a stack / undo context for REPLs that LLMs could use, since they often put themselves into bad states, and reverting to previous states seemed useful

This is interesting! How did you end up achieving this? What tools are available for rolling back what LLMs have done?


Dynamic languages like Python should allow you to monkey-patch calls so that instead of opening a regular socket, you are interacting with a wrapper that reopens the connection if it is lost. Could something like this work?
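
Something like this, as a hand-wavy sketch of the monkey-patching idea:

    # Wrap socket.create_connection so callers get an object that redials on a
    # broken connection. The genuinely hard part -- protocol-level state (TLS
    # sessions, half-sent requests, server-side context) -- is ignored here.
    import socket

    _real_create_connection = socket.create_connection

    class ReconnectingSocket:
        def __init__(self, address, *args, **kwargs):
            self._connect = lambda: _real_create_connection(address, *args, **kwargs)
            self._sock = self._connect()

        def sendall(self, data):
            try:
                self._sock.sendall(data)
            except OSError:          # BrokenPipeError / ConnectionResetError are subclasses
                self._sock = self._connect()
                self._sock.sendall(data)

        def __getattr__(self, name):  # delegate everything else to the real socket
            return getattr(self._sock, name)

    socket.create_connection = lambda *a, **kw: ReconnectingSocket(*a, **kw)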

