We are not doing RLHF but fine-tuning directly on a reward function. Our task was improving a coding agent that writes JSONata (https://jsonata.org).
GPT-4o is quite bad at this, as there are not many JSONata snippets on the internet. We collected 20 coding problems; the reward function then simply assigned a scalar value based on whether the model's code output was syntactically correct. (Interestingly, we found that by optimizing for syntax, the model also got better at getting the semantics right.)
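Conceptually, the reward was as simple as this sketch (not our exact code; it assumes node and the jsonata npm package are available, since jsonata() throws on parse errors):

```python
import json
import subprocess

def syntax_reward(completion: str) -> float:
    """Return 1.0 if the model's JSONata snippet parses, else 0.0.

    Hypothetical sketch: we embed the completion as a JS string literal and
    let node + the jsonata npm package do the parsing. A zero exit code
    means jsonata() did not throw, i.e. the expression is syntactically valid.
    """
    script = f"require('jsonata')({json.dumps(completion)})"
    result = subprocess.run(["node", "-e", script], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```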
I think the discrepancy between our result with direct RL and your experience with RLHF comes from the fact that RLHF is built around non-verifiable/subjective domains, where the reward signal obtained from the human-feedback proxy is intrinsically weak(er), i.e. for the same training scenario/prompt you need more samples to get the same gradient quality.
Yes, we wanted to incentivize people who use the platform (redeeming the $20 training credits) to also join a Slack channel, so we can give direct support. We should have pointed this out in the post.
No, not really. As I posted in the other thread, there are quite a few historical examples of why the big labs won’t take the entire market. They will push to publish something like this soon. Also, I think reinforcement fine-tuning is more convenient on the data-control side. Our platform allows you to self-host the reward function, so we only need the prompts; everything else can theoretically stay on the user side.
Thanks! Yes, absolutely. OpenAI already has a reinforcement learning fine-tuning API in closed beta. However, historically, they’ve always left significant room for integrations into users’ systems. E.g. in the current demo of their RL fine-tuning platform, you can only select predefined reward functions and must manually upload the query datasets. I think that's the reason why so many open-source supervised fine-tuning companies exist.
My long-term take is that the agent economy will be built around a few labs providing (partially open-source) foundational models, where you don’t want to be part of the competition, as this will be the AI equivalent of the high-frequency trading arms race.
And above that will sit an infrastructure layer specializing these very models to users’ domains. OpenAI/Anthropic/… RL fine-tuning will be a part of that infrastructure layer, but so will open-source-model alternatives like ours.
Yes, great point. We are currently working on multistep RL.
The big problem with the trivial approach (giving a single reward to the entire (ReAct) trajectory) is that the model receives a weak learning signal per decision (known as the credit assignment problem in the literature), i.e. the individual decisions are not properly taken into account, which then makes the training unstable. I guess this has been an unsolved problem for a long time; however, it was not really looked at, since generalist “planning” agents were not a big thing in RL until o1/DeepSeek.
IMO, the most promising approach to this is something along the lines of MA-RLHF (https://arxiv.org/abs/2410.02743), but adapted to the real world, i.e., splitting up the reward model to grade individual actions inside the trajectory to reduce the “attention distance” between the reward and the decision.
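A rough sketch of the difference (hypothetical, just to illustrate the credit-assignment point, not our implementation):

```python
from typing import Callable, List

def trajectory_level_rewards(
    actions: List[str],
    reward_fn: Callable[[List[str]], float],
) -> List[float]:
    # Naive approach: one scalar for the whole ReAct trajectory,
    # broadcast to every action -> weak, noisy credit assignment.
    r = reward_fn(actions)
    return [r] * len(actions)

def per_action_rewards(
    actions: List[str],
    grade_action: Callable[[str], float],
) -> List[float]:
    # MA-RLHF-style idea: grade each action individually, shrinking the
    # "attention distance" between the reward and the decision it rewards.
    return [grade_action(a) for a in actions]
```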
Hi HN, we are building a reinforcement-learning fine-tuning service for LLMs.
As we know, agents fail all the time. Especially when you try to use them for something actually useful. Current solution approaches suck; prompting has intrinsic limits and supervised fine-tuning requires big explicit datasets that are hard to collect.
So we built a platform to solve that. With Reinforcement Learning/GRPO. Inspired by DeepSeek R1.
You let us intercept your agent's data flow, and we deliver a fine-tuned open-source model that is trained on your agent's specific task.
Instead of providing big datasets of explicit fine-tuning samples, you provide a reward function that judges the model's outputs. Our reinforcement fine-tuning is very sample-efficient, so you get significantly better results with fewer training steps.
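For illustration (a hypothetical signature, not our actual SDK), a reward function can be as small as:

```python
import json

def reward(prompt: str, completion: str) -> float:
    """Toy example: reward the model for emitting syntactically valid JSON."""
    try:
        json.loads(completion)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```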
The current paradigm is best suited for “verifiable domains”, like teaching models to reason, write code, use tools & web, play chess, etc. We are working on natively supporting training on MCP protocols (to make the agents actually learn to use the MCP tools), codebases (to understand every edge case of your codebase), and browser agent frameworks.
Next, we will also support an “alignment mode”, where you don’t have to provide a reward function but instead give high-level feedback on past failure runs of your agent.
It is basically the open-source version of OpenAI’s reinforcement fine-tuning API, but deeply integrated into your agent’s stack.
Depending on demand, we may also make the fine-tuned models downloadable in the future.
We give the first 50 signups $20 in free training credits, so you can try it out. We would love to hear your feedback!
You are right that, at the moment, the system inherently requires a 64-bit OS. We currently support Debian-based distros; it should work with other parent distributions as well, but you need to translate the installer script ;)
But we definitely need to highlight this more clearly in the docs. Thanks for pointing it out!
We also don’t have a definitive hardware spec requirement yet. We’ve tested it on Raspberry Pi 3s and later models (so anything more capable than a 3 should be fine).
> not ESP32 for example
Running on ESP32 is tricky because it would require porting libp2p to an embedded target (which, as far as we know, nobody has done yet). However, we are considering support for embedded “light” nodes that run only a limited portion of the stack. It depends on the feedback we get. Do you have a use case where you’d need it to run on embedded?
We will most likely go with an open-core model. The main part will stay open source (the Core OS extension is under GPL3, and everything SDK-related is MIT).
For paid features, we have several ideas: a hosted management plane to configure and control the swarm (with company RBAC integration) when one of the nodes is connected to the internet; advanced security (currently no access management or authentication is happening); sophisticated orchestration primitives; and LoRa connectivity (to scale the mesh radius to miles).
Cool, makes sense. Thanks! I think devs (me lol) are often suspicious of free, since we are in on, or aware of, a lot of the data-selling/ad-surveillance schemes.
I love the general ideas around free and powerful p2p functionality with hosted management or higher level features.
> Is this primarily a passion project or are you hoping to get corporate sponsorship & adoption?
We are in the current YC W25 batch and our vision is to build a developer framework for autonomous robotics systems from the system we already have.
> Can you provide some insight as to why this would be preferred over an orchestration server?
It heavily depends on your application; there are applications where it makes sense and others where it doesn’t.
The main advantages are that you don’t need an internet connection, the system is more resilient against network outages, and, most importantly, the otherwise-idle resources on the robots are put to use.
I think for hobbyists, the main upside is that it’s quick to set up: you only have to turn on the machines, and it should work without having to care about networking or setting up a cloud connection.
> Would a 'mothership'/Wheel-and-spoke drone responsible for controlling the rest of the hive be considered an orchestration server?
If the mothership is static, in the sense that it doesn’t change over time, we would consider it an orchestration server. Our core services don’t need that, and we envision that most of the decentralized algorithms running on our system also won’t rely on such a central point of failure.
However, there are some applications where it makes sense to have a “temporary mothership”. We are currently working on a “group” abstraction, which continuously runs a leader election to determine a “mothership” among the group (this stays fault-tolerant, as the leader can fail at any time and the system will instantly elect another one).
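Conceptually it boils down to something like this (a minimal sketch, not our actual implementation; it assumes every node sees roughly the same set of live peer IDs):

```python
def elect_leader(live_peer_ids: set[str]) -> str:
    # Every node applies the same deterministic rule to its current
    # membership view, so the group converges on one "temporary mothership"
    # without extra coordination. If that peer fails and drops out of the
    # view, the next call immediately yields a new leader.
    return min(live_peer_ids)
```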
> The main advantages are that you don’t need an internet connection
To that end, I'm not clear on the benefit of this model. To solve that problem, I would just take a centralized framework and stick it inside an oversized drone/vehicle capable of carrying the added weight (in CPU, battery, etc.). There are several centralized models that don't require an external data connection.
> the resources on the robots, which are idle otherwise, are used
But what's the benefit of this? I don't see the use case of needing the swarm to perform lots of calculations beyond the ones required for its own navigation & communication with others. I suppose I could imagine a chain of these 'idle' drones acting as a communication relay between two separate, active hives. But the benefit there seems marginal.
> our system also don’t rely on such central point of failure
This seems like the primary upside, and it's a big one. I'm imagining a disaster or military situation where natural or human forces could be trying to disable the hive. Now, instead of knocking out a single mothership ATV, each and every drone needs to be removed to fully disable it. Big advantage.
> We are just currently working on a “group” abstraction
Makes sense to me. That's the 'value add', might as well really spec that out
> leader election to determine a “mothership” among the group
This seems perfectly reasonable to me and doesn't remove the advantages of the disconnected "hive". But I do find it funny that the solution to decentralization seems to be simply having the centralization move around easily / flexibly. It's not a hive of peers, it's a hive of temporary kings.
> I would just take a centralized framework and stick it inside an oversized drone/vehicle capable of carrying the added weight
Makes sense. I think there are scenarios where such “base stations” are a priori available and “shielded,” so in this case, it might make more sense to just go with a centralized system. This could also be built on top of our system, though.
> But what’s the benefit of this?
I agree that, in many cases, the return on saving costs might be marginal. However, say you have a cluster of drones equipped with computing hardware capable enough to run all algorithms themselves—why spin up a cloud instance for running a centralized version of that algorithm? It is more of an engineering-ideological point, though ;)
> But I do find it funny that the solution to decentralization seems to be simply having the centralization move around easily / flexibly. It’s not a hive of peers, it’s a hive of temporary kings.
Most of our applications will not need this group leader. For example, the pubsub system does not work by aggregating and dispatching the messages at a central point (like MQTT) but employs a gossip mechanism (https://docs.libp2p.io/concepts/pubsub/overview/).
What I meant is that, in some situations, it might be more efficient (and easier to reason about) to elect a leader. For example, say you have an algorithm that needs to do a matching between neighboring nodes: each node has some data point, and the algorithm wants to compute a pairwise similarity metric and share all computed metrics back to all nodes.
You could do some kind of “ring-structure” algorithm, where you have an ordering among the nodes, and each node receives data points from the predecessor, computes its own similarity against the incoming data point, and forwards the received data point to its successor.
If one node fails, its ring neighbors simply skip it and connect to its successor. This would be truly decentralized, with no single point of failure. However, in most cases, this approach will have higher computation latency than just electing a temporary leader (letting the leader compute the matchings and send them back to everyone).
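The leader-side step is roughly this (a toy sketch; data points are floats here purely for illustration):

```python
import itertools
from typing import Callable, Dict, Tuple

def leader_matching(
    data: Dict[str, float],
    similarity: Callable[[float, float], float],
) -> Dict[Tuple[str, str], float]:
    # Hypothetical leader-side computation: the elected leader gathers every
    # node's data point, computes all pairwise similarities in one place,
    # and then broadcasts the full result back to the group in one round.
    return {
        (a, b): similarity(data[a], data[b])
        for a, b in itertools.combinations(sorted(data), 2)
    }
```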
So someone caring about efficiency (and not resiliency) will probably want such a leader mechanism.
Yeah, SLAM also seems like a natural showcase for us.
I am currently working on a decentralized collaborative SLAM package on top of our system, where multiple robots can drive around and continuously merge their maps without a coordination server, using the Mesh integration and PubSub system. It should be out in about a week.