More

honorable_coder · 2025-10-01T16:12:42 1759335162

You mean Claude Code 2.0 Router? What's 2.0 about your router, isn't it the v1? And its not packaged into a CLI agent - its integrated into Claude Code (meaning you don't support other agents yet). Correct?

honorable_coder · 2025-09-30T14:55:10 1759244110

Hi HN — we're the team behind Arch-Router (https://huggingface.co/katanemo/Arch-Router-1.5B), A 1.5B preference-aligned LLM router that guides model selection by matching queries to user-defined domains (e.g., travel) or action types (e.g., image editing). Offering a practical mechanism to encode preferences and subjective evaluation criteria in routing decisions.

Today we’re extending that approach to Claude Code via Arch Gateway[1], bringing multi-LLM access into a single CLI agent with two main benefits:

1. Model Access: Use Claude Code alongside Grok, Mistral, Gemini, DeepSeek, GPT or local models via Ollama.

2. Preference-based Routing: Assign different models to specific coding tasks, such as – Code generation – Code reviews and comprehension – Architecture and system design – Debugging

Why not route based on public benchmarks? Most routers lean on performance metrics — public benchmarks like MMLU or MT-Bench, or raw latency/cost curves. The problem: they miss domain-specific quality, subjective evaluation criteria, and the nuance of what a “good” response actually means for a particular user. They can be opaque, hard to debug, and disconnected from real developer needs.

[1] Arch Gateway repo: https://github.com/katanemo/archgw

honorable_coder · 2025-09-22T18:03:21 1758564201

Today we’re shipping a major update to ArchGW (an edge and service proxy for agents [1]): a unified router that supports three strategies for directing traffic to LLMs — from explicit model names, to semantic aliases, to dynamic preference-aligned routing. Here’s how each works on its own, and how they come together.

Preference-aligned routing decouples task detection (e.g., code generation, image editing, Q&A) from LLM assignment. This approach captures the preferences developers establish when testing and evaluating LLMs on their domain-specific workflows and tasks. So, rather than relying on an automatic router trained to beat abstract benchmarks like MMLU or MT-Bench, developers can dynamically route requests to the most suitable model based on internal evaluations — and easily swap out the underlying moodel for specific actions and workflows. This is powered by our 1.5B Arch-Router LLM [2]. We also published our research on this recently[3]

Modal-aliases provide semantic, version-controlled names for models. Instead of using provider-specific model names like gpt-4o-mini or claude-3-5-sonnet-20241022 in your client you can create meaningful aliases like "fast-model" or "arch.summarize.v1". This allows you to test new models, swap out the config safely without having to do code-wide search/replace every time you want to use a new model for a very specific workflow or task.

Model-literals (nothing new) lets you specify exact provider/model combinations (e.g., openai/gpt-4o, anthropic/claude-3-5-sonnet-20241022), giving you full control and transparency over which model handles each request.

P.S. we routinely get asked why we didn't build semantic/embedding models for routing use cases or use some form of clustering technique. Clustering/embedding routers miss context, negation, and short elliptical queries, etc. An autoregressive approach conditions on the full context, letting the model reason about the task and generate an explicit label that can be used to match to an agent, task or LLM. In practice, this generalizes better to unseen or low-frequency intents and stays robust as conversations drift, without brittle thresholds or post-hoc cluster tuning.

[1] https://github.com/katanemo/archgw [2] https://huggingface.co/katanemo/Arch-Router-1.5B [2] https://arxiv.org/abs/2506.16655

honorable_coder · 2025-08-17T12:46:34 1755434794

We use this technique heavily for function-calling scenarios in https://github.com/katanemo/archgw, which uses a 3b function-calling model to neatly map a user's ask to one of many tools — the model doesn’t need to write an essay, it just needs to pick the right function immediately and the response can be synthesized by one of many configured upstream LLMs.

Why we do this: latency. A 3b parameter model, especially when quantized, can deliver sub-100ms time-to-first-token and generate a complete function call in under 50 tokens. That makes the LLM “disappear” as a bottleneck, so the only real waiting time is in the external tool or API being called + the time it takes to synthesize a human readable response.

sunscream89 · 2025-08-17T14:21:34 1755440494

Your approach is cool, a bit cringe to say it’s entropy. You’ve mitigated some response latency in exchange for an opportunity to refine the decision support up stream. It’s a nice strategy!

honorable_coder · 2025-08-02T22:09:06 1754172546

The core insight of decoupling route selection from model assignment is rooted in first principles engineering thinking. Someone recently wrote about their work in more detail here:

https://medium.com/@dracattusdev/finally-an-llm-router-that-...

honorable_coder · 2025-07-27T04:10:08 1753589408

Bit of context: the team that build envoy proxy is now building a new network substrate for agents treating prompts as a first class citizen in the stack. You can check out their open source efforts here: https://github.com/katanemo/archgw

honorable_coder · 2025-07-22T23:20:20 1753226420

and you say you aren't "vested" in liteLLM?

swyx · 2025-07-23T04:28:40 1753244920

yes, green text hn account, i am not. i just want help in properly identifying flaws in litellm. clearly nobody here is offering actual analysis.

honorable_coder · 2025-07-22T18:49:21 1753210161

a proxy means you offload observability, filtering, caching rules, global rate limiters to a specialized piece of software - pushing this in application code means you _cannot_ do things centrally and it doesn't scale as more copies of your application code get deployed. You can bounce a single proxy server neatly vs. updating a fleet of your application server just to monkey patch some proxy functionality.

AMeckes · 2025-07-22T19:03:25 1753211005

Good points! any-llm handles the LLM routing, but you can still put it behind your own proxy for centralized control. We just don't force that architectural decision on you. Think of it as composable: use any-llm for provider switching, add nginx/envoy/whatever for rate limiting if you need it.

honorable_coder · 2025-07-22T19:07:06 1753211226

How do I put this behind a proxy? You mean run the module as a containerized service?

But provider switching is built in some of these - and the folks behind envoy built: https://github.com/katanemo/archgw - developers can use an OpenAI client to call any model, offers preference-aligned intelligent routing to LLMs based on usage scenarios that developers can define, and acts as an edge proxy too.

AMeckes · 2025-07-22T19:29:25 1753212565

To clarify: any-llm is just a Python library you import, not a service to run. When I said "put it behind a proxy," I meant your app (which imports any-llm) can run behind a normal proxy setup.

You're right that archgw handles routing at the infrastructure level, which is perfect for centralized control. any-llm simply gives you the option to handle routing in your application code when that makes sense (For example, premium users get Opus-4). We leave the architectural choice to you, whether that's adding a proxy, keeping routing in your app, or using both, or just using any-llm directly.

sparacha · 2025-07-22T20:54:31 1753217671

But you can also use tokens to implement routing decisions in a proxy. You can make RBAC natively available to all agents outside code. The incremental feature work in code vs an out of process server is the trade off. One gets you going super fast the other offers a design choice that (I think) scales a lot better

RussianCow · 2025-07-22T18:53:43 1753210423

You can do all of that without a proxy. Just store the current state in your database or a Redis instance.

honorable_coder · 2025-07-22T19:00:19 1753210819

and managed from among the application servers that are greedily trying to store/retrieve this state? Not to mention you'll have to be in the business of defining, updating and managing the schema, ensuring that upgrades to the db don't break the application servers, etc, etc. The proxy server is the right design decision if you are truly trying to build something production worthy and you want it to scale.

RussianCow · 2025-07-26T15:29:54 1753543794

> Not to mention you'll have to be in the business of defining, updating and managing the schema, ensuring that upgrades to the db don't break the application servers, etc, etc.

I have to do this already with practically all software I write, so the comexity is already baked in. Sure, if you don't already have a database or cache, maybe a proxy is simpler, but otherwise it's just extra infrastructure you need to manage.

> The proxy server is the right design decision if you are truly trying to build something production worthy and you want it to scale.

I've been doing stuff like the above (not for LLMs but similar use cases) for years "at scale" without issues. But in any case, you need to store state the moment you scale beyond a single proxy server anyway. Plus, most products never achieve a scale where this discussion matters.

honorable_coder · 2025-07-22T18:43:05 1753209785

the people behind envoy proxy built: https://github.com/katanemo/archgw - has the learnings of Envoy but natively designed to process/route prompts to agents and LLMs. Would be curious about your thoughts

honorable_coder · 2025-07-18T20:16:04 1752869764

I see what you are doing there - HLRF for free.

jshchnz · 2025-07-18T20:57:58 1752872278

HLRF? do you mean RLHF?

honorable_coder · 2025-07-18T22:51:56 1752879116

yes yes, i am dyslexic so it shows up a lot more in short responses.