Micro-Agent: Beat Frontier Models with Collaboration inside Model API
Everyone is watching for the next frontier model.
The more interesting layer may be the one in front of it.
Routers are becoming the control plane for AI inference. Their first role was practical: route the right request to the right model. That already matters because production AI is no longer a one-model world.
A router can cut cost by deciding when a request deserves a frontier model and when an open-source or local model is enough. It can make safety policy executable by sending sensitive domains to stricter models, stricter filters, or stronger review paths. It can coordinate cloud and edge, keeping private or low-latency intent local while escalating harder work to the cloud.
Those are important jobs.
But the next router job is more interesting:
A router can make the model better.
Not by changing weights. Not by asking every application to build a bespoke agent graph. By turning one model API call into a bounded collaboration inside the serving layer.
Figure 1: The router is moving from model selection to capability construction.
This is why Sakana Fugu landed so loudly: it made a commercial product out of a simple but powerful idea, that a “model” can be a surface, and behind that surface can be a team. The research around this idea, including the Fugu technical report and coordination papers such as Conductor and Trinity, gives useful language for thinking about orchestration.
But the vLLM Semantic Router vision is different in where it puts the abstraction. Collaboration should not live only inside one commercial endpoint or one application-specific agent graph. It should become an open serving primitive.
vLLM Semantic Router brings that idea into the open serving layer. The user still calls one model:
{
"model": "vllm-sr/auto",
"messages": [{"role": "user", "content": "..."}]
}
Behind that stable model identity, the router can select a recipe, fan out to workers, collect a quorum, verify disagreement, synthesize a final answer, repair the output contract, and return one normal OpenAI-compatible response.
The point is not to expose complexity.
The point is to make collaboration feel like a model.
The Looper Is the Runtime
In vLLM Semantic Router, the looper is the execution runtime for bounded micro-agents.
A request enters the router as an ordinary chat completion. The router extracts signals, projects them into task-shape or risk bands, matches a decision, and then chooses an algorithm. That algorithm may be a normal single-model route, or it may be a looper route.
Today, the main looper patterns are:
- Confidence: a sequential escalation loop. It tries a cheaper candidate first, measures confidence, and escalates only when the score is too low.
- Ratings: a bounded fan-out loop. It runs multiple candidates under a hard concurrency cap and aggregates them with rating-aware weights.
- ReMoM: repeated mixture-of-model reasoning. It fans out breadth samples, waits for enough successful responses, and runs a final synthesis round.
- Fusion: a panel-judge-final pattern. Independent model responses become evidence for a judge and finalizer.
- Workflows: a micro-agent workflow runtime. It supports static roles or a dynamic planner, executes bounded worker steps, and synthesizes a final response.
Figure 2: Looper algorithms run inside the router while preserving the model API surface.
The implementation details matter. A looper is not a slogan for “ask more models.” It is a small runtime with budget, topology, trace, and failure policy.
Confidence: spend escalation only on hard cases
Confidence is the cost-aware loop. It starts with a smaller or cheaper candidate, then evaluates whether the answer is confident enough to stop. The confidence signal can come from token-level log probability, logprob margin, a hybrid score, self-verification, or an AutoMix-style entailment verifier.
If the score passes the threshold, the router returns immediately. If the score is too low, the route escalates to the next candidate. The important part is not that escalation exists. It is that escalation becomes explicit router policy: thresholds, failure behavior, and stopping conditions are visible and tunable.
Figure 3: Confidence turns escalation into a measured stopping policy.
Ratings: parallel quality under a hard cap
Ratings is the controlled ensemble loop. It launches several candidates in
parallel, but only up to a configured max_concurrent cap. That makes it useful
when a route should benefit from multiple model views without turning every
request into an unbounded fan-out.
The router collects successful responses, applies rating-aware aggregation, and handles failures according to the route policy. In practice, Ratings is a good fit for A/B-style evaluation, ensemble strategies, and routes where the operator already has meaningful per-candidate quality signals.
Figure 4: Ratings keeps multi-candidate execution bounded and rating-aware.
ReMoM: breadth with a contract
ReMoM is useful when the task has high reasoning variance and the answer format must survive the collaboration. It fans out multiple reasoning attempts, waits for a minimum-success quorum, then asks a synthesis model to merge evidence into the required output contract.
If synthesis fails but earlier workers produced valid evidence, the route does not have to collapse into an API error. It can fall back to the best valid evidence and still return a normal response.
Figure 5: ReMoM treats breadth, quorum, synthesis, and fallback as serving-time controls.
Fusion: disagreement as signal
Fusion starts from a different bet. Sometimes the useful object is not the average answer; it is the structure of disagreement. Independent panel answers become evidence. The judge sees agreement, contradiction, and unique insight, then the finalizer returns one answer with the trace collapsed behind the API.
That makes Fusion especially useful when there are plausible competing paths: hard multiple-choice reasoning, long-form expert judgment, or exact-answer tasks where a single confident response can be brittle.
Figure 6: Fusion does not hide disagreement. It turns disagreement into evidence.
Workflows: roles under a budget
Workflows is the most agentic pattern, and also the one that needs the strictest boundaries. The planner can only choose allowed worker models. The plan is validated. Steps are bounded by max steps, max parallelism, timeouts, and error policy. The final response still has to satisfy the output contract.
For SWE-style tasks, that means the router can express a planner, patcher, verifier, and finalizer without letting the application own a bespoke agent stack. For production serving, that distinction is critical: the loop is powerful, but it is still governed by infrastructure.
Figure 7: Workflows gives the router a bounded role system, not an unbounded autonomous agent.
Auto recipes: one model name, many loops
The public surface remains one model name: vllm-sr/auto. Internally, the
router can use signals and projections to choose the right loop for the request.
Difficulty, risk, contract pressure, latency, and cost are not comments in a
prompt. They are routing facts that can select Confidence, Ratings, ReMoM,
Fusion, Workflows, or a fallback path.
Figure 8: Auto recipes let signals choose the collaboration pattern while preserving one model identity.
This is the difference between “agent as app logic” and “micro-agent as serving runtime.” The router controls the budget, policy, topology, trace, and failure mode.
Recipes Beat One Universal Loop
The most important lesson from our eval work is not that one algorithm always wins.
It is the opposite:
The best loop is task-shaped.
GPQA-Diamond wants strict multiple-choice answer preservation. LiveCodeBench wants runnable code and hidden-test robustness. Humanity’s Last Exam wants disagreement resolution and exact-answer formatting. SWE-style tasks need a planner, patcher, verifier, and finalizer.
That is why vllm-sr/auto should not mean “always run the biggest loop.” It
should mean: select the recipe that fits this task.
Figure 9: Signals and projections let the router choose a benchmark-shaped collaboration pattern.
In our recipes, that shape is explicit:
- GPQA-Diamond routes hard science multiple-choice prompts into a ReMoM recipe
with strict
ANSWER: Xpreservation. - LiveCodeBench looks for constraints, starter code, standard input, float tolerance, timeout risk, and hidden-test risk before selecting a code-shaped loop.
- HLE detects formal reasoning, disagreement risk, long context, and exact answer pressure before choosing between deeper ReMoM, smaller Fusion, or a fallback path.
This is why router-side collaboration is more than prompt engineering. The prompt is only one part. The recipe also defines model pool, model roles, reasoning effort, concurrency, quorum, timeout, synthesis model, fallback policy, output contract, and observability labels.
The Scorecard Is a Proof, Not the Whole Story
We evaluated the current closed-model recipe across three hard benchmarks. The numbers are useful because they show that the idea is not only aesthetic.
Figure 10: VSR Closed and VSR Hybrid scorecard view across LiveCodeBench, GPQA-Diamond, and Humanity's Last Exam.
In this scorecard, VSR Closed means the recipe uses only closed-model backends. VSR Hybrid means the recipe mixes open and closed models, using the stronger closed models where the recipe needs higher-risk judging, repair, synthesis, or fallback.
| Benchmark | VSR scorecard row | Score | Reference rows |
|---|---|---|---|
| LiveCodeBench, January-April 2025 | VSR Closed | 92.6 | Fugu Ultra 92.0, Fugu 90.3, GPT-5.5 90.7, Opus 4.8 90.3 |
| GPQA-Diamond | VSR Closed | 96.0 | Fugu Ultra 95.5, Fugu 95.5, Gemini 3.1 Pro 94.3, GPT-5.5 93.6 |
| Humanity’s Last Exam | VSR Closed | 50.0 | Fugu Ultra 50.0, Fugu 48.5, Gemini 3.1 Pro 45.0 |
| Humanity’s Last Exam | VSR Hybrid | 47.1 | GLM-5.2 40.5, Qwen3.7 Max 41.4, GPT-5.5 41.4 |
The scorecard should be read carefully. It is not a claim that every request should always use every closed model. That would be the wrong product.
The claim is that router-owned collaboration can create a stronger model identity than the individual calls beneath it. It can beat or match frontier single-model baselines while preserving one API surface.
That is the real product shape:
- Users see one model name.
- Operators control the recipe.
- The system can improve without changing the client integration.
- Open and closed models can participate under the same serving abstraction.
What This Means for Model Serving
The old serving stack was passive. It accepted a model name and sent the request to a backend.
The next serving stack is active. It asks:
- What evidence do we have about this request?
- What quality, cost, latency, and safety band does it fall into?
- Is one model enough?
- If not, what collaboration pattern should run?
- Which answer contract must be preserved?
- What should happen if one provider is slow or wrong?
- How do we expose one clean response while keeping the full trace?
That is not application glue. That is infrastructure.
Micro-agents belong in the router because the router already owns the things micro-agents need: model aliases, provider policy, credentials, cost metadata, signals, decisions, retries, timeouts, traces, and OpenAI-compatible response semantics.
The Takeaway
The phrase “frontier model” is starting to mean two things.
One is a checkpoint.
The other is a system boundary.
The recent orchestration wave made the direction visible. vLLM Semantic Router is the bet that this capability should be programmable, observable, and open at the serving layer.
The next model race will still involve better models. But it will also involve better routers: routers that know when to save money, when to enforce safety, when to stay on the edge, when to go to the cloud, and when to turn one request into a small, disciplined team.
That is the promise of micro-agents inside the Model API.
Acknowledgements
We thank researchers from MBZUAI, McGill University, Mila, and Agentic Intelligence Lab, especially Prof. Xue Liu and Dr. Bowei He, for research collaboration and discussions around router-side model collaboration.
Individual Contributors: Huamin Chen, Yincheng Ren.
We also thank AMD’s Andy Luo and Haichen Zhang for AMD GPU evaluation support.