📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/02-routing —
`docker compose up` exposes `POST /complete` on localhost:8000. Every snippet below is live-pulled from that repo.
Every “chat with AI” tutorial picks one model and calls it a day. That works in a toy. It doesn’t work in production, where users have different price elasticity, different conversation styles, and different tolerance for content.
Here’s how HoneyChat routes LLM calls across a handful of providers via OpenRouter, how that routing handles finish_reason=content_filter empty-completion edge cases, and the fallback chain pattern that keeps replies flowing.
TL;DR
- Route by tier (price elasticity) and by content mode (kind of turn). A single default model can’t do both.
- Some reasoning/model-provider combinations can return `finish_reason=content_filter` with empty content on borderline turns. A retry policy that only catches HTTP errors can miss this.
- Working pattern: primary → different-provider fallback → specialized last resort, with retries triggered by both error responses and suspicious empty completions.
Run it yourself in 3 minutes
Get the fallback chain live so you can watch a real content_filter retry.
All of this happens inside `tutorial/02-routing`.
1. Clone and configure
```bash
git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/02-routing
cp .env.example .env
```
Open `.env` and paste your `OPENROUTER_API_KEY` (get one here).
The three default model slots all point to free-tier OpenRouter models so you
can experiment without spending.
2. Start the service
```bash
docker compose up --build -d
curl http://localhost:8000/health   # {"ok":true}
```
3. Send a normal turn — primary answers
```bash
curl -X POST http://localhost:8000/complete \
  -H 'content-type: application/json' \
  -d '{"messages":[{"role":"user","content":"Name three cold-climate fruits."}]}' \
  | jq
```
Expected response (the important parts):
```json
{
  "content": "Apples, pears, and cloudberries...",
  "model": "meta-llama/llama-3.1-8b-instruct:free",
  "attempt": 0,
  "used_fallback": false
}
```
`attempt: 0` means the primary model answered. `used_fallback: false` means no retry was needed.
4. Force a fallback
Override the primary to point at a model you know tends to refuse — or any bogus model name — and watch the chain kick in:
```bash
curl -X POST http://localhost:8000/complete \
  -H 'content-type: application/json' \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"primary":"this/model-does-not-exist"}' \
  | jq '.model, .attempt, .used_fallback'
```
`attempt: 1` (or 2) means the next rung answered. In production, log this metric: a rising fallback rate on a class of content means it’s time to move that content to a different primary, not to tweak retry logic.
5. Run the unit tests
```bash
pip install -e ".[dev]"
pytest -v
```
Seven tests cover the failure modes in this chain: content_filter with empty content, transient 5xx, non-transient 4xx, and all-models-fail.
With the service running and the tests green, the rest of this post explains why the chain is shaped this way.
Why one model doesn’t fit all
Three pressures push against a single-model setup:
Price elasticity by tier. A free user sending 20 messages a day at flagship-model prices burns cash. A paying top-tier user expects flagship quality. The economics do not agree on one model.
Content mode. Mainstream-aligned models can refuse content that legitimate companion/roleplay products allow on paid tiers. Conversely, more permissive models often have weaker long-context coherence.
Latency vs. depth. Instant conversational turns need sub-3-second responses. Long scene-writing can tolerate 10+ seconds for better prose. One model sacrifices one for the other.
The routing table
HoneyChat publishes its tier → model map:
| Tier | Default model | Output tokens |
|---|---|---|
| Free / Basic / Premium | Qwen3-235B MoE | 250 / 400 / 600 |
| VIP | Gemini 3.1 Flash Lite | 800 |
| Elite | Aion-2.0 (RP fine-tuned) | 800 |
Plus one mode override: on explicit-content turns, lower tiers can route through Grok 4.1 Fast. The routing happens inside the handler; the user does not pick a model.
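As a sketch, the table above can be expressed as a routing function. The tier keys and model slugs below are illustrative, not the repo’s actual identifiers:

```python
# Hypothetical mirror of the routing table; slugs are illustrative.
TIER_ROUTES = {
    "free":    ("qwen/qwen3-235b-a22b", 250),
    "basic":   ("qwen/qwen3-235b-a22b", 400),
    "premium": ("qwen/qwen3-235b-a22b", 600),
    "vip":     ("google/gemini-3.1-flash-lite", 800),
    "elite":   ("aion-labs/aion-2.0", 800),
}
EXPLICIT_OVERRIDE = "x-ai/grok-4.1-fast"
EXPLICIT_TIERS = {"free", "basic", "premium"}  # lower tiers only

def route(tier: str, explicit_turn: bool) -> tuple[str, int]:
    """Resolve (model, max_output_tokens) for a turn; the user never picks."""
    model, max_tokens = TIER_ROUTES[tier]
    if explicit_turn and tier in EXPLICIT_TIERS:
        model = EXPLICIT_OVERRIDE  # mode override beats tier default
    return model, max_tokens
```

Keeping the map in one dict means the tier economics live in config, not scattered through handler code.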
The reasoning-model empty-completion edge case
Some reasoning-class model/provider combinations do server-side filtering before returning a final answer. On borderline turns, they may return a successful response with:
```json
{
  "choices": [{
    "finish_reason": "content_filter",
    "message": { "content": "" }
  }]
}
```
Empty string. No exception. Status 200. The user can see a blank reply if your retry logic only triggers on `httpx.HTTPStatusError`.
Resilient fallback chain
The guard for silent refusals is a tiny function, pulled straight from the tutorial:
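If you’re not reading alongside the repo, a minimal sketch of such a guard (the function name is illustrative) looks like:

```python
def is_silent_refusal(choice: dict) -> bool:
    """True for completions that 'succeeded' but carry no usable text."""
    if choice.get("finish_reason") == "content_filter":
        return True  # provider filtered the answer server-side
    content = (choice.get("message") or {}).get("content") or ""
    return not content.strip()  # catches finish_reason=stop with empty body too
```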
And the whole fallback loop — the guard is invoked inside it, retrying on empty completions the same way it retries on HTTP errors:
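As a self-contained sketch of that loop’s shape (a hypothetical `ProviderHTTPError` stands in for `httpx.HTTPStatusError`, and the model call is injected so the logic is testable; names are illustrative, not the repo’s):

```python
class ProviderHTTPError(Exception):
    """Stand-in for httpx.HTTPStatusError in this sketch."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

class AllModelsFailed(Exception):
    pass

def complete_with_fallback(messages: list, models: list, call_model):
    """Walk the chain; retry on HTTP errors AND suspicious empty completions."""
    last_error = None
    for attempt, model in enumerate(models):
        try:
            choice = call_model(model, messages)  # returns choices[0] as a dict
        except ProviderHTTPError as exc:
            last_error = f"{model}: {exc}"  # bad model name, 5xx, rate limit...
            continue
        content = (choice.get("message") or {}).get("content") or ""
        # A "successful" response with no usable text is also a failure.
        if choice.get("finish_reason") == "content_filter" or not content.strip():
            last_error = f"{model}: empty completion"
            continue
        return {
            "content": content,
            "model": model,
            "attempt": attempt,        # 0 = primary answered
            "used_fallback": attempt > 0,
        }
    raise AllModelsFailed(last_error)
```

The key move is that the empty-completion branch uses `continue`, exactly like the HTTP-error branch, so both failure classes walk the same chain.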
Two details:
- The empty-content check is separate from the finish reason. Some models return `finish_reason=stop` with empty content. Always check `not content.strip()`.
- Track which model answered. Log `attempt > 0` as a fallback event. If the primary fails on a class of content, move that content to a different primary.
Picking the fallback order
For the instant/explicit mode:
```
content-mode primary     → primary, fast enough for chat
        ↓
diff-provider fallback   → avoids shared provider behavior
        ↓
abort                    → tell the user to try a shorter prompt
```
The rule: different-provider fallbacks. If the primary is hosted on provider A, prefer a fallback on provider B. Same-provider fallbacks can fail on the same content because provider-side filtering may sit upstream of the model.
Content-level gating happens before the LLM, not after
The fallback chain handles model-level refusals. But if the user’s intent is above their plan’s content ceiling, retrying on a more permissive model just burns tokens before the user hits the real limit. Gate the content level in system-prompt assembly.
Keep the tier-level policy simple: the escalation class of the user’s intent should be ≤ the user’s plan ceiling. If over, the character responds in-character and the bot sends the upsell. The LLM does not need to know the tier exists; it just gets a system prompt with the right constraints for this turn.
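As an illustration of that policy (function names and the numeric escalation levels are hypothetical, not the repo’s), system-prompt assembly can clamp the turn’s constraints like this:

```python
def assemble_system_prompt(base_prompt: str, intent_level: int,
                           plan_ceiling: int) -> tuple[str, bool]:
    """Clamp this turn's content constraints to the plan ceiling.

    Returns (system_prompt, over_ceiling). The LLM never learns the tier;
    it only sees the constraints for this turn. over_ceiling tells the
    handler to send the upsell alongside the in-character reply.
    """
    over = intent_level > plan_ceiling
    effective = min(intent_level, plan_ceiling)
    constraints = f"Content level for this turn: {effective}."
    if over:
        constraints += " Stay in character while declining to escalate."
    return f"{base_prompt}\n{constraints}", over
```

The gate runs before any tokens are spent, so an over-ceiling intent never burns a retry chain on its way to a refusal.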
Instrumentation that matters
Log per call:
- Model that answered (primary or fallback index)
- TTFT vs total time — surfaces whether latency was model-side or network-side
- Token cost (input + output) per message, bucketed by tier
Costs track in Redis counters with short TTL. A global daily ceiling blocks new generations if spend crosses a configured threshold (fail-closed: if the counter is unreachable, block, don’t pass).
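A minimal sketch of the fail-closed check, with the counter read injected (`get_daily_spend` is a stand-in for whatever reads the Redis counter; this is not the repo’s code):

```python
def generation_allowed(get_daily_spend, ceiling_usd: float) -> bool:
    """Fail-closed spend ceiling: if the counter backend is unreachable,
    block new generations rather than risk unbounded spend."""
    try:
        spend = get_daily_spend()  # e.g. a Redis GET behind the scenes
    except Exception:
        return False               # counter unreachable -> block (fail-closed)
    return spend < ceiling_usd
```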
Where this lives
HoneyChat’s LLM router sits behind the chat handler on both Telegram and web. The public architecture reference lives in github.com/sm1ck/honeychat/blob/main/docs/architecture.md.
Previous: Persistent-memory dual-layer architecture. Next: Character consistency with custom LoRA.