HoneyChat

LLM routing per tier via OpenRouter — when one model doesn't fit all

sm1ck · 1 min read

📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/02-routing. `docker compose up` exposes POST /complete on localhost:8000. Every snippet below is pulled directly from that repo.

Every “chat with AI” tutorial picks one model and calls it a day. That works in a toy. It doesn’t work in production, where users have different price elasticity, different conversation styles, and different tolerance for content.

Here’s how HoneyChat routes LLM calls across a handful of providers via OpenRouter, how that routing handles finish_reason=content_filter empty-completion edge cases, and the fallback chain pattern that keeps replies flowing.

TL;DR

  • Route by tier (price elasticity) and by content mode (kind of turn). A single default model can’t do both.
  • Some reasoning/model-provider combinations can return finish_reason=content_filter with empty content on borderline turns. A retry policy that only catches HTTP errors can miss this.
  • Working pattern: primary → different-provider fallback → specialized last resort, with retries triggered by both error responses and suspicious empty completions.

Run it yourself in 3 minutes

Get the fallback chain live so you can watch a real content_filter retry. All of this happens inside tutorial/02-routing.

1. Clone and configure

Terminal window
git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/02-routing
cp .env.example .env

Open .env and paste your OPENROUTER_API_KEY (available from the OpenRouter dashboard). The three default model slots all point to free-tier OpenRouter models, so you can experiment without spending.

2. Start the service

Terminal window
docker compose up --build -d
curl http://localhost:8000/health # {"ok":true}

3. Send a normal turn — primary answers

Terminal window
curl -X POST http://localhost:8000/complete \
-H 'content-type: application/json' \
-d '{"messages":[{"role":"user","content":"Name three cold-climate fruits."}]}' \
| jq

Expected response (the important parts):

{
  "content": "Apples, pears, and cloudberries...",
  "model": "meta-llama/llama-3.1-8b-instruct:free",
  "attempt": 0,
  "used_fallback": false
}

attempt: 0 means the primary model answered. used_fallback: false means no retry was needed.

4. Force a fallback

Override the primary to point at a model you know tends to refuse — or any bogus model name — and watch the chain kick in:

Terminal window
curl -X POST http://localhost:8000/complete \
-H 'content-type: application/json' \
-d '{"messages":[{"role":"user","content":"Say hi"}],"primary":"this/model-does-not-exist"}' \
| jq '.model, .attempt, .used_fallback'

attempt: 1 (or 2) — the next rung answered. In production, log this metric: a rising fallback rate on a class of content means it's time to move that content to a different primary, not to tweak retry logic.

5. Run the unit tests

Terminal window
pip install -e ".[dev]"
pytest -v

Seven tests cover the failure modes in this chain — content_filter=empty, transient 5xx, non-transient 4xx, all-models-fail.

With the service running and the tests green, the rest of this post explains why the chain is shaped this way.

Why one model doesn’t fit all

Three pressures push against a single-model setup:

Price elasticity by tier. A free user sending 20 messages a day at flagship-model prices burns cash. A paying top-tier user expects flagship quality. The economics do not agree on one model.

Content mode. Mainstream-aligned models can refuse content that legitimate companion/roleplay products allow on paid tiers. Conversely, more permissive models often have weaker long-context coherence.

Latency vs. depth. Instant conversational turns need sub-3-second responses. Long scene-writing can tolerate 10+ seconds for better prose. One model sacrifices one for the other.

The routing table

HoneyChat publishes its tier → model map:

| Tier                   | Default model            | Output tokens   |
| ---------------------- | ------------------------ | --------------- |
| Free / Basic / Premium | Qwen3-235B MoE           | 250 / 400 / 600 |
| VIP                    | Gemini 3.1 Flash Lite    | 800             |
| Elite                  | Aion-2.0 (RP fine-tuned) | 800             |

Plus one mode override: on explicit-content turns, lower tiers can route through Grok 4.1 Fast. The routing happens inside the handler; the user does not pick a model.
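The tier map plus the mode override can be sketched as a plain lookup. This is an illustrative sketch only — the function name, constant names, and model slugs here are placeholders, not the repo's actual identifiers.

```python
# Hypothetical sketch of the tier -> model lookup in the table above.
# Names and model slugs are illustrative, not the repo's real ones.
TIER_MODELS: dict[str, tuple[str, int]] = {
    "free":    ("qwen3-235b-moe", 250),
    "basic":   ("qwen3-235b-moe", 400),
    "premium": ("qwen3-235b-moe", 600),
    "vip":     ("gemini-3.1-flash-lite", 800),
    "elite":   ("aion-2.0-rp", 800),
}

# Mode override from the text: explicit-content turns on lower tiers
# route through a more permissive model instead of the tier default.
EXPLICIT_MODEL = "grok-4.1-fast"
LOWER_TIERS = {"free", "basic", "premium"}

def pick_model(tier: str, explicit: bool = False) -> tuple[str, int]:
    """Resolve (model, max output tokens) for one turn."""
    model, max_tokens = TIER_MODELS[tier]
    if explicit and tier in LOWER_TIERS:
        model = EXPLICIT_MODEL
    return model, max_tokens
```

The point of keeping this a dumb table: routing stays in the handler, and the user never picks a model.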

Figure: the fallback chain — a chat turn enters a tier-specific primary model; if the response comes back with finish_reason=content_filter and empty content, the call retries on a different-provider fallback; if that also fails, the last resort is a specialized backup model. The empty-completion guard is what triggers the retries.

The reasoning-model empty-completion edge case

Some reasoning-class model/provider combinations do server-side filtering before returning a final answer. On borderline turns, they may return a successful response with:

{
  "choices": [{
    "finish_reason": "content_filter",
    "message": { "content": "" }
  }]
}

Empty string. No exception. No status code. The user can see a blank reply if your retry logic only triggers on httpx.HTTPStatusError.

Resilient fallback chain

The guard for silent refusals is a tiny function, pulled straight from the tutorial:

tutorial/02-routing/app/router.py · _is_silent_refusal
def _is_silent_refusal(choice: dict) -> bool:
    """
    The whole point of this post: reasoning models can return a successful
    HTTP response with finish_reason=content_filter AND an empty content.
    If you only check HTTP status, you ship blank replies to users.
    """
    reason = choice.get("finish_reason")
    content = choice.get("message", {}).get("content") or ""
    return reason in ("content_filter", "length") and not content.strip()

And the whole fallback loop — the guard is invoked inside it, retrying on empty completions the same way it retries on HTTP errors:

tutorial/02-routing/app/router.py · complete
async def complete(
    messages: list[dict],
    *,
    primary: str | None = None,
    chain: Iterable[str] | None = None,
) -> CompletionResult:
    """Run the fallback chain. Return the first usable response."""
    models = list(chain) if chain is not None else _build_chain(primary)
    if not OPENROUTER_KEY:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    async with httpx.AsyncClient() as client:
        for attempt, model in enumerate(models):
            try:
                data = await _call(client, model, messages)
            except httpx.HTTPStatusError as e:
                if e.response.status_code in TRANSIENT_CODES:
                    log.warning("transient %s on %s — trying next", e.response.status_code, model)
                    continue
                raise
            except (httpx.ReadTimeout, httpx.ConnectError) as e:
                log.warning("network error on %s: %s — trying next", model, e)
                continue
            choice = (data.get("choices") or [{}])[0]
            if _is_silent_refusal(choice):
                log.warning("silent refusal on %s (reason=%s) — trying next", model, choice.get("finish_reason"))
                continue
            content = choice.get("message", {}).get("content") or ""
            if not content.strip():
                log.warning("empty content on %s — trying next", model)
                continue
            return CompletionResult(content=content, model=model, attempt=attempt)
    raise AllModelsFailedError(f"no model returned usable content; tried {models}")

Two details:

  1. Empty content check is separate from the finish reason. Some models return finish_reason=stop with empty content. Always check not content.strip().
  2. Track which model answered. Log attempt > 0 as a fallback event. If the primary fails on a class of content, move that content to a different primary.
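Both details can be pinned with a self-contained check. The guard below restates the logic of `_is_silent_refusal` inline so the snippet runs on its own; `is_unusable` is an illustrative helper name, not a function from the repo.

```python
# Inline restatement of the silent-refusal guard (mirrors the repo's
# _is_silent_refusal) so this snippet runs standalone.
def is_silent_refusal(choice: dict) -> bool:
    reason = choice.get("finish_reason")
    content = choice.get("message", {}).get("content") or ""
    return reason in ("content_filter", "length") and not content.strip()

def is_unusable(choice: dict) -> bool:
    # Detail 1 above: even finish_reason=stop can carry empty content,
    # so the emptiness check must not depend on the finish reason.
    content = choice.get("message", {}).get("content") or ""
    return is_silent_refusal(choice) or not content.strip()

# content_filter + empty content: the silent-refusal edge case
assert is_unusable({"finish_reason": "content_filter", "message": {"content": ""}})
# stop + empty content: still unusable, caught by the separate check
assert is_unusable({"finish_reason": "stop", "message": {"content": ""}})
# stop + real content: usable
assert not is_unusable({"finish_reason": "stop", "message": {"content": "Hi!"}})
```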

Picking the fallback order

For the instant/explicit mode:

  1. Content-mode primary → fast enough for chat
  2. Different-provider fallback → avoids shared provider behavior
  3. Abort → tell the user to try a shorter prompt

The rule: different-provider fallbacks. If the primary is hosted on provider A, prefer a fallback on provider B. Same-provider fallbacks can fail on the same content because provider-side filtering may sit upstream of the model.
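The different-provider rule can be encoded as a small chain builder. A minimal sketch, assuming a static provider map — the map, the model names, and `build_chain` are illustrative, not the repo's `_build_chain`:

```python
# Illustrative provider map; real entries would be OpenRouter slugs.
PROVIDER_OF = {
    "model-a": "provider-a",
    "model-b": "provider-b",
    "model-c": "provider-b",
    "last-resort": "provider-c",
}
FALLBACKS = ["model-b", "model-c", "last-resort"]

def build_chain(primary: str) -> list[str]:
    """Build a fallback chain that never repeats a provider."""
    chain = [primary]
    seen = {PROVIDER_OF.get(primary)}
    for model in FALLBACKS:
        provider = PROVIDER_OF.get(model)
        if model == primary or provider in seen:
            # Same-provider rungs can fail on the same content because
            # provider-side filtering may sit upstream of the model.
            continue
        chain.append(model)
        seen.add(provider)
    return chain
```

With this map, `build_chain("model-a")` yields `["model-a", "model-b", "last-resort"]` — `model-c` is skipped because it shares a provider with `model-b`.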

Content-level gating happens before the LLM, not after

The fallback chain handles model-level refusals. But if the user’s intent is above their plan’s content ceiling, retrying on a more permissive model just burns tokens before the user hits the real limit. Gate the content level in system-prompt assembly.

Keep the tier-level policy simple: the escalation class of the user’s intent should be the user’s plan ceiling. If over, the character responds in-character and the bot sends the upsell. The LLM does not need to know the tier exists; it just gets a system prompt with the right constraints for this turn.
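The tier-level policy reduces to one comparison before any model is picked. A hedged sketch — the plan names, numeric escalation levels, and `gate_turn` are examples, not HoneyChat's actual policy code:

```python
# Illustrative pre-LLM gate: compare the turn's escalation class against
# the plan ceiling before model selection. Levels are made-up examples.
PLAN_CEILING = {"free": 1, "premium": 2, "vip": 3, "elite": 4}

def gate_turn(tier: str, escalation_class: int) -> tuple[bool, int]:
    """Return (within_plan, effective_level).

    If the turn is over the ceiling, clamp to the ceiling: the character
    still responds in-character at that level, and the caller sends the
    upsell. The LLM only ever sees the effective level in its prompt.
    """
    ceiling = PLAN_CEILING[tier]
    if escalation_class <= ceiling:
        return True, escalation_class
    return False, ceiling
```

This is why retrying on a more permissive model is the wrong fix for over-ceiling turns: the gate fires before the router, so no tokens are spent discovering a limit the plan already defines.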

Instrumentation that matters

Log per call:

  • Model that answered (primary or fallback index)
  • TTFT vs total time — surfaces whether latency was model-side or network-side
  • Token cost (input + output) per message, bucketed by tier

Costs track in Redis counters with short TTL. A global daily ceiling blocks new generations if spend crosses a configured threshold (fail-closed: if the counter is unreachable, block, don’t pass).
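The fail-closed ceiling is worth spelling out, since the natural instinct is to fail open. A minimal sketch with a pluggable counter so it runs without Redis — in production `get_spend_today` would read a Redis counter (e.g. one maintained with INCRBYFLOAT and a short TTL); the names and ceiling value here are illustrative:

```python
DAILY_CEILING_USD = 50.0  # illustrative threshold

def may_generate(get_spend_today) -> bool:
    """Fail-closed spend check.

    get_spend_today: callable returning today's spend in USD; it raises
    on backend errors (e.g. the counter store is unreachable).
    """
    try:
        spend = get_spend_today()
    except Exception:
        # Fail closed: if we can't read the counter, we can't bound
        # spend, so block new generations rather than pass.
        return False
    return spend < DAILY_CEILING_USD
```

The asymmetry is deliberate: a blocked generation costs one annoyed user; an unbounded token bill costs real money.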

Where this lives

HoneyChat’s LLM router sits behind the chat handler on both Telegram and web. The public architecture reference lives in github.com/sm1ck/honeychat/blob/main/docs/architecture.md.

Previous: Persistent-memory dual-layer architecture. Next: Character consistency with custom LoRA.

FAQ

Why not use a single flagship LLM for everything?

Unit economics and content mode. A free user generating 20 messages a day at flagship prices burns cash for zero revenue. And mainstream-aligned models over-refuse on legitimate roleplay content that companion products allow on paid tiers. Routing lets price and content policy follow the product shape, not the model.

What is the content_filter empty-completion edge case?

Some reasoning/model-provider combinations can return a successful HTTP response with finish_reason=content_filter and an empty content string. If retry logic only catches HTTP errors, users may see blank replies.

Should fallbacks be same-provider or different-provider?

Different-provider. If the primary failed because of a provider-side moderation layer, the same-provider fallback can fail on the same content. Route to a different provider, or a specialized last-resort model, to avoid shared moderation behavior.

Where do you gate content level — model or pre-LLM?

Pre-LLM. The escalation class of the user's intent should be less than or equal to the user's plan ceiling before you even pick a model. Relying on the model to enforce the tier wastes tokens and can leave policy unpredictable. Gate at prompt assembly, not at the model boundary.

How do you observe routing quality in production?

Log the model that actually answered (primary vs fallback index), time-to-first-token vs total time, and input+output token costs bucketed by tier. A rising fallback rate on a class of content means it's time to change the primary, not the retry policy.
