HoneyChat

LLM routing per tier via OpenRouter — when one model doesn't fit all

sm1ck · 1 min read

📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/02-routing. `docker compose up` exposes POST /complete on localhost:8000. Every snippet below is pulled directly from that repo.

Every “chat with AI” tutorial picks one model and calls it a day. That works in a toy. It doesn’t work in production, where users have different price elasticity, different conversation styles, and different tolerance for content.

Here’s how HoneyChat routes LLM calls across a handful of providers via OpenRouter, how that routing handles finish_reason=content_filter empty-completion edge cases, and the fallback chain pattern that keeps replies flowing.

TL;DR

  • Route by tier (price elasticity) and by content mode (kind of turn). A single default model can’t do both.
  • Some reasoning/model-provider combinations can return finish_reason=content_filter with empty content on borderline turns. A retry policy that only catches HTTP errors can miss this.
  • Working pattern: primary → different-provider fallback → specialized last resort, with retries triggered by both error responses and suspicious empty completions.

Run it yourself in 3 minutes

Get the fallback chain live so you can watch a real content_filter retry. All of this happens inside tutorial/02-routing.

1. Clone and configure

Terminal window
git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/02-routing
cp .env.example .env

Open .env and paste your OPENROUTER_API_KEY (available from the OpenRouter dashboard). The three default model slots all point to free-tier OpenRouter models, so you can experiment without spending.

2. Start the service

Terminal window
docker compose up --build -d
curl http://localhost:8000/health # {"ok":true}

3. Send a normal turn — primary answers

Terminal window
curl -X POST http://localhost:8000/complete \
-H 'content-type: application/json' \
-d '{"messages":[{"role":"user","content":"Name three cold-climate fruits."}]}' \
| jq

Expected response (the important parts):

{
  "content": "Apples, pears, and cloudberries...",
  "model": "meta-llama/llama-3.1-8b-instruct:free",
  "attempt": 0,
  "used_fallback": false
}

attempt: 0 means the primary model answered. used_fallback: false means no retry was needed.

4. Force a fallback

Override the primary to point at a model you know tends to refuse — or any bogus model name — and watch the chain kick in:

Terminal window
curl -X POST http://localhost:8000/complete \
-H 'content-type: application/json' \
-d '{"messages":[{"role":"user","content":"Say hi"}],"primary":"this/model-does-not-exist"}' \
| jq '.model, .attempt, .used_fallback'

attempt: 1 (or 2) — the next rung answered. In production, log this metric: a rising fallback rate on a class of content means it's time to move that content to a different primary, not to tweak retry logic.

5. Run the unit tests

Terminal window
pip install -e ".[dev]"
pytest -v

Seven tests cover the failure modes in this chain — content_filter=empty, transient 5xx, non-transient 4xx, all-models-fail.

With the service running and the tests green, the rest of this post explains why the chain is shaped this way.

Why one model doesn’t fit all

Three pressures push against a single-model setup:

Price elasticity by tier. A free user sending 20 messages a day at flagship-model prices burns cash. A paying top-tier user expects flagship quality. The economics do not agree on one model.

Content mode. Mainstream-aligned models can refuse content that legitimate companion/roleplay products allow on paid tiers. Conversely, more permissive models often have weaker long-context coherence.

Latency vs. depth. Instant conversational turns need sub-3-second responses. Long scene-writing can tolerate 10+ seconds for better prose. One model sacrifices one for the other.

The routing table

HoneyChat publishes its tier → model map:

| Tier                   | Default model            | Output tokens   |
| ---------------------- | ------------------------ | --------------- |
| Free / Basic / Premium | Qwen3-235B MoE           | 250 / 400 / 600 |
| VIP                    | Gemini 3.1 Flash Lite    | 800             |
| Elite                  | Aion-2.0 (RP fine-tuned) | 800             |

Plus one mode override: on explicit-content turns, lower tiers can route through Grok 4.1 Fast. The routing happens inside the handler; the user does not pick a model.
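The tier map plus the mode override can be sketched as a plain lookup. This is an illustrative sketch only — the function name, constant names, and model slugs here are placeholders, not the repo's actual identifiers.

```python
# Hypothetical sketch of the tier -> model lookup in the table above.
# Names and model slugs are illustrative, not the repo's real ones.
TIER_MODELS: dict[str, tuple[str, int]] = {
    "free":    ("qwen3-235b-moe", 250),
    "basic":   ("qwen3-235b-moe", 400),
    "premium": ("qwen3-235b-moe", 600),
    "vip":     ("gemini-3.1-flash-lite", 800),
    "elite":   ("aion-2.0-rp", 800),
}

# Mode override from the text: explicit-content turns on lower tiers
# route through a more permissive model instead of the tier default.
EXPLICIT_MODEL = "grok-4.1-fast"
LOWER_TIERS = {"free", "basic", "premium"}

def pick_model(tier: str, explicit: bool = False) -> tuple[str, int]:
    """Resolve (model, max output tokens) for one turn."""
    model, max_tokens = TIER_MODELS[tier]
    if explicit and tier in LOWER_TIERS:
        model = EXPLICIT_MODEL
    return model, max_tokens
```

The point of keeping this a dumb table: routing stays in the handler, and the user never picks a model.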

Figure: the fallback chain — a chat turn enters a tier-specific primary model; if the response comes back with finish_reason=content_filter and empty content, the call retries on a different-provider fallback; if that also fails, the last resort is a specialized backup model. The empty-completion guard is what triggers the retries.

The reasoning-model empty-completion edge case

Some reasoning-class model/provider combinations do server-side filtering before returning a final answer. On borderline turns, they may return a successful response with:

{
  "choices": [{
    "finish_reason": "content_filter",
    "message": { "content": "" }
  }]
}

Empty string. No exception. No status code. The user can see a blank reply if your retry logic only triggers on httpx.HTTPStatusError.

Resilient fallback chain

The guard for silent refusals is a tiny function, pulled straight from the tutorial:

tutorial/02-routing/app/router.py · _is_silent_refusal
def _is_silent_refusal(choice: dict) -> bool:
    """
    The whole point of this post: reasoning models can return a successful
    HTTP response with finish_reason=content_filter AND an empty content.
    If you only check HTTP status, you ship blank replies to users.
    """
    reason = choice.get("finish_reason")
    content = choice.get("message", {}).get("content") or ""
    return reason in ("content_filter", "length") and not content.strip()

And the whole fallback loop — the guard is invoked inside it, retrying on empty completions the same way it retries on HTTP errors:

tutorial/02-routing/app/router.py · complete
async def complete(
    messages: list[dict],
    *,
    primary: str | None = None,
    chain: Iterable[str] | None = None,
) -> CompletionResult:
    """Run the fallback chain. Return the first usable response."""
    models = list(chain) if chain is not None else _build_chain(primary)
    if not OPENROUTER_KEY:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    async with httpx.AsyncClient() as client:
        for attempt, model in enumerate(models):
            try:
                data = await _call(client, model, messages)
            except httpx.HTTPStatusError as e:
                if e.response.status_code in TRANSIENT_CODES:
                    log.warning("transient %s on %s — trying next", e.response.status_code, model)
                    continue
                raise
            except (httpx.ReadTimeout, httpx.ConnectError) as e:
                log.warning("network error on %s: %s — trying next", model, e)
                continue
            choice = (data.get("choices") or [{}])[0]
            if _is_silent_refusal(choice):
                log.warning("silent refusal on %s (reason=%s) — trying next", model, choice.get("finish_reason"))
                continue
            content = choice.get("message", {}).get("content") or ""
            if not content.strip():
                log.warning("empty content on %s — trying next", model)
                continue
            return CompletionResult(content=content, model=model, attempt=attempt)
    raise AllModelsFailedError(f"no model returned usable content; tried {models}")

Two details:

  1. Empty content check is separate from the finish reason. Some models return finish_reason=stop with empty content. Always check not content.strip().
  2. Track which model answered. Log attempt > 0 as a fallback event. If the primary fails on a class of content, move that content to a different primary.
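Both details can be pinned with a self-contained check. The guard below restates the logic of `_is_silent_refusal` inline so the snippet runs on its own; `is_unusable` is an illustrative helper name, not a function from the repo.

```python
# Inline restatement of the silent-refusal guard (mirrors the repo's
# _is_silent_refusal) so this snippet runs standalone.
def is_silent_refusal(choice: dict) -> bool:
    reason = choice.get("finish_reason")
    content = choice.get("message", {}).get("content") or ""
    return reason in ("content_filter", "length") and not content.strip()

def is_unusable(choice: dict) -> bool:
    # Detail 1 above: even finish_reason=stop can carry empty content,
    # so the emptiness check must not depend on the finish reason.
    content = choice.get("message", {}).get("content") or ""
    return is_silent_refusal(choice) or not content.strip()

# content_filter + empty content: the silent-refusal edge case
assert is_unusable({"finish_reason": "content_filter", "message": {"content": ""}})
# stop + empty content: still unusable, caught by the separate check
assert is_unusable({"finish_reason": "stop", "message": {"content": ""}})
# stop + real content: usable
assert not is_unusable({"finish_reason": "stop", "message": {"content": "Hi!"}})
```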

Picking the fallback order

For the instant/explicit mode:

  1. Content-mode primary → fast enough for chat
  2. Different-provider fallback → avoids shared provider behavior
  3. Abort → tell the user to try a shorter prompt

The rule: different-provider fallbacks. If the primary is hosted on provider A, prefer a fallback on provider B. Same-provider fallbacks can fail on the same content because provider-side filtering may sit upstream of the model.
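The different-provider rule can be encoded as a small chain builder. A minimal sketch, assuming a static provider map — the map, the model names, and `build_chain` are illustrative, not the repo's `_build_chain`:

```python
# Illustrative provider map; real entries would be OpenRouter slugs.
PROVIDER_OF = {
    "model-a": "provider-a",
    "model-b": "provider-b",
    "model-c": "provider-b",
    "last-resort": "provider-c",
}
FALLBACKS = ["model-b", "model-c", "last-resort"]

def build_chain(primary: str) -> list[str]:
    """Build a fallback chain that never repeats a provider."""
    chain = [primary]
    seen = {PROVIDER_OF.get(primary)}
    for model in FALLBACKS:
        provider = PROVIDER_OF.get(model)
        if model == primary or provider in seen:
            # Same-provider rungs can fail on the same content because
            # provider-side filtering may sit upstream of the model.
            continue
        chain.append(model)
        seen.add(provider)
    return chain
```

With this map, `build_chain("model-a")` yields `["model-a", "model-b", "last-resort"]` — `model-c` is skipped because it shares a provider with `model-b`.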

Content-level gating happens before the LLM, not after

The fallback chain handles model-level refusals. But if the user’s intent is above their plan’s content ceiling, retrying on a more permissive model just burns tokens before the user hits the real limit. Gate the content level in system-prompt assembly.

Keep the tier-level policy simple: the escalation class of the user’s intent should be the user’s plan ceiling. If over, the character responds in-character and the bot sends the upsell. The LLM does not need to know the tier exists; it just gets a system prompt with the right constraints for this turn.
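The tier-level policy reduces to one comparison before any model is picked. A hedged sketch — the plan names, numeric escalation levels, and `gate_turn` are examples, not HoneyChat's actual policy code:

```python
# Illustrative pre-LLM gate: compare the turn's escalation class against
# the plan ceiling before model selection. Levels are made-up examples.
PLAN_CEILING = {"free": 1, "premium": 2, "vip": 3, "elite": 4}

def gate_turn(tier: str, escalation_class: int) -> tuple[bool, int]:
    """Return (within_plan, effective_level).

    If the turn is over the ceiling, clamp to the ceiling: the character
    still responds in-character at that level, and the caller sends the
    upsell. The LLM only ever sees the effective level in its prompt.
    """
    ceiling = PLAN_CEILING[tier]
    if escalation_class <= ceiling:
        return True, escalation_class
    return False, ceiling
```

This is why retrying on a more permissive model is the wrong fix for over-ceiling turns: the gate fires before the router, so no tokens are spent discovering a limit the plan already defines.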

Instrumentation that matters

Log per call:

  • Model that answered (primary or fallback index)
  • TTFT vs total time — surfaces whether latency was model-side or network-side
  • Token cost (input + output) per message, bucketed by tier

Costs track in Redis counters with short TTL. A global daily ceiling blocks new generations if spend crosses a configured threshold (fail-closed: if the counter is unreachable, block, don’t pass).
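The fail-closed ceiling is worth spelling out, since the natural instinct is to fail open. A minimal sketch with a pluggable counter so it runs without Redis — in production `get_spend_today` would read a Redis counter (e.g. one maintained with INCRBYFLOAT and a short TTL); the names and ceiling value here are illustrative:

```python
DAILY_CEILING_USD = 50.0  # illustrative threshold

def may_generate(get_spend_today) -> bool:
    """Fail-closed spend check.

    get_spend_today: callable returning today's spend in USD; it raises
    on backend errors (e.g. the counter store is unreachable).
    """
    try:
        spend = get_spend_today()
    except Exception:
        # Fail closed: if we can't read the counter, we can't bound
        # spend, so block new generations rather than pass.
        return False
    return spend < DAILY_CEILING_USD
```

The asymmetry is deliberate: a blocked generation costs one annoyed user; an unbounded token bill costs real money.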

Where this lives

HoneyChat’s LLM router sits behind the chat handler on both Telegram and web. The public architecture reference lives in github.com/sm1ck/honeychat/blob/main/docs/architecture.md.

Previous: Persistent-memory dual-layer architecture. Next: Character consistency with custom LoRA.

FAQ

Why not use a single flagship LLM for everything?

Unit economics and content mode. A free user generating 20 messages a day at flagship prices burns cash for zero revenue. And mainstream-aligned models over-refuse on legitimate roleplay content that companion products allow on paid tiers. Routing lets price and content policy follow the product shape, not the model.

What is the content_filter empty-completion edge case?

Some reasoning/model-provider combinations can return a successful HTTP response with finish_reason=content_filter and an empty content string. If retry logic only catches HTTP errors, users may see blank replies.

Should fallbacks be same-provider or different-provider?

Different-provider. If the primary failed because of a provider-side moderation layer, the same-provider fallback can fail on the same content. Route to a different provider, or a specialized last-resort model, to avoid shared moderation behavior.

Where do you gate content level — model or pre-LLM?

Pre-LLM. The escalation class of the user's intent should be less than or equal to the user's plan ceiling before you even pick a model. Relying on the model to enforce the tier wastes tokens and can leave policy unpredictable. Gate at prompt assembly, not at the model boundary.

How do you observe routing quality in production?

Log the model that actually answered (primary vs fallback index), time-to-first-token vs total time, and input+output token costs bucketed by tier. A rising fallback rate on a class of content means it's time to change the primary, not the retry policy.
