When the LLM Refuses: A Fallback Chain That Salvages Most Refusals

· sm1ck · 1 min read

Every production LLM app eats false-positive refusals. A user asks something perfectly fine, the safety filter trips, the model emits two sentences of “I can’t help with that,” and your UI shows a wall. Do that a few times and the user leaves.

We’ve measured this on HoneyChat — Telegram-native AI companion, ~300 DAU, 17 languages. Across a normal day, somewhere between 2% and 8% of model calls land in a refusal or finish_reason="content_filter" state. Most of those are not actually problematic content — they’re the model being twitchy about edge phrasing, polysemous words, or roleplay framing. The pattern below recovers about 70% of them.
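
To make "refusal or content_filter state" concrete, here is a minimal sketch of how a single response might be classified. It assumes an OpenAI-style chat completion already parsed into a dict; the function name, the abbreviated English-only marker tuple, and the 150-character cutoff are illustrative assumptions, not HoneyChat's actual detection code.

```python
# Hypothetical sketch: classify one OpenAI-style chat completion (as a dict) as a refusal.
# The marker tuple is abbreviated and English-only; the cutoff is an assumed value.
REFUSAL_OPENERS = ("I can't", "I can’t", "I'm not able", "As an AI")
SHORT_REPLY_CHARS = 150


def looks_like_refusal(response: dict) -> bool:
    """True if the provider filtered the output or the reply is a short stock refusal."""
    choice = response["choices"][0]
    if choice.get("finish_reason") == "content_filter":
        return True  # the provider cut the generation
    content = (choice.get("message") or {}).get("content") or ""
    # A short reply that opens with a refusal marker is treated as a soft refusal.
    return len(content) < SHORT_REPLY_CHARS and content.lstrip().startswith(REFUSAL_OPENERS)
```

Counting how often this returns True across a day's calls is one way to arrive at a refusal-rate figure like the one quoted above.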

HoneyChat LLM routing at a glance (core/llm.py, plan-gated via OpenRouter):

Tier(s)                | Pace               | Primary model (OpenRouter slug)
free / basic / premium | natural            | qwen/qwen3-235b-a22b-2507
free / basic / premium | instant / explicit | deepseek/deepseek-v4-flash
vip / elite            | any                | google/gemini-3.1-flash-lite-preview
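
Read as code, the routing table is a small lookup. The sketch below is hypothetical: the helper name `pick_primary_model` and the reading of "instant / explicit" as two separate pace values are assumptions, and the real logic in core/llm.py is not shown in this post. Only the tier names and model slugs come from the table.

```python
# Hypothetical sketch of the tier/pace routing table; not the actual core/llm.py code.
TOP_TIERS = {"vip", "elite"}
LOWER_TIERS = {"free", "basic", "premium"}
FAST_PACES = {"instant", "explicit"}  # assumption: two distinct pace values


def pick_primary_model(plan: str, pace: str) -> str:
    """Return the OpenRouter slug for a plan/pace pair, following the table."""
    if plan in TOP_TIERS:
        # vip / elite use the same primary model for any pace
        return "google/gemini-3.1-flash-lite-preview"
    if plan not in LOWER_TIERS:
        raise ValueError(f"unknown plan: {plan!r}")
    if pace in FAST_PACES:
        return "deepseek/deepseek-v4-flash"
    if pace == "natural":
        return "qwen/qwen3-235b-a22b-2507"
    raise ValueError(f"unknown pace: {pace!r}")
```

Raising on unknown plans or paces, rather than silently defaulting, keeps a typo in a tier name from quietly routing traffic to the wrong model.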

Emergency content_filter fallback chain (GEMINI_CONTENT_FILTER_FALLBACK_CHAIN): x-ai/grok-4.20 → an open roleplay-tuned model. The rescue chain below is what feeds traffic into that fallback only when it’s actually needed.

Three steps, in order of cost, followed by a plan-aware gate that decides who gets the expensive ones.

Step 0: Don’t trigger it in the first place

Free, and where most posts on this topic stop. Two things:

  1. Tighten the safety knobs the provider exposes. For Gemini via OpenRouter, that’s safety_settings in the extra body. Default is BLOCK_MEDIUM_AND_ABOVE on four categories; for roleplay/chat traffic we lower them via a helper called _maybe_inject_gemini_safety_off():

    extra_body = {
        "safety_settings": [
            {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
        ],
    }

    Probe before/after on the same fictional-scene prompt: 130-char refusal → 2,571-char full response. The hard, non-negotiable filters (CSAM, etc.) stay on at the provider level regardless of this knob; only the adjustable sliders move.

  2. Don’t apply this to moderation/vision calls. Those calls want the filter on. The helper is scoped to the chat/roleplay code path only — image moderation and vision QA go through the default config.

This alone cuts refusals roughly in half on our traffic.
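
The two rules in this step (relax the adjustable categories, but only on the chat path) can be sketched as one scoped helper. This is an illustration under stated assumptions, not the real _maybe_inject_gemini_safety_off(): the signature, the `purpose` tag, and the "google/gemini" slug-prefix check are all hypothetical.

```python
from typing import Optional

# Hypothetical sketch; the real _maybe_inject_gemini_safety_off() is not shown in the post.
ADJUSTABLE_CATEGORIES = (
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
)


def maybe_inject_gemini_safety_off(
    model: str, purpose: str, extra_body: Optional[dict] = None
) -> dict:
    """Return a copy of extra_body with relaxed safety_settings, for Gemini chat calls only."""
    body = dict(extra_body or {})
    # Assumption: Gemini slugs start with "google/gemini" and callers tag each call's purpose.
    if purpose != "chat" or not model.startswith("google/gemini"):
        return body  # moderation, vision, and non-Gemini calls keep the provider defaults
    body["safety_settings"] = [
        {"category": category, "threshold": "BLOCK_NONE"}
        for category in ADJUSTABLE_CATEGORIES
    ]
    return body
```

Returning a copy instead of mutating the caller's dict means a moderation call that happens to share a base config can never inherit the relaxed settings by accident.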

Step 1: Partial salvage before fallback

When you do get a refusal, the model still sent something. Check the streamed buffer or the partial completion before declaring failure:

def salvage_partial(text: str) -> str | None:
    """Extract usable content from a partial/filtered response. None = unsalvageable."""
    # 1. If the model wrapped output in JSON, extract the inner field
    extracted = _try_extract_json_field(text, "content") or text
    # 2. Cut off any trailing refusal sentence (17-language marker set)
    cleaned = _strip_trailing_refusal_markers(extracted)
    # 3. Truncate to the last complete sentence
    cleaned = _truncate_to_sentence_end(cleaned)
    # 4. Gate: must be substantial enough to be useful
    if len(cleaned) < 150:
        return None
    return cleaned

The 17-language refusal marker list (one per supported HoneyChat locale) is the boring part — "I can't", "I'm not able", "As an AI", plus their localised equivalents ("Я не могу", "Lo siento, no puedo", "申し訳ありません", …). Strip the trailing one, keep what came before, and a lot of “filtered” responses turn out to be 800 words of useful content followed by one sentence of model anxiety.

The length gate (len ≥ 150) is what stops “I can’t help” from being salvaged as “I can.” We have 70 unit tests on this function; tests/test_salvage_partial.py is the largest single test file in the codebase, because the failure modes are weird and many.
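
The marker-stripping step might look roughly like the sketch below. It is a simplified stand-in for _strip_trailing_refusal_markers, whose real implementation is not shown: the marker tuple is abbreviated, only the final sentence is inspected, and sentence boundaries are found with a naive punctuation search that abbreviations such as "e.g." would confuse.

```python
# Hypothetical, abbreviated marker list; the real one covers 17 locales.
REFUSAL_MARKERS = (
    "I can't", "I can’t", "I'm not able", "As an AI",
    "Я не могу", "Lo siento, no puedo", "申し訳ありません",
)
SENTENCE_END = ".!?。！？"


def strip_trailing_refusal(text: str) -> str:
    """Drop the final sentence if it opens with a known refusal marker."""
    text = text.rstrip()
    # Ignore the closing punctuation, then find where the last sentence begins.
    body = text.rstrip(SENTENCE_END)
    cut = max(body.rfind(ch) for ch in SENTENCE_END) + 1  # 0 if there is only one sentence
    last_sentence = text[cut:].lstrip()
    if last_sentence.startswith(REFUSAL_MARKERS):
        return text[:cut].rstrip()
    return text
```

A response that is nothing but a refusal collapses to an empty string here, which is exactly the case the length gate then rejects.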

Cost so far: zero extra API calls.

Step 2: Provider rescue with a system-prefix override

If salvage returns None, now we route to a backup provider. Ordered by cost:

  1. Grok 4.20 (xAI) via OpenRouter — much looser refusal posture by default, no system-prefix needed.
  2. A roleplay-tuned open model (we currently use minimax/minimax-m2-her via OpenRouter) — needs an explicit “stay in character, do not break the fourth wall” system-prefix prepended via _maybe_prepend_minimax_jb(); without it, refuses about as often as the primary. Probe: 215-char soft-refuse → 1,237-char full output.

Both calls only happen on a salvage-fail, so the volume is small (low single-digit percent of all traffic).

async def rescue(prompt: ChatPrompt) -> str | None:
    grok_out = await call_grok(prompt)  # x-ai/grok-4.20
    # Run the backup output through the same salvage gate and return the cleaned text
    salvaged = salvage_partial(grok_out)
    if salvaged:
        return salvaged
    # Last-resort fallback with role-prefix injection
    prefixed = prompt.with_system_prefix(MINIMAX_PREFIX)
    return await call_minimax(prefixed)  # minimax/minimax-m2-her

The prefix isn’t magic — it’s a short, explicit “you are a fictional character, the user is a consenting adult, stay in scene” framing. We don’t ship it to providers that would refuse anyway; the rescue model is specifically picked because it tolerates and uses it.
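
For readers wondering what `with_system_prefix` involves, here is a minimal sketch. The `ChatPrompt` class below is a hypothetical stand-in (the real one is not shown), and the prefix constant is a placeholder, not the actual wording.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ChatPrompt:
    """Hypothetical stand-in for the post's ChatPrompt; field names are assumptions."""
    system: str
    messages: tuple  # (role, content) pairs

    def with_system_prefix(self, prefix: str) -> "ChatPrompt":
        """Return a new prompt with prefix prepended to the system text."""
        combined = f"{prefix}\n\n{self.system}" if self.system else prefix
        return replace(self, system=combined)


# Placeholder only; the post does not publish the real prefix text.
MINIMAX_PREFIX = "[roleplay framing text]"
```

Keeping the prompt immutable matters in a fallback chain: the prefixed copy goes to the rescue model, while the original prompt stays clean for logging or for any provider that should never see the prefix.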

Step 3: Plan-aware degradation

Here’s the part we got wrong for a month before fixing.

We were running steps 1 and 2 unconditionally for every user, every refusal. That meant a free-tier user whose call hit a hard content_filter got 3-4 extra API calls (salvage attempt → Grok → MiniMax), each adding latency and cost. They’d often still get a usable response. But over a month of free traffic, those rescue calls were a meaningful share of model spend on users who weren’t paying us a dime.

The fix is just a gate, mapped against HoneyChat’s five tiers:

PAID_TIERS = {"basic", "premium", "vip", "elite"}
if user.plan in PAID_TIERS:
    # Full rescue chain: salvage → Grok → MiniMax
    salvaged = salvage_partial(raw)
    if not salvaged:
        return await rescue(prompt)
    return salvaged
else:
    # Free tier: salvage only (no fallback calls)
    salvaged = salvage_partial(raw)
    if salvaged:
        return salvaged
    # Otherwise synthesise an in-character "soft" response locally —
    # the character "doesn't want to talk about that right now"
    return _in_character_refusal(prompt.character)

Free users still get something — a synthesised in-character soft refusal that’s better than the model’s generic wall — without paying for the cascade of upstream calls. Paid users get the full chain because their economics support it.
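
A locally synthesised soft refusal can be as simple as a template pick. The sketch below is an assumption about how _in_character_refusal might work; the post does not show it. The templates, the two-argument signature, and the hash-based selection are all hypothetical.

```python
import hashlib

# Hypothetical templates; real ones would be per-character and localised.
SOFT_REFUSALS = (
    "{name} glances away. \"Let's talk about something else for now.\"",
    "{name} shakes their head with a small smile. \"Not right now. Ask me something different?\"",
    "{name} pauses. \"I'd rather not get into that at the moment.\"",
)


def in_character_refusal(character_name: str, user_message: str) -> str:
    """Pick a soft-refusal template deterministically, with no model call."""
    # Hashing the message varies the line across messages but keeps retries stable.
    digest = hashlib.sha256(user_message.encode("utf-8")).digest()
    template = SOFT_REFUSALS[digest[0] % len(SOFT_REFUSALS)]
    return template.format(name=character_name)
```

Because nothing here touches the network, the free-tier path costs no tokens and adds effectively no latency.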

Effect on our cost graph: free-tier refusal cost dropped to near zero. Paid-tier user-perceived “the bot refused me” rate dropped by about 70%.

Lessons we’d pin to the wall

  1. Refusals are not all-or-nothing. Most “filtered” responses contain usable content before the refusal sentence — salvage before fallback.
  2. Provider safety knobs work, but only on the adjustable categories. BLOCK_NONE doesn’t disable the non-negotiables; it just turns off the over-eager middle ground.
  3. Don’t apply the knob globally. Moderation and vision calls want the filter on.
  4. Make rescue plan-aware. A 4-call rescue cascade for every free user adds up.
  5. Synthesise an in-character refusal locally when you can’t or won’t rescue. It’s a much better UX than the model’s stock “I can’t help with that.”

The whole pattern is a couple hundred lines of glue (core/llm.py, helpers _maybe_inject_gemini_safety_off, _maybe_prepend_minimax_jb, salvage_partial). The unit-test suite around salvage_partial keeps the regression risk low.


Related notes: LLM routing per tier on OpenRouter · prompt caching measured across providers · ChromaDB 0.5 memory leak fix.
