Every production LLM app eats false-positive refusals. A user asks something perfectly fine, the safety filter trips, the model emits two sentences of “I can’t help with that,” and your UI shows a wall. Do that a few times and the user leaves.
We’ve measured this on HoneyChat — Telegram-native AI companion, ~300 DAU, 17 languages. Across a normal day, somewhere between 2% and 8% of model calls land in a refusal or finish_reason="content_filter" state. Most of those are not actually problematic content — they’re the model being twitchy about edge phrasing, polysemous words, or roleplay framing. The pattern below recovers about 70% of them.
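Getting a number like that means classifying every completion first. A minimal sketch, assuming an OpenAI-style response where `finish_reason` arrives as a string. The names `is_refusal` and `refusal_rate`, the marker tuple, and the 400-character cutoff are illustrative assumptions, not HoneyChat's actual code:

```python
from __future__ import annotations

# Illustrative subset of stock refusal openers (assumption, English only).
REFUSAL_MARKERS = ("i can't", "i can’t", "i cannot", "i'm not able", "as an ai")


def is_refusal(text: str, finish_reason: str | None) -> bool:
    """Heuristically flag a completion as a filter hit or a soft refusal."""
    # A provider-side block is reported through finish_reason.
    if finish_reason == "content_filter":
        return True
    # Soft refusals tend to be short and to open with a stock phrase.
    head = text.strip().lower()[:80]
    return len(text) < 400 and any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(calls: list[tuple[str, str | None]]) -> float:
    """Share of (text, finish_reason) pairs classified as refusals."""
    if not calls:
        return 0.0
    return sum(is_refusal(text, reason) for text, reason in calls) / len(calls)
```

A substring heuristic like this will misfire on in-story dialogue, so it suits dashboards better than user-facing decisions.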
HoneyChat LLM routing at a glance (core/llm.py, plan-gated via OpenRouter):
| Tier(s) | Pace | Primary model (OpenRouter slug) |
|---|---|---|
| free / basic / premium | natural | `qwen/qwen3-235b-a22b-2507` |
| free / basic / premium | instant / explicit | `deepseek/deepseek-v4-flash` |
| vip / elite | any | `google/gemini-3.1-flash-lite-preview` |

Emergency content_filter fallback chain (GEMINI_CONTENT_FILTER_FALLBACK_CHAIN): x-ai/grok-4.20 → an open roleplay-tuned model. The rescue chain below is what feeds traffic into that fallback only when it’s actually needed.
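The routing above can be sketched as a small lookup. The slugs and the `GEMINI_CONTENT_FILTER_FALLBACK_CHAIN` name come from this post; the function name `pick_primary_model`, the constants, and the error handling are assumptions made for illustration:

```python
from __future__ import annotations

QWEN = "qwen/qwen3-235b-a22b-2507"
DEEPSEEK = "deepseek/deepseek-v4-flash"
GEMINI = "google/gemini-3.1-flash-lite-preview"

TOP_TIERS = {"vip", "elite"}
LOWER_TIERS = {"free", "basic", "premium"}

# Second hop is the roleplay-tuned model named in Step 2.
GEMINI_CONTENT_FILTER_FALLBACK_CHAIN = ["x-ai/grok-4.20", "minimax/minimax-m2-her"]


def pick_primary_model(plan: str, pace: str) -> str:
    """Map (plan, pace) to the primary OpenRouter slug from the table."""
    if plan in TOP_TIERS:
        return GEMINI  # vip / elite use Gemini at any pace
    if plan not in LOWER_TIERS:
        raise ValueError(f"unknown plan: {plan!r}")
    if pace == "natural":
        return QWEN
    if pace in {"instant", "explicit"}:
        return DEEPSEEK
    raise ValueError(f"unknown pace: {pace!r}")
```

Raising on unknown plans or paces is a design choice of this sketch; a production router might prefer a default model instead.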
Four steps, numbered from 0, roughly in order of cost.
Step 0: Don’t trigger it in the first place
Free, and where most posts on this topic stop. Two things:
- Tighten the safety knobs the provider exposes. For Gemini via OpenRouter, that’s `safety_settings` in the extra body. Default is `BLOCK_MEDIUM_AND_ABOVE` on four categories; for roleplay/chat traffic we lower them via a helper called `_maybe_inject_gemini_safety_off()`:

  ```python
  extra_body = {
      "safety_settings": [
          {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
          {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
          {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
          {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
      ],
  }
  ```

  Probe before/after on the same fictional-scene prompt: 130-char refusal → 2,571-char full response. The hard, non-negotiable filters (CSAM, etc.) stay on at the provider level regardless of this knob; only the adjustable sliders move.
- Don’t apply this to moderation/vision calls. Those calls want the filter on. The helper is scoped to the chat/roleplay code path only; image moderation and vision QA go through the default config.
This alone cuts refusals roughly in half on our traffic.
Step 1: Partial salvage before fallback
When you do get a refusal, the model still sent something. Check the streamed buffer or the partial completion before declaring failure:
```python
def salvage_partial(text: str) -> str | None:
    """Extract usable content from a partial/filtered response. None = unsalvageable."""
    # 1. If the model wrapped output in JSON, extract the inner field
    extracted = _try_extract_json_field(text, "content") or text

    # 2. Cut off any trailing refusal sentence (17-language marker set)
    cleaned = _strip_trailing_refusal_markers(extracted)

    # 3. Truncate to the last complete sentence
    cleaned = _truncate_to_sentence_end(cleaned)

    # 4. Gate: must be substantial enough to be useful
    if len(cleaned) < 150:
        return None
    return cleaned
```

The 17-language refusal marker list (one per supported HoneyChat locale) is the boring part: "I can't", "I'm not able", "As an AI", plus their localised equivalents ("Я не могу", "Lo siento, no puedo", "申し訳ありません", …). Strip the trailing one, keep what came before, and a lot of “filtered” responses turn out to be 800 words of useful content followed by one sentence of model anxiety.
The gate (len ≥ 150) is what stops “I can’t help” from being salvaged as “I can.” We have 70 unit tests on this function; `tests/test_salvage_partial.py` is the largest single test file in the codebase, because the failure modes are weird and many.
Cost so far: zero extra API calls.
Step 2: Provider rescue with a system-prefix override
If salvage returns None, now we route to a backup provider. Ordered by cost:
- Grok 4.20 (xAI) via OpenRouter — much looser refusal posture by default, no system-prefix needed.
- A roleplay-tuned open model (we currently use `minimax/minimax-m2-her` via OpenRouter): needs an explicit “stay in character, do not break the fourth wall” system-prefix prepended via `_maybe_prepend_minimax_jb()`; without it, it refuses about as often as the primary. Probe: 215-char soft-refuse → 1,237-char full output.
Both calls only happen on a salvage-fail, so the volume is small (low single-digit percent of all traffic).
```python
async def rescue(prompt: ChatPrompt) -> str | None:
    grok_out = await call_grok(prompt)  # x-ai/grok-4.20
    # Return the cleaned text rather than the raw output, so a trailing
    # refusal sentence from the backup model is stripped as well.
    salvaged = salvage_partial(grok_out)
    if salvaged:
        return salvaged
    # Last-resort fallback with role-prefix injection
    prefixed = prompt.with_system_prefix(MINIMAX_PREFIX)
    return await call_minimax(prefixed)  # minimax/minimax-m2-her
```

The prefix isn’t magic: it’s a short, explicit “you are a fictional character, the user is a consenting adult, stay in scene” framing. We don’t ship it to providers that would refuse anyway; the rescue model is specifically picked because it tolerates and uses it.
Step 3: Plan-aware degradation
Here’s the part we got wrong for a month before fixing.
We were running steps 1 and 2 unconditionally for every user, every refusal. That meant a free-tier user whose call hit a hard content_filter got 3-4 extra API calls (salvage attempt → Grok → MiniMax), each adding latency and cost. They’d often still get a usable response. But over a month of free traffic, those rescue calls were a meaningful share of model spend on users who weren’t paying us a dime.
The fix is just a gate, mapped against HoneyChat’s five tiers:
```python
PAID_TIERS = {"basic", "premium", "vip", "elite"}

if user.plan in PAID_TIERS:
    # Full rescue chain: salvage → Grok → MiniMax
    salvaged = salvage_partial(raw)
    if not salvaged:
        return await rescue(prompt)
    return salvaged
else:
    # Free tier: salvage only (no fallback calls)
    salvaged = salvage_partial(raw)
    if salvaged:
        return salvaged
    # Otherwise synthesise an in-character "soft" response locally:
    # the character "doesn't want to talk about that right now"
    return _in_character_refusal(prompt.character)
```

Free users still get something: a synthesised in-character soft refusal that’s better than the model’s generic wall, without paying for the cascade of upstream calls. Paid users get the full chain because their economics support it.
Effect on our cost graph: free-tier refusal cost dropped to near zero. Paid-tier user-perceived “the bot refused me” rate dropped by about 70%.
Lessons we’d pin to the wall
- Refusals are not all-or-nothing. Most “filtered” responses contain usable content before the refusal sentence — salvage before fallback.
- Provider safety knobs work, but only on the adjustable categories. `BLOCK_NONE` doesn’t disable the non-negotiables; it just turns off the over-eager middle ground.
- Don’t apply the knob globally. Moderation and vision calls want the filter on.
- Make rescue plan-aware. A 4-call rescue cascade for every free user adds up.
- Synthesise an in-character refusal locally when you can’t or won’t rescue. It’s a much better UX than the model’s stock “I can’t help with that.”
The whole pattern is a couple hundred lines of glue (core/llm.py, helpers _maybe_inject_gemini_safety_off, _maybe_prepend_minimax_jb, salvage_partial). The unit-test suite around salvage_partial keeps the regression risk low.