HoneyChat HoneyChat

We Deleted 10 Real Users with a Test-Cleanup Script — RCA

· sm1ck · 7 min read
We Deleted 10 Real Users with a Test-Cleanup Script — RCA

The incident, in two lines

On 2026-05-11, a test-cleanup script on HoneyChat (Telegram-native AI companion, ~3 months in production, ~300 DAU, PostgreSQL 16 + Redis) ran:

DELETE FROM users WHERE id BETWEEN -91111200 AND -91111100;

About ten real OAuth users had IDs in that narrow window. They were now gone. Their users row, their subscriptions row, their chat_sessions / web_messages — all gone from Postgres, and recovery from backup was effectively impossible (more on that below).

This is the postmortem and the contract we now run instead. The honest version: the destructive script went to prod on a schema I never verified end-to-end. Three separate design mistakes lined up to make it possible, and not one of them was caught before the script ran on a Tuesday night.

HoneyChat at a glance (for context): Stack runs aiogram (Telegram bot) + FastAPI (web API) + Celery workers + Next.js 15 web + Astro blog + React/Vite Mini App. Storage: PostgreSQL 16, Redis (90-day TTL on chat history), ChromaDB 0.5 (vector memory), Storj S3 (artifacts + backups). Off-site backup volume is ~7 GB / ~$0.03 a month — cost is not the constraint, cadence is.

How the same negative IDs ended up shared between test and real users

Two signup paths feed the users table:

PopulationID source
Telegram users (most of base)Positive integers — Telegram’s own user IDs come in on the message envelope
OAuth users (Google / Discord, web sign-in)Negative integers from a Postgres sequence web_user_id_seq

OAuth IDs were negative on purpose — to keep them out of the positive Telegram-ID space and avoid collisions when a Telegram user later signed in via web. The minter in api/web_auth.py looked roughly like this:

async def _allocate_negative_user_id(db) -> int:
for _ in range(5): # retry on rare UniqueViolation
new_id = -(await db.fetchval("SELECT nextval('web_user_id_seq')"))
try:
await db.execute("INSERT INTO users (id, ...) VALUES (%s, ...)", new_id)
return new_id
except UniqueViolation:
# someone else took it; bump the sequence past current MIN(id) and retry
await db.execute(
"SELECT setval('web_user_id_seq', GREATEST(-MIN(id), currval('web_user_id_seq')))"
" FROM users"
)
raise RuntimeError("could not allocate user id after 5 retries")

The setval(GREATEST(-MIN(id), current)) step is the load-bearing piece you have to keep in mind. It says: whatever the most-negative users.id is right now, my sequence should be at least that far advanced, so I never collide with it again.

For QA I was creating test users by hand with hardcoded negative IDs like -91111101, -91111102, … via INSERT ... ON CONFLICT (id) DO UPDATE. Easy to remember, easy to clean up later by range.

That choice triggered three independent failure modes, each on its own benign, lethal in combination:

  1. The first hardcoded test-user insert pushed web_user_id_seq to 91 111 101. Because of the setval(GREATEST(...)) line above, the very next OAuth signup retry saw the new test row with id = -91111101, computed -MIN(id) = 91111101, and advanced its own sequence. From that moment on, all real OAuth signups were drawing IDs in the neighbourhood of -91111111, -91111112, … — right inside the window where my test users lived.
  2. My test-user inserts used INSERT ... ON CONFLICT (id) DO UPDATE. When a real OAuth signup happened to land on the same ID I’d hardcoded, my script silently overwrote that user’s plan, auth_source and several other fields instead of erroring.
  3. The cleanup script then ran DELETE … WHERE id BETWEEN -91111200 AND -91111100 to remove the test users. Anyone whose OAuth ID had drifted into that 100-row window was a real user, and they went too.

None of these three behaviors is exotic. The setval(GREATEST(...)) retry pattern is a normal way to handle UniqueViolation on a seeded sequence. ON CONFLICT DO UPDATE is a normal Postgres upsert. Range-DELETE is a normal cleanup pattern. Each was safe on its own; the interaction of all three was lethal — and I never set up a staging run that would have surfaced the interaction before it touched prod.

A 30-second sanity check on the second insert (“did adding id = -91111101 move web_user_id_seq? what does the next OAuth signup land on?”) would have shown the cascading effect immediately. Nobody — me — ran it. The cleanup script ran nightly for weeks looking healthy because real OAuth signup volume hadn’t yet pushed a real ID into the deletion window.

What got deleted, what we couldn’t recover

Recovery from Postgres backup was effectively impossible. The chain:

  • The most recent pg_dump to Storj was about 22 hours old — and pre-dated my test-user inserts. The dump didn’t contain even the affected rows in their pre-overwrite state, because the ON CONFLICT DO UPDATE had already mutated their plan and auth_source columns earlier the same day.
  • WAL archiving was on the “after the next sprint” list and wasn’t on. So there was no point-in-time recovery between hourly snapshots.
  • Autovacuum had run between the DELETE and our discovery of the incident, so dead tuples on the relevant users pages were gone too.

What we could salvage came from side channels:

  • Recent chat turns — Redis with a 90-day TTL held the most-recent ~20 turns per affected session. We PERSIST-ed what looked important and reconstructed recent conversations for affected users.
  • Plan / subscription state — rebuilt from each payment provider’s webhook log. Our payments run over three providers (Telegram Stars as global primary, card payments through a regional web checkout, and CryptoBot for TON on the non-RU surface), all of which keep their own server-side record of who paid what.
  • chat_sessions and web_messages rows — lost. These are the canonical web-app message store and they only existed in Postgres. The 90-day Redis TTL covers the bot side, not the web-side conversation tree.

Net: people kept their accounts and most of their recent conversations, but lost web-side scene context older than the Redis window. We comped the affected users. The cost of the incident wasn’t the rows — it was the trust dent and the day-and-a-half of recovery work.

Root causes (plural — they always are)

  1. The schema interaction (sequence retry + ON CONFLICT DO UPDATE + range-DELETE) was never verified end-to-end before any of it touched production. Each piece was a fine pattern in isolation. The interaction was lethal. A single INSERT of id = -91111101 in staging followed by one OAuth signup, then a SELECT id FROM users ORDER BY id LIMIT 5, would have shown the sequence had jumped to the test neighbourhood. Nobody ran it. This is the primary cause and the one I lost the most sleep over.
  2. Test data was distinguished from real data by ID range, not by an attribute. A range is something a BETWEEN query can sweep. An attribute is something a WHERE auth_source = 'test' query cannot accidentally trip over.
  3. Test-user seeding used INSERT ON CONFLICT (id) DO UPDATE. This silently overwrote real OAuth users when their IDs collided, instead of raising. Pure INSERT would have failed loudly and surfaced the collision days before the DELETE.
  4. The cleanup script had no dry-run, no safety check, no assertion of expected row count.
  5. Backups were daily, not continuous, and the most recent one pre-dated the corrupting writes. WAL archiving was on the “soon” list and hadn’t shipped.

Any one of these five would have saved us; we had all five wrong.

The contract we now run

1. Test users have an attribute, not a range

ALTER TABLE users ADD COLUMN auth_source text NOT NULL DEFAULT 'oauth';
-- backfill: 'telegram' for positive Telegram IDs, 'oauth' for legacy negative,
-- 'test' for known test rows that we then deleted via the new path.
CREATE INDEX users_auth_source_idx ON users(auth_source);
scripts/test_user_factory.py
TEST_ID_RANGE = (1_000_000_001, 1_999_999_999) # high *positive* — out of all real paths
def create_test_user() -> int:
"""Create a test user with a high positive ID and explicit auth_source='test'."""
user_id = _next_test_id() # picks next free ID in TEST_ID_RANGE
db.execute(
"INSERT INTO users (id, auth_source, ...) VALUES (%s, 'test', ...)",
(user_id, ...),
)
return user_id

Cleanup is now keyed on the attribute:

scripts/test_user_cleanup.py
def cleanup_test_users(dry_run: bool = True) -> int:
rows = db.fetchall("SELECT id FROM users WHERE auth_source = 'test'")
if dry_run:
print(f"Would delete {len(rows)} test users")
return len(rows)
db.execute("DELETE FROM users WHERE auth_source = 'test'")
return len(rows)

The script defaults to dry_run=True. The CLI flag to actually run it is explicit and shows the count first.

2. Forbidden operations, documented

No range-DELETE on users.id for any reason. No INSERT ... ON CONFLICT (id) DO UPDATE on users.id (avoids id-collision-as-overwrite). These are written into the engineering CLAUDE.md and reviewed for at every PR that touches the users table.

3. Backup cadence with explicit RPO

We rebuilt the backup story around explicit recovery point objectives. Off-site is Storj S3-compatible storage (~7 GB total volume, ~$0.03/month — cost is not the constraint):

Backup tierCadenceDestinationRPO
Postgres pg_dump (logical)HourlyLocal disk≤ 1 h
Postgres pg_dump (logical)DailyStorj S3≤ 24 h
Off-site cold copyWeeklyStorj S3≤ 7 d
Redis snapshot (RDB)Every 6 hLocal + Storj≤ 6 h

WAL archiving to S3-compatible storage is still pending — that’s the next item. With it, RPO drops to seconds. Without it, hourly logical dumps are the floor.

4. Recovery rehearsal, not just backups

A backup you’ve never restored from is a hope, not a backup. We restore from yesterday’s hourly dump into a scratch container monthly, just to make sure the path still works. The first time we tried, the restore script had bit-rotted and didn’t compile.

Lessons

  1. Verify the partition scheme end-to-end before any destructive script touches prod. “Run the query without DELETE, in staging, against real data, and read the results” is thirty seconds of work. It is also the only thing that would have caught this.
  2. Range-based partitioning of test vs real data is an accident waiting to happen. Use attributes. Filter on them. Index them.
  3. Default cleanup scripts to dry-run. Make the destructive flag explicit and noisy.
  4. Assert expected counts. If the cleanup script suddenly finds 10× the usual rows, that is the signal to stop.
  5. Pick an RPO, then pick a backup cadence that meets it. Not the other way around.
  6. Restore from your backups on a schedule. Untested backups silently rot.

We’ve run the new contract for two weeks now. No range-DELETE incidents. The new auth_source = 'test' filter is boring and explicit and impossible to fat-finger. Boring is the goal.


Related: persistent-memory architecture · ChromaDB 0.5 memory leak fix · prompt caching measured.

Related Articles

Ready to Meet Your Companion?

Free: 20 messages/day. Premium starts at $4.99/mo.

Chat in Browser Telegram Bot