HoneyChat is three frontends and one backend in one monorepo, all running on a single 32 GB / 16-core Xeon host:
- Astro (website/): static MDX + RSS, serves the marketing site, blog and SEO landing pages. ~1,000 pages × 20 languages.
- Next.js 15 (web/): SSR, serves the product surfaces: pricing, character profile, chat, profile, payments. The main DAU canvas (ChatRoom.tsx is ~3,000 lines).
- React + Vite Mini App (miniapp/): opens inside the Telegram WebApp client. 20 languages. PWA + service worker.
- FastAPI (api/main.py, uvicorn --workers 4) behind nginx, serving both frontends.
- aiogram Telegram bot (bot/main.py): polling, separate process.
On top of that, the docker-compose.yml defines ~14 services that all need to play nicely on deploy: bot, api, nextjs, celery_worker, celery_beat, gen_worker, lora_worker, cleanup_worker, retention_email_worker, postgres, redis, chromadb, nginx, certbot. Two docker networks: internal (service-to-service) and external (nginx only, ports 80/443).
For about six months, every other deploy broke something. A new Astro build that looked fine locally would 404 in production. A FastAPI rebuild would surface as 502 Bad Gateway for ten minutes. A working Mini App would silently keep serving the old bundle for users with the PWA installed.
None of these were code bugs. They were deploy-pipeline bugs. Specifically, five of them, repeated.
Here’s the contract we run now.
Rule 1: --force-recreate for Python, not restart
docker compose restart api does not pick up a rebuilt image. It stops and starts the existing container, with the existing filesystem. If the image has been rebuilt with new Python code, restart won’t see it; the change ships only when the container is next replaced.
This catches people every release. The new code is on the host, the image is rebuilt, docker compose restart api returns immediately and looks happy, and you spend twenty minutes wondering why your fix didn’t ship.
The right invocation:
    docker compose up -d --force-recreate --no-deps api

We wrapped both forms in make:

    deploy-api:
    	docker compose build api
    	docker compose up -d --force-recreate --no-deps api
    	@sleep 3
    	docker compose restart nginx   # see rule 2

    restart-api:
    	docker compose restart api     # config-only, no code changes

Two targets, two intentions. Use restart-api only for config changes that the running process re-reads on signal.
The same rule applies to every Python service — bot, celery_worker, celery_beat, gen_worker, all four *_workers. We forgot this once for celery_beat after editing the RedBeat schedule and spent half an hour wondering why the new cron entries weren’t firing.
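A quick way to confirm whether a service is in the stale state described above is to compare the image ID the container was created from with the image ID its tag points at now. The sketch below assumes the image is tagged by name (as Compose-built images normally are); the function name is_stale is hypothetical and not part of our Makefile.

```shell
#!/usr/bin/env bash
# Sketch: report whether a Compose service is still running an image older
# than the one its tag currently resolves to. is_stale is a hypothetical name.
# Exit status: 0 = stale, 1 = up to date, 2 = could not determine.
is_stale() {
  local service="$1" cid running_id image_name latest_id
  cid="$(docker compose ps -q "$service")" || return 2
  [ -n "$cid" ] || { echo "no container for $service" >&2; return 2; }
  # Image ID the running container was created from.
  running_id="$(docker inspect --format '{{.Image}}' "$cid")" || return 2
  # Image name recorded in the container's config.
  image_name="$(docker inspect --format '{{.Config.Image}}' "$cid")" || return 2
  # Image ID that name resolves to right now, i.e. after the latest build.
  latest_id="$(docker image inspect --format '{{.Id}}' "$image_name")" || return 2
  [ "$running_id" != "$latest_id" ]
}
```

Running `is_stale api && echo "api needs --force-recreate"` right after a build makes the "restart looked happy but nothing shipped" case visible instead of silent.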
Rule 2: nginx caches upstream DNS at start
When nginx starts, it resolves each upstream’s hostname once and caches the IP for the life of the worker process. If the upstream container is recreated (rule 1), it gets a new IP on the internal docker network. nginx still has the old one.
The symptom is 502 Bad Gateway with host not found in upstream "api" in the nginx error log, even though docker compose ps shows the api container healthy and listening.
The fix is to restart nginx after restarting any service it routes to:
    docker compose up -d --force-recreate --no-deps api
    sleep 3   # let api come up and accept connections
    docker compose restart nginx

The sleep 3 is not superstition: if you restart nginx before api is accepting connections, nginx hits a different failure mode (refused connection) and you have to do it again.
There’s a way around this with resolver directives and set $upstream variables so nginx re-resolves on each request. We tried it. The simpler, deterministic restart pair turned out to be less surprising in practice.
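If a fixed sleep 3 ever proves too short, one alternative is to poll until the upstream actually accepts TCP connections before restarting nginx. This is a minimal sketch, not what our Makefile does: wait_for_port is a hypothetical helper, it relies on bash's /dev/tcp pseudo-device, and it assumes the port is reachable from wherever the script runs (on an internal-only docker network that may mean running it inside a container on that network).

```shell
#!/usr/bin/env bash
# wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Returns 0 as soon as a TCP connection succeeds, 1 once the timeout expires.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-30}" waited=0
  # Bash-only: opening /dev/tcp/HOST/PORT attempts a TCP connection.
  # The subshell keeps the file descriptor from leaking into the caller.
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
}
```

A call such as `wait_for_port 127.0.0.1 8000 30 && docker compose restart nginx` would then replace the sleep; the host and port 8000 here are placeholders, not values from our setup.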
Rule 3: One Makefile target per surface, never docker compose up -d
For a long time, deploying meant some combination of build + up -d + restart nginx and praying. Three commands, executed in the wrong order half the time, rebuilding every service even when only one changed. We replaced it with named targets:
    # These targets share names with directories (website/, web/, miniapp/),
    # so mark them phony or make will report them as already up to date.
    .PHONY: website web miniapp deploy deploy-all

    # Astro static blog: build + sync to nginx volume
    website:
    	cd website && npm run build && cp -r dist/* ../frontend/website/
    	docker compose exec nginx nginx -s reload

    # Next.js product app: build + recreate container
    web:
    	docker compose build nextjs
    	docker compose up -d --force-recreate --no-deps nextjs
    	@sleep 3
    	docker compose restart nginx

    # Mini App: build + sync + bump SW
    miniapp:
    	cd miniapp && npm run build && cp -r dist/* ../frontend/app/dist/
    	./scripts/bump_sw_timestamp.sh   # see rule 5
    	docker compose exec nginx nginx -s reload

    # API + bot + workers: build + recreate + nginx
    deploy:
    	docker compose build api bot
    	docker compose up -d --force-recreate --no-deps api bot celery_worker gen_worker
    	@sleep 3
    	docker compose restart nginx

    # Full release: everything in the right order
    deploy-all: deploy web website miniapp

The order in deploy-all matters: api/bot/workers first (everything else depends on them), then SSR Next.js (depends on api), then static Astro (depends on neither), then Mini App (depends on nothing but has its own SW dance).
Rule 4: Astro static + nginx — there is no atomic swap
Astro builds to website/dist/. We cp -r it into the nginx-mounted frontend/website/ volume. During the copy, some files exist with the new content and some still have the old content. A request that crosses that boundary can get a half-broken page.
For a blog at our traffic level the deploy window is short enough that nobody notices. We accept the trade.
If you can’t accept the trade, the patterns are:
- rsync --delete into a sibling directory and atomic mv. nginx picks up the swap because mv is atomic on the same filesystem.
- Versioned subdirectories with a symlink swap. dist-2026-05-28/ next to a current -> dist-2026-05-28/ symlink. Swap the symlink, reload nginx.
- CDN in front of nginx. Push to origin, purge, the CDN does the swap.
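The symlink-swap pattern can be sketched as follows. We do not run this ourselves, so treat it as an illustration under assumptions: release_swap and the releases directory layout are hypothetical, nginx's root would have to point at the current symlink, and mv -T requires GNU coreutils.

```shell
#!/usr/bin/env bash
# release_swap BUILD_DIR RELEASES_DIR [STAMP]
# Copies BUILD_DIR to RELEASES_DIR/dist-STAMP, then repoints
# RELEASES_DIR/current at it in a single rename.
release_swap() {
  local build_dir="$1" releases_dir="$2"
  local stamp="${3:-$(date +%Y-%m-%d-%H%M%S)}"
  local target="$releases_dir/dist-$stamp"
  mkdir -p "$releases_dir" || return 1
  # The slow part: requests keep hitting the old release while this copies.
  cp -r "$build_dir" "$target" || return 1
  # Build the new symlink under a temporary name, then rename it over
  # "current". rename(2) replaces the old link atomically on one filesystem;
  # -T stops mv from moving the link into the directory "current" points at.
  ln -sfn "dist-$stamp" "$releases_dir/current.tmp" || return 1
  mv -T "$releases_dir/current.tmp" "$releases_dir/current"
}
```

Old dist-* directories stay on disk, so a rollback is another symlink swap; pruning them is left out of the sketch.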
Rule 5: Service workers need a kick
The Mini App has a service worker that caches the bundle. If sw.js doesn’t change between deploys, browsers serve users the cached old bundle indefinitely. Users see what looks like a bug (“the new feature doesn’t show up for me”) that’s actually their SW happily ignoring our deploy.
Two things are needed.
On build: stamp the SW with a timestamp so its bytes change every deploy.
    #!/bin/bash
    SW=frontend/app/dist/sw.js
    STAMP=$(date +%s)
    sed -i "s|^const SW_VERSION = .*|const SW_VERSION = \"$STAMP\";|" "$SW"

On client: a tiny snippet that triggers reload when the SW changes (decision D217 in our internal log).
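    if ("serviceWorker" in navigator) {
      // No controller yet means this page load is the very first install.
      // Initialising from the controller (rather than a constant true) keeps
      // the guard from swallowing the first deploy a returning user sees.
      let firstInstall = !navigator.serviceWorker.controller;
      navigator.serviceWorker.addEventListener("controllerchange", () => {
        if (firstInstall) { firstInstall = false; return; }
        window.location.reload();
      });
    }

The firstInstall guard matters: without it, a brand-new user hitting the site for the first time triggers a reload they didn’t ask for. After the first install, subsequent SW activations are deploys and do warrant a reload.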
We learned both halves the hard way. Stamping without the client listener: bundle changes but users stay on the cached one. Client listener without stamping: SW never changes, listener never fires, nothing reloads.
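One more way the stamping half can fail quietly: sed -i exits 0 even when its pattern matches nothing, for example if a bundler change rewrites or minifies the const SW_VERSION line. A hedged variant of the stamp script that fails the deploy in that case might look like this; stamp_sw is a hypothetical function name, and sed -i without a suffix assumes GNU sed.

```shell
#!/usr/bin/env bash
# stamp_sw SW_FILE [STAMP]
# Rewrites the SW_VERSION line, and refuses to succeed if there is none.
stamp_sw() {
  local sw="$1" stamp="${2:-$(date +%s)}"
  # sed would exit 0 on a file with no matching line, so check first.
  if ! grep -q '^const SW_VERSION = ' "$sw"; then
    echo "no SW_VERSION line found in $sw" >&2
    return 1
  fi
  sed -i "s|^const SW_VERSION = .*|const SW_VERSION = \"$stamp\";|" "$sw"
}
```

Called as `stamp_sw frontend/app/dist/sw.js` from the miniapp target, a non-zero exit stops make before nginx is reloaded.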
Bonus: the external GPU dependency
We’re not strictly self-contained — image generation hits a Vast.ai ComfyUI box over an SSH tunnel. Vast can change a container’s public IP at any time. We run a small watchdog (scripts/vast_watchdog.py, runs every few minutes) that detects an IP change and force-recreates bot, api, celery_worker, and gen_worker so the new tunnel target gets picked up. Three guards in the loop prevent infinite recreation cycles. This is the only deploy-related thing on our cluster that isn’t directly under make.
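To give a feel for what "guards" means here, the sketch below shows three checks a watchdog loop could make before recreating anything. It is purely illustrative: the real watchdog is a Python script, this text does not list its three guards, and should_recreate, its arguments, and the 600-second cooldown are all assumptions.

```shell
#!/usr/bin/env bash
# should_recreate OLD_IP NEW_IP LAST_RECREATE_TS NOW_TS [COOLDOWN_SECONDS]
# Returns 0 only when a recreate looks justified. All three guards are
# hypothetical examples, not the ones in scripts/vast_watchdog.py.
should_recreate() {
  local old_ip="$1" new_ip="$2" last_ts="$3" now_ts="$4" cooldown="${5:-600}"
  # Guard 1: a failed lookup (empty IP) is not an IP change.
  [ -n "$new_ip" ] || return 1
  # Guard 2: nothing to do when the IP is the same as last time.
  [ "$new_ip" != "$old_ip" ] || return 1
  # Guard 3: never recreate twice within the cooldown window.
  [ $((now_ts - last_ts)) -ge "$cooldown" ] || return 1
  return 0
}
```

Only when the function returns 0 would the loop run `docker compose up -d --force-recreate --no-deps bot api celery_worker gen_worker` and record the new IP and timestamp.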
What “deploy” looks like now
    make website      # marketing/blog content push
    make web          # product change in Next.js
    make miniapp      # Mini App change
    make deploy       # backend change (api + bot + workers)
    make deploy-all   # full release in the right order

Each target is the smallest unit that ships safely on its own. The rules above are baked into the targets, not held in someone’s head.
The number of 502 Bad Gateway incidents we’ve shipped since adopting this contract: roughly zero. The number of “users not seeing the new feature” tickets from cached SWs: also roughly zero. The cost was about a day of writing this Makefile and reading nginx docs more carefully.
Lessons
- restart is not up -d --force-recreate. Use the right one for code vs config changes.
- nginx upstream DNS is cached at start. Restart nginx after any service it routes to.
- One Makefile target per deploy surface. Don’t make humans remember the order.
- Static-file deploys aren’t atomic by default. Pick a pattern (symlink swap, rsync atomic, CDN) before you need one.
- Service workers need a stamp and a listener. Otherwise your users are stuck on yesterday’s bundle.
A deploy is not “I pushed the code”, it’s “users are reliably seeing the new code.” The middle distance — running container, served-but-not-rebuilt asset, cached-but-stale SW — is where deploys silently fail.