📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/01-memory — clone,
docker compose up, chat with the demo bot on Telegram. Every code snippet below is a live pull from that repo; if you change code there, this article updates on the next build.
Most AI chatbots still struggle with reliable, queryable long-term recall. Character.AI has pinned and chat memories, but unpinned details can still fall out of the active conversation context. Replika remembers profile facts, preferences, and generated memories, but that is not the same as semantic recall over the full conversation. Even ChatGPT’s Memory is built for useful preferences and details, not verbatim replay of long sessions.
HoneyChat was built for practical persistent memory — not just the current conversation, but older facts and events surfaced when they matter. Here’s the architecture that worked well for this use case.
TL;DR
- Hot layer (Redis) — recent messages per conversation, short TTL, low-latency reads.
- Cold layer (ChromaDB) holds summaries of chunks, not individual messages. Every N bot turns, a background task summarizes that window via a cheap LLM and stores the summary as a document.
- On every user message, three retrieval paths fire in parallel via asyncio.gather: the recent buffer, the latest summary, and a top-K semantic search. All three are assembled into the system prompt.
- Result: substantially fewer tokens than full-history replay, while still making old context retrievable weeks later.
Run it yourself in 5 minutes
Before the architectural deep-dive, get the demo running so you can poke the memory layers live. Everything below happens inside tutorial/01-memory.
1. Clone and enter the folder
```shell
git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/01-memory
```

2. Configure two tokens

```shell
cp .env.example .env
```

Open .env and fill in:

- TELEGRAM_BOT_TOKEN — get it from @BotFather (30 seconds: /newbot, pick a name, copy the token).
- OPENROUTER_API_KEY — from openrouter.ai/keys. The default LLM_MODEL is a free-tier Llama 3.1 8B, so you don’t spend a cent.
3. Start the stack
```shell
docker compose up --build -d
docker compose logs -f bot   # watch the bot come alive
```

Four containers come up: redis, chromadb, api (the FastAPI inspector on localhost:8000), and bot (your Telegram bot, polling).
4. Talk to your bot
Open the bot in Telegram (whichever name you gave it), hit /start, then chat
for 10–20 turns. Tell it things about yourself. Change subjects. Come back
later in the day and reference something you said earlier — it’ll pull it
from ChromaDB.
5. Peek at what each layer holds
```shell
# Replace 12345 with your own Telegram user ID (find it by messaging @userinfobot)
curl http://localhost:8000/memory/12345/demo/recent | jq
curl http://localhost:8000/memory/12345/demo/summary | jq
```

recent shows the raw Redis buffer; summary shows the latest ChromaDB document. After ~10 turns the summary appears — that’s the background writer compressing old context.
Clear everything if you want a fresh start:
```shell
curl -X POST http://localhost:8000/memory/12345/demo/clear
```

With the demo running, the rest of this post explains what you just booted — what each piece does and why.
Why rolling summaries alone aren’t enough
A common pattern — every N messages, regenerate a compressed version of older context — is lossy in a specific way: nuance dies in repeated compression.
Turn 1: "She said she hates her boss because he takes credit for her work"
Turn 2 summary: "User mentioned workplace frustration with manager"
Turn 3 summary: "User has job-related stress"
Turn 4 summary: "User has a job"

By turn 4, the reason is gone. A companion bot starts sounding generic. The fix: keep raw recent messages verbatim and only summarize chunks that are genuinely old, while staying able to semantically retrieve any summary from the full history when the current conversation calls back to it.
Architecture
Two independent layers. Writes to Redis are synchronous on every turn; writes to ChromaDB are asynchronous, batched every N turns. Reads from both happen in parallel on every message.
The hot layer — Redis
Each (user_id, character_id) conversation is stored as a bounded Redis list:
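A minimal sketch of that write path, assuming an asyncio Redis client; MAX_MESSAGES and TTL_SECONDS are illustrative values, not the repo's real configuration:

```python
import json

MAX_MESSAGES = 40          # hard bound per conversation -> O(1) memory per user
TTL_SECONDS = 7 * 86400    # inactive conversations evict after a week

async def append_message(r, user_id: int, char_id: int,
                         role: str, content: str) -> None:
    """Append one message to the bounded hot-layer list. `r` is an async Redis client."""
    key = f"chat:{user_id}:{char_id}:messages"
    pipe = r.pipeline()
    pipe.rpush(key, json.dumps({"role": role, "content": content}))
    pipe.ltrim(key, -MAX_MESSAGES, -1)  # keep only the newest N entries
    pipe.expire(key, TTL_SECONDS)       # refresh the TTL on every write
    await pipe.execute()                # rpush + ltrim + expire in one round trip
```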
Three things matter here:
- ltrim on every write. The list is bounded, so memory per user is O(1).
- TTL extended on every write. Inactive users’ history evicts automatically. Configure Redis with allkeys-lru so overflow evicts instead of refusing writes.
- Pipelined writes. rpush + ltrim + expire go out in one round trip.
When the LLM needs context, return a tier-capped slice — lower tiers see a short window, higher tiers see more. The exact numbers are a product decision.
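The slicing itself is trivial; the tier names and window sizes below are made up, since the post deliberately leaves the real numbers as a product decision:

```python
# Hypothetical tier caps -- illustrative only.
TIER_WINDOW = {"free": 6, "plus": 20, "pro": 40}

def context_slice(messages: list[dict], tier: str) -> list[dict]:
    """Return the newest slice of the buffer this tier is allowed to see."""
    window = TIER_WINDOW.get(tier, TIER_WINDOW["free"])
    return messages[-window:]
```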
The cold layer — ChromaDB with summaries, not messages
A tempting implementation is to embed every message and run semantic search over them. Two problems: the index grows linearly with conversation volume (slow queries, big storage), and individual messages are often too short or context-free to retrieve meaningfully.
Instead: embed LLM-generated summaries of chunks.
```python
async def auto_summarize(user_id: int, char_id: int):
    r = get_redis()
    key = f"chat:{user_id}:{char_id}:messages"
    if not await r.exists(key):
        return
    msgs = await get_chat_history(user_id, char_id, limit=N)
    if len(msgs) < MIN_MSGS_FOR_SUMMARY:
        return
    summary = await summarize_history(msgs)
    if not summary:
        return  # don't cache empty on LLM rate-limit
    if not await r.exists(key):
        return  # user cleared mid-call — bail
    await save_memory_summary(user_id, char_id, summary)
```

The summary captures the arc of the chunk — emotional state, key facts — not verbatim text. Each summary is one document in a per-(user, character) ChromaDB collection. Ten weeks of active conversation is maybe 30–50 documents per collection, not tens of thousands.
Retrieval — three paths in parallel
On every user message, three reads fire in parallel — this is the single most important function in the tutorial:
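The shape of that function looks roughly like this. The data sources here (fetch_recent, fetch_summary, search_memories) are illustrative stand-ins, not the repo's actual names:

```python
import asyncio

# Stand-in data sources; in the repo these hit Redis and ChromaDB.
async def fetch_recent(user_id, char_id):
    return [{"role": "user", "content": "hi"},
            {"role": "assistant", "content": "hello!"}]

async def fetch_summary(user_id, char_id):
    return "User introduced themselves earlier."

async def search_memories(user_id, char_id, query, top_k=3):
    return ["User mentioned they work in biotech."]

async def build_context(user_id: int, char_id: int, user_message: str) -> str:
    # All three reads fire concurrently; latency is the max of the three,
    # not the sum.
    recent, summary, hits = await asyncio.gather(
        fetch_recent(user_id, char_id),
        fetch_summary(user_id, char_id),
        search_memories(user_id, char_id, user_message),
    )
    parts = []
    if summary:
        parts.append("Summary of earlier conversation:\n" + summary)
    if hits:
        parts.append("Possibly relevant older context:\n" + "\n".join(hits))
    parts.append("Recent messages:\n" + "\n".join(
        f"{m['role']}: {m['content']}" for m in recent))
    return "\n\n".join(parts)
```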
The fast path for the summary hits Redis; the slow path queries ChromaDB only when the Redis cache has expired, then writes the result back.
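That is a classic cache-aside read. The key name, TTL, and the fetch_from_chroma stand-in below are assumptions, not the repo's exact code:

```python
SUMMARY_TTL = 600  # seconds the cached summary stays hot (illustrative)

async def fetch_from_chroma(user_id: int, char_id: int) -> str:
    # Stand-in for querying the newest summary document in ChromaDB.
    return "User talked about work stress last week."

async def cached_summary(r, user_id: int, char_id: int) -> str:
    """`r` is an async Redis client."""
    key = f"chat:{user_id}:{char_id}:summary"
    hit = await r.get(key)                    # fast path: Redis
    if hit:
        return hit.decode() if isinstance(hit, bytes) else hit
    summary = await fetch_from_chroma(user_id, char_id)  # slow path: ChromaDB
    if summary:
        await r.setex(key, SUMMARY_TTL, summary)  # write back for the next read
    return summary or ""
```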
Production issues that came up
Double-summarize race. Two concurrent messages for the same (user, character) pair both trigger summarization, writing overlapping summaries. Fix: track the pending task per key and cancel it when a new one fires.
User clears history mid-summarize. Re-check r.exists(key) before writing the summary.
Empty summaries cached. Guard if summary: before setex.
ChromaDB collection doesn’t exist for new users. col.query raises; wrap in try/except and return empty.
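The guard is a few lines. The collection naming scheme below is illustrative; `client` would be a chromadb client such as PersistentClient:

```python
def safe_semantic_search(client, user_id: int, char_id: int,
                         text: str, top_k: int = 3) -> list[str]:
    """Return top-K summary documents, or [] if the collection doesn't exist yet."""
    try:
        col = client.get_collection(f"memories_{user_id}_{char_id}")
        res = col.query(query_texts=[text], n_results=top_k)
        return res["documents"][0]
    except Exception:
        return []  # new user (no collection) or transient error: no cold context
```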
What would change on a rebuild
- For this workload shape, ChromaDB worked better than pgvector on short-query recall.
- Don’t embed per message by default — the index grows quickly, and recall may not improve.
- Summarize fixed-size windows, not time-based batches.
- Build the summarizer cancellation pattern from day one.
Where this lives
This architecture powers both honeychat.bot (web app) and @HoneyChatAIBot on Telegram. The public reference is in the engineering docs on GitHub — service topology, memory tables, and major flows.
Next in the series: LLM routing per tier via OpenRouter.