HoneyChat

Character consistency in AI image generation — where prompts break down and LoRA helps

sm1ck · 4 min read

📦 Training template: github.com/sm1ck/honeychat/tree/main/tutorial/03-lora — a generic Kohya SDXL config with <tune> placeholders and a dataset curation guide. No docker-compose (LoRA training is GPU-heavy) — you bring your own GPU or rent one.

Here’s a failure mode many AI companion apps run into on launch day: users send two requests in a row for the same character, get two different faces, and conclude the product is broken. They’re not wrong to feel that way. Character identity is part of the product.

This post is about why that happens, why the obvious fixes often don’t fully solve it, and what class of solution works better. Concrete hyperparameters stay internal — the reference is enough to reproduce the right shape.

TL;DR

  • Identical seed + identical prompt + different batch size = different face. Seeds only help within the same sampler run.
  • Prompt detail plateaus fast. Past a certain tag count, the model interpolates anyway.
  • Reference image (IP-Adapter) works but can bleed stylistic features — outfit, lighting, background — into generations where you only wanted identity.
  • Custom LoRA per character makes identity much more stable by encoding it at the weights level instead of relying only on prompt text.

Train your own character LoRA — the short walkthrough

LoRA training is GPU-heavy and doesn’t belong in a docker-compose, so the tutorial folder at tutorial/03-lora ships the config template and recipe. You bring the GPU.

1. Get a GPU

24 GB VRAM (e.g. RTX 3090/4090) fits SDXL LoRA at batch size 2–4 comfortably. Don’t own one? Rent a spot — Vast.ai, RunPod, Modal, Paperspace, Lambda. A full training run costs a few dollars.

2. Install Kohya_ss

Terminal window
git clone https://github.com/bmaltais/kohya_ss ~/kohya_ss
cd ~/kohya_ss && ./setup.sh

3. Grab the template

Terminal window
cd ~/projects
git clone https://github.com/sm1ck/honeychat
cp -r honeychat/tutorial/03-lora ./my-character-lora
cd my-character-lora

4. Prepare your dataset

Drop 15–30 varied images of your subject into dataset/train/5_character/ (the 5_ is the repeat count). For each image, create a same-named .txt caption describing the scene — not the character. See dataset/README.md for the full curation checklist.
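
The image/caption pairing is easy to get wrong by hand. A minimal stdlib check (the `missing_captions` helper is my own illustration, not part of the tutorial repo) can catch unpaired images before you start a run:

```python
import tempfile
from pathlib import Path

def missing_captions(train_dir):
    """Image files under train_dir that lack a same-named .txt caption."""
    exts = {".png", ".jpg", ".jpeg", ".webp"}
    return sorted(
        img for img in Path(train_dir).rglob("*")
        if img.suffix.lower() in exts and not img.with_suffix(".txt").exists()
    )

# Demo on a throwaway folder shaped like the tutorial layout:
root = Path(tempfile.mkdtemp()) / "dataset" / "train" / "5_character"
root.mkdir(parents=True)
(root / "0001.png").touch()
(root / "0001.txt").write_text("woman in a garden, soft daylight")
(root / "0002.png").touch()  # caption deliberately missing

print([p.name for p in missing_captions(root)])  # ['0002.png']
```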

5. Fill the <tune> slots in kohya-config.toml

Every hyperparameter is a placeholder you pick based on your dataset and base model. Read the inline comments, then replace each <tune> with a real value. The safety check in train.sh will refuse to run if any placeholder remains.
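
train.sh implements that check in shell; the same idea sketched in Python (the function name and exact marker list are my assumptions, not necessarily the repo's implementation):

```python
import tempfile
from pathlib import Path

def unfilled_placeholders(config_path):
    """(line_number, line) pairs that still contain a template placeholder."""
    markers = ("<tune>", "<path/to/", "<your_character")
    lines = Path(config_path).read_text().splitlines()
    return [(i, ln.strip()) for i, ln in enumerate(lines, 1)
            if any(m in ln for m in markers)]

# Demo with a two-line config, one slot left unfilled:
cfg = Path(tempfile.mkdtemp()) / "kohya-config.toml"
cfg.write_text('learning_rate = "<tune>"\ntrain_batch_size = 2\n')
print(unfilled_placeholders(str(cfg)))  # [(1, 'learning_rate = "<tune>"')]
```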

6. Train

Terminal window
export KOHYA_DIR=~/kohya_ss
bash train.sh

The checkpoint lands at ./output/<your-character>.safetensors. Load it into ComfyUI or Diffusers like any other SDXL LoRA. Generate a test grid, iterate, retrain if needed.

The rest of this post explains why this pipeline shape works and what breaks when you try to shortcut it.

Why “same prompt, same face” doesn’t hold

Three reasons.

Batch size changes the output. batch_size=1 vs batch_size=4 with the same seed produce different images for position 0, because the RNG state depends on the batch dimension.
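
The divergence is easy to see with a seeded generator: a wider batch consumes more of the random stream at every sampler step, so what sample 0 receives at later steps no longer matches the single-image run. A toy numpy model of ancestral per-step noise (the function is illustrative, not any real sampler):

```python
import numpy as np

def noise_for_sample0(seed, batch, steps):
    """Noise that sample 0 receives at each sampler step, for a given batch size."""
    rng = np.random.default_rng(seed)
    received = []
    for _ in range(steps):
        # ancestral samplers draw per-step noise for the whole batch at once
        step_noise = rng.standard_normal((batch, 4))
        received.append(step_noise[0])  # slice out what sample 0 sees
    return np.stack(received)

solo = noise_for_sample0(seed=42, batch=1, steps=3)
batched = noise_for_sample0(seed=42, batch=4, steps=3)

print(np.allclose(solo[0], batched[0]))  # True  — the very first draw matches
print(np.allclose(solo[1], batched[1]))  # False — the streams have diverged
```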

Provider-side sampler drift. Managed APIs update samplers and models over time. Your previously stable character can drift across weeks.

Prompt detail saturates. Adding more tags (“sharp nose, narrow eyes, specific mole position”) doesn’t help past a point. The model has a rough template and interpolates.

The in-between fix that doesn’t quite work: IP-Adapter

IP-Adapter lets you pass a reference image alongside the prompt. For product photography (render this dress on a model), it can be excellent. For character identity, it has a practical drawback: IP-Adapter can carry stylistic baggage. A reference photo with specific lighting, pose, or outfit can bleed those into generations where you only wanted the face. Turn the weight down and identity may degrade. Turn it up and the reference can dominate.

IP-Adapter is a good fit when the reference is what you want preserved (product catalog — next post). It’s usually a poor fit when what you want preserved is only the face.

The solution: custom LoRA per character

A LoRA (Low-Rank Adaptation) is a small set of additional weights on top of a base model. A character-specific LoRA trained on a curated dataset — consistent face, varied pose/outfit/lighting — encodes the identity into the weights.

image_gen.py (illustrative)
workflow = [
    "Checkpoint",            # base SDXL model
    f"LoRA: {char.lora}",    # the character's custom LoRA
    "FreeU",                 # quality touch-up
    "KSampler",              # actual diffusion
]

Every image of Anna is much more likely to stay Anna across poses, outfits, and lighting changes.

Training — public-friendly template

Using the publicly available Kohya_ss SDXL trainer, the training config lives in the tutorial repo — every hyperparameter is a <tune> placeholder you fill in for your subject and base model:

tutorial/03-lora/kohya-config.toml
# Kohya_ss SDXL LoRA training config — generic template
#
# Replace every `<tune>` value based on your dataset and base model.
# Replace every `<path/to/...>` with your actual paths.
# See Kohya docs for the full parameter reference:
# https://github.com/bmaltais/kohya_ss/wiki/LoRA-training-guide
[model_arguments]
pretrained_model_name_or_path = "<path/to/sdxl-base-or-finetune.safetensors>"
v2 = false
v_parameterization = false

[dataset_arguments]
train_data_dir = "./dataset/train" # folders must be named like "5_character_name"
resolution = "1024,1024"
enable_bucket = true
min_bucket_reso = 256
max_bucket_reso = 2048
caption_extension = ".txt"
shuffle_caption = true
keep_tokens = 1

[training_arguments]
output_dir = "./output"
output_name = "<your_character_v1>"
save_model_as = "safetensors"
save_every_n_epochs = 1
save_precision = "fp16"
logging_dir = "./logs"
# ── Training steps and batch — VRAM-bound. Tune for your hardware.
learning_rate = "<tune>" # e.g. 5e-5 .. 1e-4 for SDXL LoRA
max_train_steps = "<tune>" # e.g. 1000 .. 3000 depending on dataset size
train_batch_size = "<tune>" # 1 fits 12GB, 2–4 wants 24GB+
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 50
optimizer_type = "AdamW8bit"
mixed_precision = "bf16"
gradient_accumulation_steps = 1
gradient_checkpointing = true
max_grad_norm = 1.0
# ── LoRA shape. Higher dim = more capacity = more overfitting risk.
[network_arguments]
network_module = "networks.lora"
network_dim = "<tune>" # typical: 8 .. 64
network_alpha = "<tune>" # often = network_dim or network_dim/2

[sample_arguments]
sample_every_n_epochs = 1
sample_prompts = "./sample-prompts.txt"
sample_sampler = "euler_a"

The parameters that matter — LR, step count, rank, alpha, dataset size — are subject-dependent. Anime faces converge differently than realistic faces. There is no universal “best” setting.

What to optimize for:

  • Dataset quality over size. 20 clean, varied, captioned images beat 100 messy ones.
  • Varied pose and lighting, constant face.
  • Clean captions. Describe the scene, not the character. “Woman in a garden” is better than “Anna in a garden” so the model learns the face from context.
  • Rank sized for face detail. Too low underfits; too high overfits and kills flexibility.

Marginal cost: usually manageable

Training one character LoRA on a rented or owned GPU is usually measured in minutes to hours of compute, depending on dataset and settings. Inference with the LoRA attached often adds little overhead compared with the base generation. At scale, the per-character cost is dominated by dataset curation rather than training compute.

Production concerns

LoRA hot-swapping. Load the base checkpoint once, swap LoRAs per request. ComfyUI and Diffusers both support this natively.
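
With Diffusers the pattern is a few calls. The sketch below assumes a recent diffusers version with the PEFT-backed multi-adapter API; the model path, LoRA filenames, and adapter names are placeholders, and it needs a GPU plus downloaded weights to actually run:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the base checkpoint once, at worker startup.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Register each character's LoRA under its own adapter name.
pipe.load_lora_weights("./output", weight_name="anna_v1.safetensors", adapter_name="anna")
pipe.load_lora_weights("./output", weight_name="mia_v1.safetensors", adapter_name="mia")

# Per request: activate only the requested character, then generate.
pipe.set_adapters(["anna"], adapter_weights=[0.8])
image = pipe("woman in a garden, soft daylight", num_inference_steps=25).images[0]
```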

Dataset hygiene. LoRAs memorize whatever’s in the dataset. Enforce licensing upstream — the LoRA is downstream of the decision.

Storage at scale. LoRA file size depends on base model and rank; expect anything from a few MB to much larger checkpoints. Object storage + hot-LoRA pinning on inference workers keeps latency down.

Face ≠ body. Include full-body shots in the dataset if you need full-body consistency. Expect iteration.

What would change on a rebuild

  • Ship the LoRA pipeline from day 1.
  • Curate datasets manually; don’t scrape.
  • Store base-model version with each LoRA asset — needed for migration when the base updates.
  • Version LoRAs (v1, v2) and keep old versions live for per-character rollback.

Where this lives

HoneyChat uses custom LoRA per character for image and video identity. The pipeline runs on dedicated GPU workers and feeds both the Telegram bot and the web app. Public architecture reference: github.com/sm1ck/honeychat.

Previous: LLM routing per tier. Next: IP-Adapter Plus for a product catalog.

FAQ

Doesn't seed-pinning solve character consistency?

Not reliably. Identical seed plus identical prompt with different batch sizes can produce different faces because the RNG state depends on the batch dimension. Seeds help within the same sampler run, but they are not enough for independent generations over time.

What's wrong with IP-Adapter for character identity?

IP-Adapter pulls features from a reference image into cross-attention. At high weight it can preserve the reference aggressively — including face, lighting, and pose — which may drift the character away from its canonical look. It is a good fit for product rendering, but usually a poor fit when you only want face identity.

Is LoRA training expensive at scale?

Usually not, once the pipeline exists. Training a single character LoRA is often measured in minutes to hours of GPU time depending on dataset and settings. Inference with the LoRA attached often adds little overhead compared with the base generation. At scale the dominant cost is dataset curation and QA, not just training compute.

Why is dataset curation more important than dataset size?

A LoRA memorizes exactly what it sees. 20 clean, varied, well-captioned images of the same subject teach the model the identity without overfitting to a pose or lighting. 100 messy images with repeated poses teach the model 'this angle', not 'this character'.

How do you handle full-body vs face-only consistency?

A LoRA trained on face crops can stabilize the face but will not necessarily preserve body proportions. For full-body consistency, include full-body shots in the dataset. It's an iteration loop — train, evaluate on full-body prompts, augment the dataset, retrain.

