Talk Production Runbook
Service Health Checks
talk-server: GET /health, GET /ready, GET /metrics
agents: GET /healthz, GET /metrics
redis: redis-cli ping
- External dependencies:
- ASR:
DWANI_API_BASE_URL_ASR
- TTS:
DWANI_API_BASE_URL_TTS
- LLM:
DWANI_API_BASE_URL_LLM
Common Incidents
502 from /v1/chat or /v1/speech_to_speech
- Validate upstream service URL env vars.
- Check dependency health (
/ready response details).
- Check
agents health if using mode=agent.
504 timeout from speech endpoint
- Verify ASR/TTS latency and service logs.
- Increase
DWANI_ASR_TIMEOUT/DWANI_TTS_TIMEOUT if model nodes are healthy but slow.
- Inspect queue/load and scale upstream services.
Authentication failures (401)
- Confirm
DWANI_API_KEY and AGENTS_API_KEY values are set consistently.
- Ensure UI/backend clients pass
X-API-Key when auth is enabled.
Session loss after restart
- Confirm Redis container is healthy and reachable from
talk-server.
- Validate
DWANI_REDIS_URL.
Operational Commands
docker compose ps
docker compose logs -f talk
docker compose logs -f agents
docker compose logs -f redis
Rollback
- Set image tags to previous known good versions (
DWANI_TALK_*_TAG).
docker compose pull
docker compose up -d
- Re-run health checks and a
/v1/chat smoke test.