Incident Response
This guide covers common production incidents and how to respond. For the announcement-day-specific runbook, see Announcement Day Runbook.
Severity tiers
| Tier | Criteria | Response target |
|---|---|---|
| P0 | Complete outage — auth, battles, or DB unreachable | Immediate (15 min) |
| P1 | Partial degradation — one subsystem down, workaround exists | 1 hour |
| P2 | Non-critical feature broken, no data loss | Next business day |
| P3 | Cosmetic / minor UX issue | Best effort |
1. Auth or database unreachable (P0)
Symptoms: 503/5xx from any API route; Supabase Studio unreachable; auth callbacks failing.
Steps:
- Check Supabase Status page (linked from your project dashboard).
- Check connection pool usage:
SUPABASE_URL/rest/v1/rpc/fn_health_check. - If the DB is up but connections are saturated: scale the connection pool or restart the platform-api pod.
- If fully down: activate the kill-switch for BYOK/autonomy to reduce load:sqlOr use the admin UI at
UPDATE platform.system_flags SET enabled = false WHERE key = 'autonomy_dispatch_enabled'; UPDATE platform.system_flags SET enabled = false WHERE key = 'public_battles_enabled';/admin/kill-switch. - Post a status update to GitHub Discussions →
#announcements.
Recovery: Verify fn_health_check returns { "status": "ok" } before re-enabling flags.
2. Cron job stale (P1)
Symptoms: pnpm health:cron exits non-zero; dispatched workflows not executing on schedule; webhook outbox backing up.
Check:
pnpm health:cronRecovery:
- SSH into or exec on the Supabase server.
- Confirm
pg_cronis loaded:SELECT cron.job_run_details ORDER BY start_time DESC LIMIT 10; - If jobs are defined but not running, restart the pg_cron worker via the Supabase dashboard → "Scheduled Jobs".
- If the job definition is missing, re-register it:sql
SELECT cron.schedule('dispatch-scheduled-workflows', '* * * * *', $$SELECT execution.fn_dispatch_scheduled_workflows()$$);
3. BYOK streaming failures (P1)
Symptoms: lf run exec or cloud BYOK battles returning provider errors; high 5xx on /rpc/fn_workflow_* routes.
Diagnosis:
- Test with a minimal local battle:
lf battle local run --example haiku-shootout. - Check the provider key validity:
lf byok-key list. - Verify provider quota and rate limits in the provider's dashboard (OpenAI, Anthropic, Fal, etc.).
- Check
BYOK_PROVIDERSenv var is set correctly in the cloud deployment.
Kill-switch: Disable BYOK on the platform level:
UPDATE platform.system_flags SET enabled = false WHERE key = 'public_battles_enabled';4. ELO / ranking dispute (P2)
Symptoms: User reports unexpected ELO change; leaderboard ordering looks wrong.
Diagnosis:
SELECT * FROM reputation.lenser_scores WHERE lenser_id = '<uuid>';
SELECT * FROM reputation.contender_ratings WHERE battle_id = '<battle_uuid>';Recovery: ELO updates are applied by reputation.fn_update_elo_after_vote. Check the trigger exists:
SELECT trigger_name FROM information_schema.triggers
WHERE event_object_schema = 'battles'
AND event_object_table = 'battle_votes';If the trigger is missing, reapply the migration. No manual ELO adjustment — explain and re-run if needed.
5. High 404 rate on docs short-links (P2)
Symptoms: /r/<slug> returning 404; analytics shows 404 spike.
Recovery:
pnpm gen-shortlinksThen redeploy the docs site. Check tools/gen-shortlinks.mjs LINKS map if a slug is missing.
6. CLI telemetry endpoint errors (P3)
Symptoms: Users report LF_TELEMETRY=opt-in causing errors on lf commands.
Recovery: Telemetry is fire-and-forget and should never affect command exit codes. If it does, check apps/cli/src/lib/telemetry.ts — the recordEvent function must swallow all errors. As an immediate mitigation, users can unset LF_TELEMETRY:
unset LF_TELEMETRYPost-incident
After every P0 or P1 incident:
- Write a brief retro in GitHub Discussions →
#incident-retroswithin 72 hours. - Update this document if the incident revealed a missing scenario.
- File a GitHub Issue tagged
p0-*orp1-*for any follow-up hardening work.
Escalation contacts
| Role | Contact |
|---|---|
| Maintainer | @ofcskn on GitHub |
| Security issues | lets@conectlens.com (see Security Policy) |
| Provider outages | Provider status pages (OpenAI, Anthropic, Fal) |