Summary
We identified and resolved a service disruption affecting all tenant-authenticated endpoints. API event ingestion, MCP tool access, and the admin dashboard experienced a 95% failure rate for approximately 2 hours. The root cause was database lock contention in the tenant validation layer. Multiple service containers were competing for exclusive access to a shared local database file, causing nearly all authentication checks to fail. The issue was triggered by increased developer load exceeding the concurrency limits of the previous architecture. All services were fully restored and the underlying architecture was permanently improved to eliminate this failure mode.

What Was Affected
Customers relying on Cogniscape experienced:

- Failed event ingestion — CLI send-episode calls returned HTTP 500/502 errors
- MCP endpoint unavailable — external MCP clients could not complete the initialization handshake
- Admin dashboard inaccessible — the reverse proxy serving api.cogniscape.app was offline (pre-existing, unrelated)
Root Cause
Cogniscape uses a cloud database (Turso) for tenant metadata — customer keys, session bindings, and usage tracking. To reduce latency, the system previously maintained a local replica of this database on the server. The problem: five service containers (API, MCP reader, and three event workers) all shared the same local replica file via a Docker volume. Each connection triggered a synchronization checkpoint that required an exclusive file lock. With increased developer traffic, these lock requests became near-permanent, causing 95% of tenant validation attempts to fail.

Why did this happen now?
The number of developers actively sending signals to Cogniscape increased significantly. The previous architecture worked under lighter load because lock contention was intermittent. At higher concurrency, the contention became effectively permanent.

Resolution
We replaced the local database replica with a direct connection to the cloud database. This eliminates the shared file entirely — each container connects independently to the remote database without any local state or lock contention.

| Aspect | Before | After |
|---|---|---|
| Connection type | Local file replica with sync | Direct remote connection |
| Shared state | 5 containers sharing 1 file | No shared state |
| Lock contention | Permanent under load | Impossible (no local file) |
| Latency impact | Negligible — database is co-located in the same region | Negligible |
Timeline
| Time (UTC) | Event |
|---|---|
| 13:55 | Last successful event ingestion from CLI clients |
| 13:59 | Failures begin — 95% of requests return HTTP 500 or 502 |
| 14:30 | Engineering investigation begins |
| 14:34 | Root cause identified: database lock contention in tenant validation |
| 15:00 | Fix implemented, tested, and submitted for automated code review |
| 15:05 | Code review approved |
| 15:08 | Fix merged to main branch |
| 15:20 | Fix deployed to production — zero lock errors confirmed |
| 15:22 | All services fully operational, including reverse proxy and subdomains |
Impact Summary
| Metric | Value |
|---|---|
| Duration of degradation | ~2 hours (13:59–15:22 UTC) |
| Requests affected | ~95% of tenant-authenticated requests |
| Error types observed | HTTP 500, HTTP 502 |
| Data loss | None — events are queued and retried automatically |
| Customers notified | Proactively via status update |
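The "no data loss" row relies on client-side queue-and-retry behavior. A minimal sketch of such a retry loop follows; the names (send_with_retry, MAX_ATTEMPTS) are hypothetical and not taken from the Cogniscape CLI.

```python
import time

MAX_ATTEMPTS = 5  # hypothetical client-side setting


def send_with_retry(send_episode, event, sleep=time.sleep):
    """Retry transient 5xx failures with exponential backoff.

    send_episode is a stand-in callable that raises RuntimeError on an
    HTTP 500/502 response; the event stays queued until a send succeeds
    or attempts are exhausted.
    """
    for attempt in range(MAX_ATTEMPTS):
        try:
            return send_episode(event)
        except RuntimeError:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            sleep(2 ** attempt)  # 1s, 2s, 4s, ...


# Usage: a sender that fails twice (as during the incident), then succeeds.
calls = []

def flaky(event):
    calls.append(event)
    if len(calls) < 3:
        raise RuntimeError("HTTP 502")
    return "accepted"

result = send_with_retry(flaky, {"episode": 1}, sleep=lambda s: None)
```

A loop of this shape explains why a two-hour outage translates into delayed, not lost, events.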
What We Improved
Beyond the immediate fix, this incident drove several architectural improvements.

Eliminated shared local state
All service containers now connect directly to the cloud database. There is no local file to contend over, regardless of how many containers are running.
Removed dead code
Cleaned up 6 synchronization calls, a file recovery mechanism, and related error handling — approximately 70 lines of code that are no longer needed.
Planned Improvements
- Healthcheck enhancement — add tenant database connectivity checks to healthcheck endpoints so monitoring accurately reflects service availability
- Error rate alerting — set up automated alerts on tenant validation failure rates to detect degradation before it reaches 95%
- Deploy pipeline hardening — enforce that production deploys only pull from the canonical main branch, preventing state divergence
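The error rate alerting item can be as simple as a sliding-window counter over validation outcomes. The sketch below uses illustrative window and threshold values, not Cogniscape's actual alerting configuration.

```python
from collections import deque


class FailureRateAlert:
    """Sliding-window failure-rate monitor (illustrative sketch)."""

    def __init__(self, window=100, threshold=0.5):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, ok: bool) -> bool:
        """Record one tenant-validation outcome; return True when the
        failure rate over the window reaches the alert threshold."""
        self.outcomes.append(ok)
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) >= self.threshold


# Usage: degradation begins mid-stream; the alert fires well before
# the window is all failures.
alert = FailureRateAlert(window=10, threshold=0.5)
fired = False
for ok in [True] * 5 + [False] * 5:
    fired = alert.record(ok) or fired
```

Alerting at a 50% failure rate over a short window would have paged well before the observed 95% steady-state failure rate, closing the 31-minute detection gap in the timeline above.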