Date: March 25, 2026
Severity: High
Status: Resolved
Affected Features: API event ingestion, MCP endpoint, admin dashboard

Summary

We identified and resolved a service disruption affecting all tenant-authenticated endpoints. API event ingestion, MCP tool access, and the admin dashboard experienced a 95% failure rate for approximately 1.5 hours. The root cause was database lock contention in the tenant validation layer. Multiple service containers were competing for exclusive access to a shared local database file, causing nearly all authentication checks to fail. The issue was triggered by increased developer load exceeding the concurrency limits of the previous architecture. All services were fully restored and the underlying architecture was permanently improved to eliminate this failure mode.

What Was Affected

Customers relying on Cogniscape experienced:
  • Failed event ingestion — CLI send-episode calls returned HTTP 500/502 errors
  • MCP endpoint unavailable — external MCP clients could not complete the initialization handshake
  • Admin dashboard inaccessible — the reverse proxy serving api.cogniscape.app was offline (pre-existing, unrelated)
During this incident, monitoring dashboards incorrectly showed all services as “healthy.” The healthcheck endpoints were not testing the tenant validation path, masking the actual failure. This has been added to our improvement backlog.

Root Cause

Cogniscape uses a cloud database (Turso) for tenant metadata — customer keys, session bindings, and usage tracking. To reduce latency, the system previously maintained a local replica of this database on the server. The problem: five service containers (API, MCP reader, and three event workers) all shared the same local replica file via a Docker volume. Each connection triggered a synchronization checkpoint that requires an exclusive file lock. With increased developer traffic, these lock requests became near-permanent, causing 95% of tenant validation attempts to fail.
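The failure mode can be reproduced in miniature. In this sketch, a plain SQLite file stands in for the local replica (this is illustrative, not Cogniscape's actual code): once one connection holds the exclusive lock, any other connection sharing the same file fails its own lock attempt.

```python
import os
import sqlite3
import tempfile

# Two connections share one database file, like two containers sharing
# one replica via a Docker volume. The first takes an exclusive lock
# (as a sync checkpoint would); the second then fails to acquire it.
db_path = os.path.join(tempfile.mkdtemp(), "tenant-replica.db")

writer = sqlite3.connect(db_path, timeout=0.1, isolation_level=None)
writer.execute("CREATE TABLE tenants (api_key TEXT)")
writer.execute("BEGIN EXCLUSIVE")           # checkpoint-style exclusive lock

reader = sqlite3.connect(db_path, timeout=0.1, isolation_level=None)
try:
    reader.execute("BEGIN EXCLUSIVE")       # a second "container" contends
    contended = False
except sqlite3.OperationalError:            # "database is locked"
    contended = True

writer.rollback()
print(contended)  # True: the second connection never acquired the lock
```

With five containers instead of two, and sync checkpoints firing on every connection, the lock is held almost continuously, which is why validation failures approached 95% rather than appearing as occasional timeouts.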

Why did this happen now?

The number of developers actively sending signals to Cogniscape increased significantly. The previous architecture worked under lighter load because lock contention was intermittent. At higher concurrency, the contention became effectively permanent.

Resolution

We replaced the local database replica with a direct connection to the cloud database. This eliminates the shared file entirely — each container connects independently to the remote database without any local state or lock contention.
Aspect | Before | After
Connection type | Local file replica with sync | Direct remote connection
Shared state | 5 containers sharing 1 file | No shared state
Lock contention | Permanent under load | Impossible (no local file)
Latency impact | Negligible (database is co-located in the same region) | Negligible
This pattern was already proven in production by another service component (analytics), giving us high confidence in the change.

Timeline

Time (UTC) | Event
13:55 | Last successful event ingestion from CLI clients
13:59 | Failures begin — 95% of requests return HTTP 500 or 502
14:30 | Engineering investigation begins
14:34 | Root cause identified: database lock contention in tenant validation
15:00 | Fix implemented, tested, and submitted for automated code review
15:05 | Code review approved
15:08 | Fix merged to main branch
15:20 | Fix deployed to production — zero lock errors confirmed
15:22 | All services fully operational, including reverse proxy and subdomains
Total time to resolution: ~1.5 hours from failure onset (about 50 minutes from investigation start)

Impact Summary

Metric | Value
Duration of degradation | ~1.5 hours (13:59–15:22 UTC)
Requests affected | ~95% of tenant-authenticated requests
Error types observed | HTTP 500, HTTP 502
Data loss | None (events are queued and retried automatically)
Customers notified | Proactively via status update

What We Improved

Beyond the immediate fix, this incident drove several architectural improvements:
1. Eliminated shared local state

All service containers now connect directly to the cloud database. There is no local file to contend over, regardless of how many containers are running.

2. Removed dead code

Cleaned up 6 synchronization calls, a file recovery mechanism, and related error handling: approximately 70 lines of code that are no longer needed.

3. Restored reverse proxy management

The TLS reverse proxy was brought under infrastructure management to prevent silent loss during maintenance operations.

Planned Improvements

  • Healthcheck enhancement — add tenant database connectivity checks to healthcheck endpoints so monitoring accurately reflects service availability
  • Error rate alerting — set up automated alerts on tenant validation failure rates to detect degradation before it reaches 95%
  • Deploy pipeline hardening — enforce that production deploys only pull from the canonical main branch, preventing state divergence
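The healthcheck enhancement above can be sketched as follows. This is illustrative Python, not the actual service code; sqlite3 stands in for the tenant database. The key change is that the endpoint exercises the same dependency real requests use, instead of returning "healthy" unconditionally.

```python
import sqlite3

# Hypothetical deep healthcheck: probe the tenant-database path that
# real requests depend on, so monitoring reflects actual availability.
def healthcheck(conn):
    try:
        conn.execute("SELECT 1").fetchone()   # cheap tenant-DB probe
        return {"status": "healthy"}
    except sqlite3.Error as exc:
        return {"status": "unhealthy", "reason": str(exc)}

conn = sqlite3.connect(":memory:")
healthy = healthcheck(conn)     # DB reachable -> healthy
conn.close()
unhealthy = healthcheck(conn)   # DB unreachable -> unhealthy, with a reason
print(healthy["status"], unhealthy["status"])
```

Had a probe like this been in place during the incident, the dashboards would have flipped to "unhealthy" at 13:59 rather than masking the lock-contention failure.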

Data & Privacy

No customer data was exposed, leaked, or corrupted during this incident. The failure occurred at the authentication layer — requests that failed never reached the data processing pipeline. Events that failed to ingest during the outage are automatically retried by the CLI client.
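As a rough illustration of that retry behavior (the CLI's actual policy and function names are not documented here), a client-side loop might retry 5xx responses with exponential backoff so no event is dropped during a transient outage:

```python
import time

# Hypothetical retry loop: back off exponentially on HTTP 5xx so queued
# events survive a transient outage instead of being dropped.
def send_with_retry(send, event, attempts=5, base_delay=0.01):
    status = None
    for attempt in range(attempts):
        status = send(event)
        if status < 500:                         # success or non-retryable error
            return status
        time.sleep(base_delay * (2 ** attempt))  # 10 ms, 20 ms, 40 ms, ...
    return status                                # still failing: keep it queued

# Simulate an outage that clears after two failed attempts.
responses = iter([502, 500, 200])
status = send_with_retry(lambda event: next(responses), {"episode": "demo"})
print(status)  # 200 once the outage clears
```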