Summary
We identified and resolved a service disruption affecting all tenant-authenticated endpoints. API event ingestion, MCP tool access, and the admin dashboard experienced a 95% failure rate for approximately 2 hours. The root cause was database lock contention in the tenant validation layer. Multiple service containers were competing for exclusive access to a shared local database file, causing nearly all authentication checks to fail. The issue was triggered by increased developer load exceeding the concurrency limits of the previous architecture. All services were fully restored and the underlying architecture was permanently improved to eliminate this failure mode.

What Was Affected
Customers relying on Cogniscape experienced:

- Failed event ingestion — CLI send-episode calls returned HTTP 500/502 errors
- MCP endpoint unavailable — external MCP clients could not complete the initialization handshake
- Admin dashboard inaccessible — the reverse proxy serving api.cogniscape.app was offline (pre-existing, unrelated)
Root Cause
Cogniscape uses a cloud database (Turso) for tenant metadata — customer keys, session bindings, and usage tracking. To reduce latency, the system previously maintained a local replica of this database on the server. The problem: five service containers (API, MCP reader, and three event workers) all shared the same local replica file via a Docker volume. Each connection triggered a synchronization checkpoint that required an exclusive file lock. With increased developer traffic, these lock requests became near-permanent, causing 95% of tenant validation attempts to fail.

Why did this happen now?
The number of developers actively sending signals to Cogniscape increased significantly. The previous architecture worked under lighter load because lock contention was intermittent. At higher concurrency, the contention became effectively permanent.

Resolution
We replaced the local database replica with a direct connection to the cloud database. This eliminates the shared file entirely — each container connects independently to the remote database without any local state or lock contention.

| Aspect | Before | After |
|---|---|---|
| Connection type | Local file replica with sync | Direct remote connection |
| Shared state | 5 containers sharing 1 file | No shared state |
| Lock contention | Permanent under load | Impossible (no local file) |
| Latency impact | Negligible — database is co-located in the same region | Negligible |
Timeline
| Time (UTC) | Event |
|---|---|
| 13:55 | Last successful event ingestion from CLI clients |
| 13:59 | Failures begin — 95% of requests return HTTP 500 or 502 |
| 14:30 | Engineering investigation begins |
| 14:34 | Root cause identified: database lock contention in tenant validation |
| 15:00 | Fix implemented, tested, and submitted for automated code review |
| 15:05 | Code review approved |
| 15:08 | Fix merged to main branch |
| 15:20 | Fix deployed to production — zero lock errors confirmed |
| 15:22 | All services fully operational, including reverse proxy and subdomains |
Impact Summary
| Metric | Value |
|---|---|
| Duration of degradation | ~2 hours (13:59–15:22 UTC) |
| Requests affected | ~95% of tenant-authenticated requests |
| Error types observed | HTTP 500, HTTP 502 |
| Data loss | None — events are queued and retried automatically |
| Customers notified | Proactively via status update |
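The "no data loss" row relies on client-side queue-and-retry behavior. A minimal sketch of such a retry loop follows; the names (send_with_retry, MAX_ATTEMPTS) are hypothetical and not taken from the Cogniscape CLI.

```python
import time

MAX_ATTEMPTS = 5  # hypothetical client-side setting


def send_with_retry(send_episode, event, sleep=time.sleep):
    """Retry transient 5xx failures with exponential backoff.

    send_episode is a stand-in callable that raises RuntimeError on an
    HTTP 500/502 response; the event stays queued until a send succeeds
    or attempts are exhausted.
    """
    for attempt in range(MAX_ATTEMPTS):
        try:
            return send_episode(event)
        except RuntimeError:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            sleep(2 ** attempt)  # 1s, 2s, 4s, ...


# Usage: a sender that fails twice (as during the incident), then succeeds.
calls = []

def flaky(event):
    calls.append(event)
    if len(calls) < 3:
        raise RuntimeError("HTTP 502")
    return "accepted"

result = send_with_retry(flaky, {"episode": 1}, sleep=lambda s: None)
```

A loop of this shape explains why a two-hour outage translates into delayed, not lost, events.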
What We Improved
Beyond the immediate fix, this incident drove several architectural improvements.

Eliminated shared local state
All service containers now connect directly to the cloud database. There is no local file to contend over, regardless of how many containers are running.
Removed dead code
Cleaned up 6 synchronization calls, a file recovery mechanism, and related error handling — approximately 70 lines of code that are no longer needed.
Planned Improvements
- Healthcheck enhancement — add tenant database connectivity checks to healthcheck endpoints so monitoring accurately reflects service availability
- Error rate alerting — set up automated alerts on tenant validation failure rates to detect degradation before it reaches 95%
- Deploy pipeline hardening — enforce that production deploys only pull from the canonical main branch, preventing state divergence
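The error rate alerting item can be as simple as a sliding-window counter over validation outcomes. The sketch below uses illustrative window and threshold values, not Cogniscape's actual alerting configuration.

```python
from collections import deque


class FailureRateAlert:
    """Sliding-window failure-rate monitor (illustrative sketch)."""

    def __init__(self, window=100, threshold=0.5):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, ok: bool) -> bool:
        """Record one tenant-validation outcome; return True when the
        failure rate over the window reaches the alert threshold."""
        self.outcomes.append(ok)
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) >= self.threshold


# Usage: degradation begins mid-stream; the alert fires well before
# the window is all failures.
alert = FailureRateAlert(window=10, threshold=0.5)
fired = False
for ok in [True] * 5 + [False] * 5:
    fired = alert.record(ok) or fired
```

Alerting at a 50% failure rate over a short window would have paged well before the observed 95% steady-state failure rate, closing the 31-minute detection gap in the timeline above.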