> ## Documentation Index
> Fetch the complete documentation index at: https://docs.cogniscape.app/llms.txt
> Use this file to discover all available pages before exploring further.

# Tenant Validation Service Disruption

> Incident report on API and MCP endpoint degradation caused by database lock contention, resolved within 2 hours.

**Date:** March 25, 2026
**Severity:** High
**Status:** Resolved
**Affected Features:** API event ingestion, MCP endpoint, admin dashboard

***

## Summary

We identified and resolved a service disruption affecting all tenant-authenticated endpoints. API event ingestion, MCP tool access, and the admin dashboard experienced a **95% failure rate** for approximately 2 hours.

The root cause was database lock contention in the tenant validation layer. Multiple service containers were competing for exclusive access to a shared local database file, causing nearly all authentication checks to fail. The issue was triggered by increased developer load exceeding the concurrency limits of the previous architecture.

All services were fully restored and the underlying architecture was permanently improved to eliminate this failure mode.

***

## What Was Affected

Customers relying on Cogniscape experienced:

* **Failed event ingestion** — CLI `send-episode` calls returned HTTP 500/502 errors
* **MCP endpoint unavailable** — external MCP clients could not complete the initialization handshake
* **Admin dashboard inaccessible** — the reverse proxy serving `api.cogniscape.app` was offline (pre-existing, unrelated)

<Warning>
  During this incident, monitoring dashboards incorrectly showed all services as "healthy." The healthcheck endpoints were not testing the tenant validation path, masking the actual failure. This has been added to our improvement backlog.
</Warning>

***

## Root Cause

Cogniscape uses a cloud database ([Turso](https://turso.tech)) for tenant metadata — customer keys, session bindings, and usage tracking. To reduce latency, the system previously maintained a **local replica** of this database on the server.

The problem: **five service containers** (API, MCP reader, and three event workers) all shared the same local replica file via a Docker volume. Each connection triggered a synchronization checkpoint that requires an exclusive file lock. With increased developer traffic, these lock requests became near-permanent, causing 95% of tenant validation attempts to fail.

### Why did this happen now?

The number of developers actively sending signals to Cogniscape increased significantly. The previous architecture worked under lighter load because lock contention was intermittent. At higher concurrency, the contention became effectively permanent.

***

## Resolution

We replaced the local database replica with a **direct connection** to the cloud database. This eliminates the shared file entirely — each container connects independently to the remote database without any local state or lock contention.

| Aspect              | Before                                                 | After                      |
| ------------------- | ------------------------------------------------------ | -------------------------- |
| **Connection type** | Local file replica with sync                           | Direct remote connection   |
| **Shared state**    | 5 containers sharing 1 file                            | No shared state            |
| **Lock contention** | Permanent under load                                   | Impossible (no local file) |
| **Latency impact**  | Negligible — database is co-located in the same region | Negligible                 |

This pattern was already proven in production by another service component (analytics), giving us high confidence in the change.

***

## Timeline

| Time (UTC) | Event                                                                  |
| ---------- | ---------------------------------------------------------------------- |
| 13:55      | Last successful event ingestion from CLI clients                       |
| 13:59      | Failures begin — 95% of requests return HTTP 500 or 502                |
| 14:30      | Engineering investigation begins                                       |
| 14:34      | Root cause identified: database lock contention in tenant validation   |
| 15:00      | Fix implemented, tested, and submitted for automated code review       |
| 15:05      | Code review approved                                                   |
| 15:08      | Fix merged to main branch                                              |
| 15:20      | Fix deployed to production — **zero lock errors** confirmed            |
| 15:22      | All services fully operational, including reverse proxy and subdomains |

**Total time to resolution: \~1.5 hours from investigation start**

***

## Impact Summary

| Metric                  | Value                                              |
| ----------------------- | -------------------------------------------------- |
| Duration of degradation | \~2 hours (13:59–15:22 UTC)                        |
| Requests affected       | \~95% of tenant-authenticated requests             |
| Error types observed    | HTTP 500, HTTP 502                                 |
| Data loss               | None — events are queued and retried automatically |
| Customers notified      | Proactively via status update                      |

***

## What We Improved

Beyond the immediate fix, this incident drove several architectural improvements:

<Steps>
  <Step title="Eliminated shared local state">
    All service containers now connect directly to the cloud database. There is no local file to contend over, regardless of how many containers are running.
  </Step>

  <Step title="Removed dead code">
    Cleaned up 6 synchronization calls, a file recovery mechanism, and related error handling — approximately 70 lines of code that are no longer needed.
  </Step>

  <Step title="Restored reverse proxy management">
    The TLS reverse proxy was brought under infrastructure management to prevent silent loss during maintenance operations.
  </Step>
</Steps>

***

## Planned Improvements

* **Healthcheck enhancement** — add tenant database connectivity checks to healthcheck endpoints so monitoring accurately reflects service availability
* **Error rate alerting** — set up automated alerts on tenant validation failure rates to detect degradation before it reaches 95%
* **Deploy pipeline hardening** — enforce that production deploys only pull from the canonical main branch, preventing state divergence

***

## Data & Privacy

No customer data was exposed, leaked, or corrupted during this incident. The failure occurred at the authentication layer — requests that failed never reached the data processing pipeline. Events that failed to ingest during the outage are automatically retried by the CLI client.
