Cogniscape captures development activity metadata — who did what, when, and why — not source code. Every webhook payload from GitHub, Linear, and Claude Code passes through a multi-stage sanitization pipeline that strips code blocks, diff hunks, and raw payloads before anything reaches the knowledge graph. In rare cases, the MCP Reader may return content that resembles code. This is not stored code leaking out — it is your own LLM reconstructing code-like patterns from the semantic descriptions stored in the graph. This page explains exactly how that works and why your source code remains safe.
Cogniscape stores semantic descriptions of what happened in your codebase — never the code itself.

How we process developer activity

Every event that enters Cogniscape passes through four stages before it reaches the knowledge graph. Each stage reduces the payload to only the semantic information needed for developer intelligence.
Stage 1: Webhook Reception
  GitHub / Linear / Claude Code → raw webhook payload received

Stage 2: Normalization
  Raw payload → structured DevEvent model
  Only selected fields are mapped (title, action, developer, timestamps)
  The full raw payload is tagged for exclusion

Stage 3: Sanitization (Episode Payload Builder)
  Fenced code blocks (```...```) → removed
  Inline code (`...`) → removed
  diff_hunk fields → excluded
  raw payload → excluded
  Sensitive fields (customer_key, commit_sha) → excluded

Stage 4: Knowledge Graph Ingestion
  Sanitized JSON → Graphiti LLM → entities, facts, and episodes
  The LLM extracts semantic meaning: "who did what to what, and why"
  Output: natural-language descriptions, not code
Every typed event (pull requests, reviews, comments, issues, pushes) has a dedicated builder that explicitly selects which fields to include. Unknown event types fall back to a generic builder that still excludes all known noisy and sensitive fields.
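As a rough sketch of Stage 2 (normalization), the mapping from a webhook payload to a structured event could look like the following. The `DevEvent` field names, the `normalize_pull_request` helper, and the choice of `sender.login` as the developer are assumptions made for illustration, not Cogniscape's actual model; only the GitHub payload keys (`action`, `sender.login`, `pull_request.title`, `pull_request.created_at`) come from GitHub's webhook format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DevEvent:
    """Hypothetical normalized event; the real model's fields may differ."""
    source: str
    event_type: str
    action: str
    developer: str
    title: str
    created_at: str
    # Full webhook payload, kept apart so later stages can exclude it wholesale.
    raw: dict[str, Any] = field(default_factory=dict, repr=False)


def normalize_pull_request(payload: dict[str, Any]) -> DevEvent:
    """Map a GitHub pull_request webhook payload onto the selected fields only."""
    pr = payload.get("pull_request", {})
    return DevEvent(
        source="github",
        event_type="pull_request",
        action=payload.get("action", ""),
        developer=payload.get("sender", {}).get("login", ""),
        title=pr.get("title", ""),
        created_at=pr.get("created_at", ""),
        raw=payload,
    )


# Made-up payload for illustration.
sample = {
    "action": "opened",
    "sender": {"login": "alice"},
    "pull_request": {
        "title": "Add retry logic to payment service",
        "created_at": "2024-01-01T00:00:00Z",
        "body": "Long description...",
    },
}
event = normalize_pull_request(sample)
print(event)
```

Only the explicitly mapped fields become attributes of the event. Everything else stays inside `raw`, which a later stage can drop in one step.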

What we store vs. what we don’t

The table below shows exactly which fields from a GitHub pull request event are kept and which are discarded.

Pull request events

| Field | Stored | Example of what reaches the graph |
|---|---|---|
| Developer | Yes | "alice" |
| Repository | Yes | "acme/backend" |
| Action | Yes | "opened", "merged" |
| PR number | Yes | 42 |
| Title | Yes | "Add retry logic to payment service" |
| State | Yes | "open", "closed" |
| Branch names | Yes | "feat/retry-payments" |
| Labels | Yes | ["bug", "priority:high"] |
| Assignees / Reviewers | Yes | ["bob", "carol"] |
| PR body (description) | Sanitized | Code blocks and inline code removed; surrounding text kept |
| Diff / changed files content | No | Never captured |
| Raw webhook payload | No | Excluded by _ALWAYS_NOISY_FIELDS |
| customer_key | No | Excluded |

Review and comment events

| Field | Stored | Notes |
|---|---|---|
| Review state | Yes | "approved", "changes_requested" |
| Review body | Sanitized | Code blocks removed |
| Comment body | Sanitized | Code blocks removed |
| diff_hunk | No | Contains actual code diffs; always excluded |
| File path | Yes | "src/payments/retry.ts" (path only, not content) |

Push events

| Field | Stored | Notes |
|---|---|---|
| Branch | Yes | "main" |
| Commit messages | Yes | Human-written text describing intent |
| Commit SHAs | No | Excluded |
| File lists (added/modified/removed) | No | Excluded |
| File contents | No | Never captured by GitHub webhooks |
Commit messages are written by developers and may occasionally reference code patterns. Cogniscape stores them as-is because they represent developer intent, not source code.

Understanding LLM-reconstructed content

This is the most important section of this document. Even with all code stripped from stored data, you may occasionally see what looks like source code in an MCP Reader response. Here is why.

What the knowledge graph actually contains

When Cogniscape processes a sanitized episode, the Graphiti engine uses an LLM to extract entities and facts in natural language. For example, from a pull request review that discusses a timestamp bug fix, the graph might store:
Entities:
  • addNotification: “A helper function that captures the current ID before incrementing to ensure correct timestamp alignment in notification creation.”
  • currentId: “A variable used to generate sequential notification IDs and corresponding timestamp offsets.”
Facts:
  • “The addNotification helper was introduced to fix an off-by-one bug in the MSW handler where the template literal evaluated before the ID increment.”
These are natural-language descriptions. There is no stored code.

How code-like content appears in responses

When you query the MCP Reader — for example, asking “What technical issues were found in the notifications PR?” — the following happens:
  1. The MCP Reader searches the knowledge graph and retrieves relevant entities and facts
  2. These results are passed to your LLM (the one powering your Claude Code, Claude Desktop, or other MCP client)
  3. Your LLM synthesizes a response from the semantic descriptions
Because the entity names are code identifiers (addNotification, currentId) and the fact descriptions are detailed enough to convey the logic, your LLM can reconstruct plausible code as part of its response. It is doing what LLMs do — generating the most helpful answer from the context it received.
The code in such responses is generated by your own LLM at query time, not retrieved from the Cogniscape database. It may not even match your actual implementation — it is the LLM’s best interpretation of the semantic descriptions.
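The three steps above can be illustrated with a toy model. This is only a sketch: `GraphResult`, `search`, and `build_context` are hypothetical names, the keyword match stands in for the real graph search, and the stored texts are copied from the example entities and fact earlier on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphResult:
    kind: str  # "entity" or "fact"
    name: str  # identifier for entities, empty for facts
    text: str  # natural-language description


# Stand-in for the knowledge graph, using the example content from this page.
GRAPH = [
    GraphResult("entity", "addNotification",
                "A helper function that captures the current ID before incrementing "
                "to ensure correct timestamp alignment in notification creation."),
    GraphResult("entity", "currentId",
                "A variable used to generate sequential notification IDs and "
                "corresponding timestamp offsets."),
    GraphResult("fact", "",
                "The addNotification helper was introduced to fix an off-by-one bug "
                "in the MSW handler where the template literal evaluated before the "
                "ID increment."),
]


def search(keyword: str) -> list[GraphResult]:
    """Step 1 (toy version): case-insensitive keyword match over names and texts."""
    needle = keyword.lower()
    return [r for r in GRAPH if needle in r.name.lower() or needle in r.text.lower()]


def build_context(results: list[GraphResult]) -> str:
    """Step 2: the text handed to the client's LLM, one line per result."""
    lines = []
    for r in results:
        prefix = f"{r.name}: " if r.name else ""
        lines.append(f"[{r.kind}] {prefix}{r.text}")
    return "\n".join(lines)


results = search("notification")
context = build_context(results)
print(context)
```

Step 3 happens entirely on the client side. The context contains identifiers and prose only, so any code in the final answer has to be produced by the client's LLM.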

A concrete example

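Here is what is stored in the graph for this example. Both the entity summary and the fact are plain natural-language text; any code that appears in a response is generated by your LLM from descriptions like these: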
{
  "entity": "addNotification",
  "summary": "A helper function that captures the current ID before incrementing to ensure correct timestamp alignment."
}
Fact: "The template literal was previously causing potential misalignment due to how the ID was being incremented inside the function."

Technical deep dive

For technical leaders who want to verify these claims, this section provides implementation details.

The sanitization regex

All text fields that may contain user-authored content (PR bodies, review bodies, comment bodies) are processed by a code-stripping function before storage:
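import re

# Matches fenced code blocks (```...```) and inline code (`...`).
# The fenced alternative is listed first so whole blocks are consumed before inline matching.
_CODE_BLOCK_RE = re.compile(r"```[\s\S]*?```|`[^`\n]+`")

def _strip_code_blocks(text: str) -> str:
    """Remove fenced and inline code, then trim leading and trailing whitespace."""
    return _CODE_BLOCK_RE.sub("", text).strip()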
This removes:
  • Fenced code blocks with any language annotation (```js, ```python, etc.)
  • Inline code wrapped in single backticks
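To see the effect on a realistic input, the function can be run against a review comment. The comment text below is invented for illustration; the regex and function are the ones shown above.

```python
import re

# Same pattern as above: fenced blocks first, then single-backtick inline code.
_CODE_BLOCK_RE = re.compile(r"```[\s\S]*?```|`[^`\n]+`")


def _strip_code_blocks(text: str) -> str:
    return _CODE_BLOCK_RE.sub("", text).strip()


# Build the fence indirectly so this example can itself sit inside a fenced block.
fence = "`" * 3

# Hypothetical review comment mixing prose, a fenced block, and inline code.
comment = (
    "The ID is read too late.\n"
    f"{fence}js\nconst id = currentId++;\n{fence}\n"
    "Please capture `currentId` first."
)

cleaned = _strip_code_blocks(comment)
print(repr(cleaned))
# → 'The ID is read too late.\n\nPlease capture  first.'
```

The surrounding prose survives, while both the fenced block and the inline identifier are dropped. The double space after "capture" marks where the inline code was removed.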

Fields excluded at the builder level

The episode payload builder maintains an exclusion list that applies to every event type, including unknown/future event types:
_ALWAYS_NOISY_FIELDS = frozenset({
    "customer_id",
    "customer_key",
    "raw",           # Full webhook payload — always excluded
    "commit_sha",
    "event_id",
    "github_delivery_id",
    "session_id",
    "tags",
    "task_id",
    "timestamp",
})
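A minimal sketch of how such an exclusion list might be applied by a generic builder follows. The `build_generic_payload` function and the sample event are hypothetical, and the sketch assumes the excluded names appear as top-level keys of the event.

```python
from typing import Any

_ALWAYS_NOISY_FIELDS = frozenset({
    "customer_id", "customer_key", "raw", "commit_sha", "event_id",
    "github_delivery_id", "session_id", "tags", "task_id", "timestamp",
})


def build_generic_payload(event: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical generic builder: drop every always-excluded key, keep the rest."""
    return {k: v for k, v in event.items() if k not in _ALWAYS_NOISY_FIELDS}


# Made-up event of a type that has no dedicated builder.
event = {
    "event_type": "deployment_status",
    "developer": "alice",
    "customer_key": "ck_123",
    "commit_sha": "abc123",
    "raw": {"anything": "full webhook payload"},
}

payload = build_generic_payload(event)
print(payload)
# → {'event_type': 'deployment_status', 'developer': 'alice'}
```

Because the list is a denylist, it acts as a safety net rather than the primary control; the dedicated builders described next take the stricter approach of selecting fields explicitly.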

Per-event-type builders

Each supported event type has a dedicated builder function that explicitly selects only the fields needed for semantic understanding. For example, the pull request builder includes title, action, state, developer, branch, and a sanitized body — but never includes diff content, file contents, or raw payloads. Event types without a dedicated builder fall back to a generic builder that applies the exclusion list above and passes remaining fields through. This ensures that even new, unmapped event types never leak raw payloads or sensitive identifiers.
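The difference between the two paths can be sketched as an allowlist builder plus a denylist fallback. All names here (`_build_pull_request`, `_build_generic`, `_BUILDERS`, `build_episode_payload`) are hypothetical, the exclusion list is abbreviated, and the real builders may select different fields.

```python
import re
from typing import Any, Callable

Payload = dict[str, Any]

_CODE_BLOCK_RE = re.compile(r"```[\s\S]*?```|`[^`\n]+`")
# Abbreviated copy of the exclusion list, to keep the sketch short.
_ALWAYS_NOISY_FIELDS = frozenset({"raw", "customer_key", "commit_sha"})


def _build_pull_request(event: Payload) -> Payload:
    """Allowlist: only the named fields are copied, and the body is sanitized."""
    body = _CODE_BLOCK_RE.sub("", event.get("body") or "").strip()
    return {
        "title": event.get("title"),
        "action": event.get("action"),
        "state": event.get("state"),
        "developer": event.get("developer"),
        "branch": event.get("branch"),
        "body": body,
    }


def _build_generic(event: Payload) -> Payload:
    """Denylist fallback for event types without a dedicated builder."""
    return {k: v for k, v in event.items() if k not in _ALWAYS_NOISY_FIELDS}


# Hypothetical registry keyed by event type.
_BUILDERS: dict[str, Callable[[Payload], Payload]] = {
    "pull_request": _build_pull_request,
}


def build_episode_payload(event_type: str, event: Payload) -> Payload:
    return _BUILDERS.get(event_type, _build_generic)(event)


pr_event = {
    "title": "Add retry logic to payment service",
    "action": "opened",
    "state": "open",
    "developer": "alice",
    "branch": "feat/retry-payments",
    "body": "Wraps the call in `retry()`.",
    "diff": "@@ -1 +1 @@",
    "raw": {"full": "payload"},
}
pr_payload = build_episode_payload("pull_request", pr_event)
other_payload = build_episode_payload("unknown_type", {"developer": "bob", "raw": {}})
print(pr_payload["body"])   # → Wraps the call in .
print(other_payload)        # → {'developer': 'bob'}
```

With the allowlist, a field such as `diff` is dropped simply because it was never selected, so a new field added to a webhook cannot reach the graph by accident.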

Verifying with a direct database query

If you have access to the Neo4j database, you can verify that no source code is stored by running:
// Search for common code patterns across all episodes
MATCH (e:Episodic)
WHERE e.content CONTAINS 'function('
   OR e.content CONTAINS '=>'
   OR e.content CONTAINS 'const '
   OR e.content CONTAINS 'import '
RETURN e.name, substring(e.content, 0, 200) AS preview
LIMIT 10
For properly sanitized data, this query returns either zero results or results where the matched terms appear in natural-language context (e.g., a commit message saying “add retry function”), not in code syntax.

Summary

| Layer | Protection |
|---|---|
| Webhook reception | Only selected event types are accepted; others are rejected |
| Normalization | Raw payload stored in a field that is excluded from ingestion |
| Sanitization | Code blocks, inline code, diff hunks, and sensitive fields are stripped |
| Knowledge graph | LLM extracts natural-language entities and facts, not code |
| MCP Reader | Returns semantic search results; any code in the final response is generated by the client’s own LLM |

Questions?

If you have questions about how Cogniscape handles your data, contact us at [email protected].