Cogniscape stores semantic descriptions of what happened in your codebase — never the code itself.
How we process developer activity
Every event that enters Cogniscape passes through four stages before it reaches the knowledge graph. Each stage reduces the payload to only the semantic information needed for developer intelligence.
Every typed event (pull requests, reviews, comments, issues, pushes) has a dedicated builder that explicitly selects which fields to include. Unknown event types fall back to a generic builder that still excludes all known noisy and sensitive fields.
What we store vs. what we don’t
The table below shows exactly which fields from a GitHub pull request event are kept and which are discarded.
Pull request events
| Field | Stored | Example of what reaches the graph |
|---|---|---|
| Developer | Yes | "alice" |
| Repository | Yes | "acme/backend" |
| Action | Yes | "opened", "merged" |
| PR number | Yes | 42 |
| Title | Yes | "Add retry logic to payment service" |
| State | Yes | "open", "closed" |
| Branch names | Yes | "feat/retry-payments" |
| Labels | Yes | ["bug", "priority:high"] |
| Assignees / Reviewers | Yes | ["bob", "carol"] |
| PR body (description) | Sanitized | Code blocks and inline code removed; surrounding text kept |
| Diff / changed files content | No | Never captured |
| Raw webhook payload | No | Excluded by _ALWAYS_NOISY_FIELDS |
| customer_key | No | Excluded |
Review and comment events
| Field | Stored | Notes |
|---|---|---|
| Review state | Yes | "approved", "changes_requested" |
| Review body | Sanitized | Code blocks removed |
| Comment body | Sanitized | Code blocks removed |
| diff_hunk | No | Contains actual code diffs; always excluded |
| File path | Yes | "src/payments/retry.ts" (path only, not content) |
Push events
| Field | Stored | Notes |
|---|---|---|
| Branch | Yes | "main" |
| Commit messages | Yes | Human-written text describing intent |
| Commit SHAs | No | Excluded |
| File lists (added/modified/removed) | No | Excluded |
| File contents | No | Never captured by GitHub webhooks |
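
The push-event rules in the table above can be illustrated with a short Python sketch. It assumes the standard GitHub push webhook fields `ref`, `commits`, and `message`; the function name `build_push_episode` and the output keys `branch` and `commit_messages` are hypothetical and are not taken from Cogniscape's code.

```python
from typing import Any, Dict, List


def build_push_episode(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the branch name and commit messages from a push payload.

    Hypothetical sketch: commit SHAs ("id") and the added/modified/removed
    file lists are never copied because they are simply not selected.
    """
    ref = payload.get("ref", "")
    prefix = "refs/heads/"
    # GitHub sends the full ref; keep only the short branch name.
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref

    messages: List[str] = [
        commit.get("message", "") for commit in payload.get("commits", [])
    ]
    return {"branch": branch, "commit_messages": messages}


example_payload = {
    "ref": "refs/heads/main",
    "commits": [
        {
            "id": "abc123",
            "message": "Add retry logic to payment service",
            "added": ["src/payments/retry.ts"],
            "modified": [],
            "removed": [],
        }
    ],
}
episode = build_push_episode(example_payload)
```

Selecting fields by name, rather than deleting unwanted ones, means a new field added to the webhook payload is dropped by default.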
Understanding LLM-reconstructed content
This is the most important section of this document. Even with all code stripped from stored data, you may occasionally see what looks like source code in an MCP Reader response. Here is why.
What the knowledge graph actually contains
When Cogniscape processes a sanitized episode, the Graphiti engine uses an LLM to extract entities and facts in natural language. For example, from a pull request review that discusses a timestamp bug fix, the graph might store:
Entities:
- addNotification: “A helper function that captures the current ID before incrementing to ensure correct timestamp alignment in notification creation.”
- currentId: “A variable used to generate sequential notification IDs and corresponding timestamp offsets.”
Facts:
- “The addNotification helper was introduced to fix an off-by-one bug in the MSW handler where the template literal evaluated before the ID increment.”
How code-like content appears in responses
When you query the MCP Reader (for example, asking “What technical issues were found in the notifications PR?”), the following happens:
- The MCP Reader searches the knowledge graph and retrieves relevant entities and facts
- These results are passed to your LLM (the one powering your Claude Code, Claude Desktop, or other MCP client)
- Your LLM synthesizes a response from the semantic descriptions
If the entity names are code-like (addNotification, currentId) and the fact descriptions are detailed enough to convey the logic, your LLM can reconstruct plausible code as part of its response. It is doing what LLMs do: generating the most helpful answer from the context it received.
The code in such responses is generated by your own LLM at query time, not retrieved from the Cogniscape database. It may not even match your actual implementation — it is the LLM’s best interpretation of the semantic descriptions.
A concrete example
Here is what is stored in the graph versus what your LLM might generate:
- What Cogniscape stores
- What your LLM might generate
Technical deep dive
For technical leaders who want to verify these claims, this section provides implementation details.
The sanitization regex
All text fields that may contain user-authored content (PR bodies, review bodies, comment bodies) are processed by a code-stripping function before storage. It removes:
- Fenced code blocks with any language annotation (js, python, etc.)
- Inline code wrapped in single backticks
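
As a rough illustration of this kind of stripping, here is a minimal Python sketch. It assumes two regular expressions, one for fenced blocks and one for single-backtick spans; the names `strip_code`, `_FENCED_BLOCK`, and `_INLINE_CODE` are hypothetical, and the patterns are not Cogniscape's actual regex.

```python
import re

# Three backticks, built programmatically so this example holds no literal fence.
FENCE = "`" * 3

# Hypothetical patterns, for illustration only.
# A fenced block: opening fence, optional language tag, any content, closing fence.
_FENCED_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
# Inline code: a single-backtick span on one line.
_INLINE_CODE = re.compile(r"`[^`\n]+`")
_EXTRA_SPACES = re.compile(r"[ \t]{2,}")


def strip_code(text: str) -> str:
    """Remove fenced code blocks and inline code, keeping the surrounding prose."""
    text = _FENCED_BLOCK.sub("", text)  # fenced blocks first, language tag included
    text = _INLINE_CODE.sub("", text)   # then single-backtick spans
    # Collapse the gaps left behind by removed inline spans.
    return _EXTRA_SPACES.sub(" ", text).strip()


body = (
    "Fixes the retry bug.\n"
    + FENCE + "python\nprint('retrying')\n" + FENCE
    + "\nSee `retry()` for details."
)
sanitized = strip_code(body)
```

Fenced blocks are removed before inline spans so that the backticks of a fence are never misread as inline code. This sketch leaves an unclosed fence untouched; a production sanitizer would need to decide how to handle that case.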
Fields excluded at the builder level
The episode payload builder maintains an exclusion list that applies to every event type, including unknown/future event types.
Per-event-type builders
Each supported event type has a dedicated builder function that explicitly selects only the fields needed for semantic understanding. For example, the pull request builder includes title, action, state, developer, branch, and a sanitized body, but never includes diff content, file contents, or raw payloads.
Event types without a dedicated builder fall back to a generic builder that applies the exclusion list above and passes remaining fields through. This ensures that even new, unmapped event types never leak raw payloads or sensitive identifiers.
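
The dispatch between dedicated builders and the generic fallback could look roughly like the Python sketch below. The members of `_ALWAYS_NOISY_FIELDS` shown here are assumptions (the list's real contents are not reproduced in this document), and `build_pull_request`, `build_generic`, `build_episode`, and `_sanitize` are hypothetical names.

```python
import re
from typing import Any, Callable, Dict

Event = Dict[str, Any]

# Assumed members, for illustration only.
_ALWAYS_NOISY_FIELDS = frozenset({"raw_payload", "customer_key", "diff_hunk"})

# Placeholder sanitizer pattern: fenced blocks or single-backtick spans.
_CODE = re.compile(r"`{3}.*?`{3}|`[^`\n]+`", re.DOTALL)


def _sanitize(text: Any) -> str:
    """Drop code blocks and inline code from user-authored text."""
    return _CODE.sub("", text or "").strip()


def build_pull_request(event: Event) -> Event:
    """Dedicated builder: an allowlist, so unlisted fields are never copied."""
    return {
        "title": event.get("title"),
        "action": event.get("action"),
        "state": event.get("state"),
        "developer": event.get("developer"),
        "branch": event.get("branch"),
        "body": _sanitize(event.get("body")),
    }


def build_generic(event: Event) -> Event:
    """Fallback builder: pass fields through unless they are on the exclusion list."""
    return {k: v for k, v in event.items() if k not in _ALWAYS_NOISY_FIELDS}


_BUILDERS: Dict[str, Callable[[Event], Event]] = {
    "pull_request": build_pull_request,
}


def build_episode(event_type: str, event: Event) -> Event:
    """Use the dedicated builder when one exists, otherwise the generic one."""
    return _BUILDERS.get(event_type, build_generic)(event)
```

The two builders fail in different directions: the dedicated builder ignores anything it does not name, while the generic builder only blocks what it knows to exclude, which is why the exclusion list has to cover raw payloads and sensitive identifiers.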
Verifying with a direct database query
If you have access to the Neo4j database, you can verify that no source code is stored by querying it directly.
Summary
| Layer | Protection |
|---|---|
| Webhook reception | Only selected event types are accepted; others are rejected |
| Normalization | Raw payload stored in a field that is excluded from ingestion |
| Sanitization | Code blocks, inline code, diff hunks, and sensitive fields are stripped |
| Knowledge graph | LLM extracts natural-language entities and facts, not code |
| MCP Reader | Returns semantic search results; any code in the final response is generated by the client’s own LLM |
Questions?
If you have questions about how Cogniscape handles your data, contact us at [email protected].