47 changes: 47 additions & 0 deletions CONTRIBUTING.md
@@ -61,6 +61,27 @@ Your contribution's README must include these sections:
4. **Expected outcome** — What should the user see when it's working? Be specific.
5. **Troubleshooting** — At least 2-3 common issues and how to fix them.

### Sensitive Text Ingestion Rule

If your contribution imports or captures raw text that will be embedded or stored in Open Brain, it must use the [Sensitive Data Redaction](primitives/sensitive-data-redaction/) primitive.

This applies to:
- Email importers
- Chat export importers
- Social, blog, and document importers
- Bulk capture pipelines that ingest third-party text

This does not apply to:
- Dashboards
- Schema-only contributions
- Metadata backfills or analytics jobs that do not ingest new raw text

If this rule applies, your contribution must:
- Add `"requires_primitives": ["sensitive-data-redaction"]` to `metadata.json`
- Link the primitive in the README
- Apply redaction before embeddings and before database insert
- Default the redaction pass to on, even if you expose an explicit opt-out flag

### Visual Formatting Requirements

These patterns are required for **extensions** and strongly recommended for all other contributions. They match the [Getting Started guide](docs/01-getting-started.md) and make guides scannable, beginner-friendly, and consistent across the repo.
@@ -237,6 +258,30 @@ Example for a recipe that depends on a reusable skill:
}
```

Example for a raw-text ingestion recipe that depends on the redaction primitive:

```json
{
"name": "Email History Import",
"description": "Import your Gmail history into Open Brain as searchable thoughts.",
"category": "recipes",
"author": {
"name": "Your Name",
"github": "your-github-username"
},
"version": "1.0.0",
"requires": {
"open_brain": true,
"services": ["Gmail API"],
"tools": ["Deno"]
},
"requires_primitives": ["sensitive-data-redaction"],
"tags": ["email", "gmail", "import"],
"difficulty": "intermediate",
"estimated_time": "30 minutes"
}
```

## PR Format

**Title:** `[category] Short description`
@@ -298,3 +343,5 @@ Every PR is checked against these rules. All must pass before human review.
13. **Internal links** — All relative links in READMEs resolve to existing files
14. **Remote MCP pattern** — Extensions and integrations must use remote MCP via Supabase Edge Functions. No `claude_desktop_config.json`, no local Node.js stdio servers. See the [Getting Started guide](docs/01-getting-started.md) for the correct pattern
15. **Tool audit link** — Extensions and integrations must link to the [MCP Tool Audit & Optimization Guide](docs/05-tool-audit.md) in their README. This ensures users are aware of tool surface area management as they add capabilities

For ingestion contributions, human review will also check that the [Sensitive Data Redaction](primitives/sensitive-data-redaction/) primitive is declared and applied before embeddings/storage.
1 change: 1 addition & 0 deletions primitives/README.md
@@ -11,6 +11,7 @@ Primitives are reusable concept guides that show up in multiple extensions. Lear
| [Common Troubleshooting](troubleshooting/) | Solutions for connection, deployment, and database issues | All extensions |
| [Row Level Security](rls/) | PostgreSQL policies for multi-user data isolation | Extensions 4, 5, 6 |
| [Shared MCP Server](shared-mcp/) | Giving others scoped access to parts of your brain | Extension 4 |
| [Sensitive Data Redaction](sensitive-data-redaction/) | Pre-ingest masking and skipping of secrets before storage or embeddings | Email History Import, Obsidian Vault Import |

## How Primitives Work

97 changes: 97 additions & 0 deletions primitives/sensitive-data-redaction/README.md
@@ -0,0 +1,97 @@
# Sensitive Data Redaction

> A standard pre-ingest pass for masking or skipping sensitive strings before external text is embedded or stored in Open Brain.

## What It Is

Sensitive Data Redaction is the baseline safety layer for ingestion contributions. Its job is simple: preserve the useful context of imported text while removing exact strings that create unnecessary risk if they land in embeddings, stored content, logs, exports, or downstream AI retrieval.

This primitive is not a full enterprise DLP system. It is a deterministic, maintainable default for a solo-operator stack. When a contribution imports raw external or user-authored text into Open Brain, it should run this pass before embedding and before database insert.

## Why It Matters

Most imported content is valuable because of its meaning, not because it contains exact credentials or high-risk identifiers. An email that says a client shared a production Stripe key is useful memory. The exact Stripe key is not useful memory. It is a liability.

That distinction is the policy:

- Keep semantic context.
- Remove exact secrets.
- Skip content entirely when the payload is too dangerous to keep, such as private key blocks.

This protects the obvious high-risk cases without turning Open Brain into a sterile archive. Your AI still remembers what happened. It just does not keep live credentials around when a placeholder will do.

## What Must Require This Primitive

Any recipe, integration, or extension that imports, syncs, scrapes, forwards, summarizes, or bulk-captures raw text before storage or embedding must declare this primitive in `metadata.json` and link it in its README.

That includes:

- Email and inbox importers
- Chat export importers
- Social, blog, and document importers
- Automated capture pipelines that ingest raw third-party text

That does not include:

- Dashboards
- Schema-only contributions
- Analytics or metadata backfills that do not ingest new raw text

## How It Works

The primitive ships a canonical `patterns.json` file with deterministic regex rules and two actions:

- `redact`: replace the exact sensitive string with a placeholder such as `[REDACTED_API_KEY]`
- `skip`: reject the content entirely because partial masking is not enough

The intended pipeline is:

1. Normalize and clean imported text.
2. Run sensitive-data redaction.
3. If the content is marked `skip`, do not embed or insert it.
4. If the content is redacted, embed and store the redacted version.
5. Record redaction labels/counts in metadata when helpful.
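
The pipeline above can be sketched as a single deterministic pass over the rule list. This is an illustrative sketch, not the primitive's actual implementation: the `Rule` shape mirrors the fields in `patterns.json`, but the function and type names are assumptions.

```typescript
// Sketch of the redact/skip pass. Rule fields mirror patterns.json;
// applyRedaction and Result are illustrative names, not a shipped API.
type Rule = {
  label: string;
  action: "redact" | "skip";
  pattern: string;
  flags?: string;       // e.g. "i"; "g" is added internally
  placeholder?: string; // required for "redact" rules
};

type Result = { skipped: boolean; text: string; redactions: number };

function applyRedaction(text: string, rules: Rule[]): Result {
  let out = text;
  let redactions = 0;
  for (const rule of rules) {
    // Fresh RegExp per rule; global flag so every occurrence is handled.
    const re = new RegExp(rule.pattern, (rule.flags ?? "") + "g");
    if (rule.action === "skip") {
      // Partial masking is not enough: reject the whole content item.
      if (re.test(out)) return { skipped: true, text: "", redactions };
    } else {
      out = out.replace(re, () => {
        redactions++;
        return rule.placeholder ?? "[REDACTED]";
      });
    }
  }
  return { skipped: false, text: out, redactions };
}
```

A caller would run this after normalization and drop any `skipped` result before the embedding step, storing `redactions` in metadata when it is nonzero.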

## Common Patterns

### Redact In Place

Use redaction for API keys, bearer tokens, connection strings with embedded credentials, SSNs, reset links, and other exact strings that create blast radius if retrieved verbatim later.

### Skip Entire Content

Use skip rules for private key blocks and similar payloads where storing a partially masked version still creates too much risk or too little value.

## Step-by-Step Guide

1. Add `"requires_primitives": ["sensitive-data-redaction"]` to the contribution metadata.
2. Link this primitive in the contribution README prerequisites or ingestion section.
3. Apply the policy before embeddings and before database insert.
4. Default the redaction pass to on. If you expose an opt-out flag, make it explicit and clearly marked as not recommended.
5. Log what happened. At minimum, report redacted counts and skipped items so users can sanity-check imports.

## Expected Outcome

An ingestion contribution that uses this primitive keeps the useful meaning of imported content while masking exact secrets. Users can still search and retrieve context, but high-risk strings do not get embedded or stored verbatim by default. A dry run should make it obvious what would be redacted and what would be skipped.

## Troubleshooting

**Issue: The scanner flags a false positive**
Solution: Keep the rule set deterministic and conservative. If a specific importer needs an override flag, expose one explicitly and document the tradeoff.

**Issue: A recipe fails because `patterns.json` is missing**
Solution: The contribution depends on this primitive. Keep the repo structure intact, or copy the `primitives/sensitive-data-redaction/` folder alongside the recipe when running it standalone.

**Issue: Users complain that too much context is removed**
Solution: The rule set should bias toward placeholder replacement, not blanket deletion. If a rule is dropping useful content, change it from `skip` to `redact` or tighten the regex.
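
As a hypothetical illustration (this rule is not shipped in `patterns.json`), a rule that skipped any content mentioning a vendor token could be narrowed to a `redact` rule that only masks the exact assignment:

```json
{
  "label": "Vendor token assignment (hypothetical)",
  "action": "redact",
  "placeholder": "[REDACTED_SECRET]",
  "pattern": "vendor_token\\s*[:=]\\s*[A-Za-z0-9]{16,}",
  "flags": "i"
}
```

The surrounding sentence survives with a placeholder instead of the whole email being dropped.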

## Extensions That Use This

- Today the policy is already wired into [Email History Import](../../recipes/email-history-import/) and [Obsidian Vault Import](../../recipes/obsidian-vault-import/).
- Future ingestion-focused extensions should use this primitive as their default policy layer.

## Further Reading

- [Contributing Guide](../../CONTRIBUTING.md)
- [Email History Import](../../recipes/email-history-import/)
- [Obsidian Vault Import](../../recipes/obsidian-vault-import/)
20 changes: 20 additions & 0 deletions primitives/sensitive-data-redaction/metadata.json
@@ -0,0 +1,20 @@
{
"name": "Sensitive Data Redaction",
"description": "A standard pre-ingest redaction pass for secrets and high-risk identifiers before external text is embedded or stored in Open Brain.",
"category": "primitives",
"author": {
"name": "Nate B. Jones",
"github": "NateBJones"
},
"version": "1.0.0",
"requires": {
"open_brain": true,
"services": [],
"tools": []
},
"tags": ["security", "privacy", "redaction", "ingestion", "sensitive-data"],
"difficulty": "intermediate",
"estimated_time": "20 minutes",
"created": "2026-04-01",
"updated": "2026-04-01"
}
93 changes: 93 additions & 0 deletions primitives/sensitive-data-redaction/patterns.json
@@ -0,0 +1,93 @@
{
"version": "1.0.0",
"rules": [
{
"label": "Private key block",
"action": "skip",
"pattern": "-----BEGIN [A-Z ]+ PRIVATE KEY-----",
"flags": "i"
},
{
"label": "OpenAI or OpenRouter API key",
"action": "redact",
"placeholder": "[REDACTED_API_KEY]",
"pattern": "sk-(?:or-v1-|proj-|live-)?[A-Za-z0-9_-]{20,}"
},
{
"label": "Stripe secret key",
"action": "redact",
"placeholder": "[REDACTED_API_KEY]",
"pattern": "sk_(?:live|test)_[A-Za-z0-9]{16,}"
},
{
"label": "Google API key",
"action": "redact",
"placeholder": "[REDACTED_API_KEY]",
"pattern": "AIza[0-9A-Za-z\\-_]{35}"
},
{
"label": "JWT token",
"action": "redact",
"placeholder": "[REDACTED_JWT]",
"pattern": "eyJ[A-Za-z0-9_-]{10,}\\.[A-Za-z0-9._-]{10,}\\.[A-Za-z0-9._-]{10,}"
},
{
"label": "GitHub token",
"action": "redact",
"placeholder": "[REDACTED_GITHUB_TOKEN]",
"pattern": "gh(?:p|s|o|u|r)_[A-Za-z0-9]{20,}"
},
{
"label": "Slack token",
"action": "redact",
"placeholder": "[REDACTED_SLACK_TOKEN]",
"pattern": "xox(?:b|p|a|o|r|s)-[A-Za-z0-9-]{10,}"
},
{
"label": "AWS access key",
"action": "redact",
"placeholder": "[REDACTED_AWS_ACCESS_KEY]",
"pattern": "AKIA[0-9A-Z]{16}"
},
{
"label": "Supabase secret key",
"action": "redact",
"placeholder": "[REDACTED_SUPABASE_SECRET]",
"pattern": "sb_secret_[A-Za-z0-9]+"
},
{
"label": "Bearer token",
"action": "redact",
"placeholder": "[REDACTED_BEARER_TOKEN]",
"pattern": "Bearer\\s+[A-Za-z0-9._~+\\/-]{20,}",
"flags": "i"
},
{
"label": "Database connection string with credentials",
"action": "redact",
"placeholder": "[REDACTED_DB_CREDENTIALS]",
"pattern": "(?:postgres|postgresql|mysql|mongodb|redis):\\/\\/[^\\s:@/]+:[^\\s@/]+@",
"flags": "i"
},
{
"label": "Generic secret assignment",
"action": "redact",
"placeholder": "[REDACTED_SECRET]",
"pattern": "(?:password|passwd|secret|token|api[_-]?key|apikey|api[_-]?secret|access[_-]?token|auth[_-]?token)\\s*[:=]\\s*[\"']?[A-Za-z0-9_\\-./]{12,}",
"flags": "i"
},
{
"label": "URL token parameter",
"action": "redact",
"placeholder": "[REDACTED_URL_SECRET]",
"pattern": "https?:\\/\\/[^\\s]+[?&](?:token|code|access_token|refresh_token|api_key|apikey|auth|sig|signature)=[^\\s&#]+",
"flags": "i"
},
{
"label": "US social security number",
"action": "redact",
"placeholder": "[REDACTED_SSN]",
"pattern": "\\b\\d{3}-\\d{2}-\\d{4}\\b"
}
]
}
2 changes: 1 addition & 1 deletion recipes/README.md
@@ -2,7 +2,7 @@

https://github.com/user-attachments/assets/9454662f-2648-4928-8723-f7d52e94e9b8

Step-by-step builds that add a new capability to your Open Brain. Follow the instructions, run the code, get a new feature. Some recipes depend on canonical skill packs in [`skills/`](../skills/), and raw-text ingestion recipes may also depend on primitives in [`primitives/`](../primitives/) such as [Sensitive Data Redaction](../primitives/sensitive-data-redaction/).

| Recipe | What It Does |
| ------ | ------------ |
1 change: 1 addition & 0 deletions recipes/_template/README.md
@@ -11,6 +11,7 @@
- Working Open Brain setup ([guide](../../docs/01-getting-started.md))
- List any additional requirements (API keys, tools, services)
- If this recipe depends on a reusable skill from `skills/`, link it here and declare it in `metadata.json` via `requires_skills`
- If this recipe imports raw text for storage or embeddings, link the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive here and declare it in `metadata.json` via `requires_primitives`

## Credential Tracker

1 change: 1 addition & 0 deletions recipes/_template/metadata.json
@@ -12,6 +12,7 @@
"services": [],
"tools": []
},
"requires_primitives": [],
"requires_skills": [],
"tags": ["tag1", "tag2"],
"difficulty": "beginner",
27 changes: 20 additions & 7 deletions recipes/email-history-import/README.md
@@ -17,6 +17,7 @@ Pulls your Gmail history via the Gmail API and loads each email into Open Brain
- Google Cloud project with Gmail API enabled
- Gmail API OAuth credentials (Client ID + Client Secret)
- OpenRouter API key (same one from your Open Brain setup)
- [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive (required for the default pre-ingest masking pass)

## Credential Tracker

@@ -85,6 +86,7 @@ deno run --allow-net --allow-read --allow-write --allow-env pull-gmail.ts --list
| `--dry-run` | off | Preview without ingesting |
| `--list-labels` | off | List all Gmail labels and exit |
| `--ingest-endpoint` | off | Use `INGEST_URL`/`INGEST_KEY` instead of Supabase direct insert |
| `--no-redact` | off | Disable sensitive-data redaction before embedding/storage (not recommended) |

### Ingestion modes

@@ -96,11 +98,12 @@ deno run --allow-net --allow-read --allow-write --allow-env pull-gmail.ts --list

1. **Fetch** emails from Gmail API by label and time window
2. **Extract** body (base64 decode, HTML-to-text, strip quoted replies and signatures)
3. **Redact** sensitive strings via the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive
4. **Filter** out noise (no-reply senders, receipts, auto-generated, <10 words)
5. **Deduplicate** via sync-log (tracks Gmail message IDs already imported)
6. **Embed** content via OpenRouter (`text-embedding-3-small`)
7. **Classify** via LLM (topics, type, people, action items)
8. **Upsert** into Supabase with SHA-256 [content fingerprint dedup](../../recipes/content-fingerprint-dedup/) — re-running produces zero duplicates

### What gets filtered out

@@ -109,13 +112,21 @@ deno run --allow-net --allow-read --allow-write --allow-env pull-gmail.ts --list
- Emails with <10 words after cleanup
- Quoted replies and email signatures are stripped before ingestion

### What gets redacted or skipped

By default, the importer runs a pre-ingest redaction pass before embeddings and storage. It masks high-risk strings such as API keys, bearer tokens, connection strings with embedded credentials, SSNs, and similar values that create unnecessary blast radius if stored raw.

Some payloads are too risky to keep even with partial masking. If the importer detects a private key block, it skips that email entirely instead of storing a redacted version.

If you copied this recipe folder out of the repo, keep the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive folder with it. The script reads the canonical `patterns.json` from that primitive at runtime.
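
That lookup can be sketched as a small resolver that prefers the repo layout and falls back to a copy kept beside the recipe. The function name and candidate paths are assumptions for illustration, not the importer's actual code.

```typescript
// Hypothetical helper: find patterns.json, trying the repo layout first,
// then a sibling copy. `exists` is injected so the logic stays testable.
function resolvePatternsPath(exists: (p: string) => boolean): string | null {
  const candidates = [
    "../../primitives/sensitive-data-redaction/patterns.json", // repo layout
    "./sensitive-data-redaction/patterns.json",                // copied alongside
  ];
  for (const p of candidates) {
    if (exists(p)) return p;
  }
  // Fail loudly rather than silently ingesting unredacted text.
  return null;
}
```

In a Deno script the `exists` callback would wrap a filesystem check, and a `null` result should abort the import.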

## Expected Outcome

Each imported email becomes one row in the `thoughts` table:
- `content`: Email body with context prefix (`[Email from X | Subject: Y | Date: Z]`)
- `embedding`: 1536-dim vector for semantic search (truncated to 8K chars)
- `metadata`: LLM-extracted topics, type, people, action items, plus `source: "gmail"`, `gmail_id`, `gmail_labels`, `gmail_thread_id`, and redaction metadata when any replacements were applied
- `content_fingerprint`: Normalized SHA-256 hash for dedup (see [content fingerprint dedup recipe](../../recipes/content-fingerprint-dedup/))

## Troubleshooting

@@ -126,3 +137,5 @@
**Re-running imports the same emails:** The `sync-log.json` file tracks imported Gmail IDs. Delete it to re-import everything. Content fingerprints provide a second layer of dedup at the database level.

**Embedding/metadata errors:** Verify your `OPENROUTER_API_KEY` has credits. The script calls OpenRouter for both embedding generation and metadata extraction.

**Redaction policy file missing:** Keep the repo structure intact, or copy `primitives/sensitive-data-redaction/` alongside this recipe folder. The importer reads that primitive's `patterns.json` at runtime.
3 changes: 2 additions & 1 deletion recipes/email-history-import/metadata.json
@@ -12,9 +12,10 @@
"services": ["Gmail API"],
"tools": ["Deno"]
},
"requires_primitives": ["sensitive-data-redaction"],
"tags": ["email", "gmail", "import", "history"],
"difficulty": "intermediate",
"estimated_time": "30 minutes",
"created": "2026-03-10",
"updated": "2026-04-01"
}