47 changes: 47 additions & 0 deletions CONTRIBUTING.md
@@ -61,6 +61,27 @@ Your contribution's README must include these sections:
4. **Expected outcome** — What should the user see when it's working? Be specific.
5. **Troubleshooting** — At least 2-3 common issues and how to fix them.

### Sensitive Text Ingestion Rule

If your contribution imports or captures raw text that will be embedded or stored in Open Brain, it must use the [Sensitive Data Redaction](primitives/sensitive-data-redaction/) primitive.

This applies to:
- Email importers
- Chat export importers
- Social, blog, and document importers
- Bulk capture pipelines that ingest third-party text

This does not apply to:
- Dashboards
- Schema-only contributions
- Metadata backfills or analytics jobs that do not ingest new raw text

If this rule applies, your contribution must:
- Add `"requires_primitives": ["sensitive-data-redaction"]` to `metadata.json`
- Link the primitive in the README
- Apply redaction before embeddings and before database insert
- Default the redaction pass to on, even if you expose an explicit opt-out flag

### Visual Formatting Requirements

These patterns are required for **extensions** and strongly recommended for all other contributions. They match the [Getting Started guide](docs/01-getting-started.md) and make guides scannable, beginner-friendly, and consistent across the repo.
@@ -237,6 +258,30 @@ Example for a recipe that depends on a reusable skill:
}
```

Example for a raw-text ingestion recipe that depends on the redaction primitive:

```json
{
"name": "Email History Import",
"description": "Import your Gmail history into Open Brain as searchable thoughts.",
"category": "recipes",
"author": {
"name": "Your Name",
"github": "your-github-username"
},
"version": "1.0.0",
"requires": {
"open_brain": true,
"services": ["Gmail API"],
"tools": ["Deno"]
},
"requires_primitives": ["sensitive-data-redaction"],
"tags": ["email", "gmail", "import"],
"difficulty": "intermediate",
"estimated_time": "30 minutes"
}
```

## PR Format

**Title:** `[category] Short description`
@@ -298,3 +343,5 @@ Every PR is checked against these rules. All must pass before human review.
13. **Internal links** — All relative links in READMEs resolve to existing files
14. **Remote MCP pattern** — Extensions and integrations must use remote MCP via Supabase Edge Functions. No `claude_desktop_config.json`, no local Node.js stdio servers. See the [Getting Started guide](docs/01-getting-started.md) for the correct pattern
15. **Tool audit link** — Extensions and integrations must link to the [MCP Tool Audit & Optimization Guide](docs/05-tool-audit.md) in their README. This ensures users are aware of tool surface area management as they add capabilities

For ingestion contributions, human review will also check that the [Sensitive Data Redaction](primitives/sensitive-data-redaction/) primitive is declared and applied before embeddings/storage.
1 change: 1 addition & 0 deletions primitives/README.md
@@ -11,6 +11,7 @@ Primitives are reusable concept guides that show up in multiple extensions. Lear
| [Common Troubleshooting](troubleshooting/) | Solutions for connection, deployment, and database issues | All extensions |
| [Row Level Security](rls/) | PostgreSQL policies for multi-user data isolation | Extensions 4, 5, 6 |
| [Shared MCP Server](shared-mcp/) | Giving others scoped access to parts of your brain | Extension 4 |
| [Sensitive Data Redaction](sensitive-data-redaction/) | Pre-ingest masking and skipping of secrets before storage or embeddings | Email History Import, Obsidian Vault Import |

## How Primitives Work

97 changes: 97 additions & 0 deletions primitives/sensitive-data-redaction/README.md
@@ -0,0 +1,97 @@
# Sensitive Data Redaction

> A standard pre-ingest pass for masking or skipping sensitive strings before external text is embedded or stored in Open Brain.

## What It Is

Sensitive Data Redaction is the baseline safety layer for ingestion contributions. Its job is simple: preserve the useful context of imported text while removing exact strings that create unnecessary risk if they land in embeddings, stored content, logs, exports, or downstream AI retrieval.

This primitive is not a full enterprise DLP system. It is a deterministic, maintainable default for a solo-operator stack. When a contribution imports raw external or user-authored text into Open Brain, it should run this pass before embedding and before database insert.

## Why It Matters

Most imported content is valuable because of its meaning, not because it contains exact credentials or high-risk identifiers. An email that says a client shared a production Stripe key is useful memory. The exact Stripe key is not useful memory. It is a liability.

That distinction is the policy:

- Keep semantic context.
- Remove exact secrets.
- Skip content entirely when the payload is too dangerous to keep, such as private key blocks.

This protects the obvious high-risk cases without turning Open Brain into a sterile archive. Your AI still remembers what happened. It just does not keep live credentials around when a placeholder will do.

## What Must Require This Primitive

Any recipe, integration, or extension that imports, syncs, scrapes, forwards, summarizes, or bulk-captures raw text before storage or embedding must declare this primitive in `metadata.json` and link it in its README.

That includes:

- Email and inbox importers
- Chat export importers
- Social, blog, and document importers
- Automated capture pipelines that ingest raw third-party text

That does not include:

- Dashboards
- Schema-only contributions
- Analytics or metadata backfills that do not ingest new raw text

## How It Works

The primitive ships a canonical `patterns.json` file with deterministic regex rules and two actions:

- `redact`: replace the exact sensitive string with a placeholder such as `[REDACTED_API_KEY]`
- `skip`: reject the content entirely because partial masking is not enough

The intended pipeline is:

1. Normalize and clean imported text.
2. Run sensitive-data redaction.
3. If the content is marked `skip`, do not embed or insert it.
4. If the content is redacted, embed and store the redacted version.
5. Record redaction labels/counts in metadata when helpful.
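
The pipeline above can be sketched as a single deterministic pass over the rule list. This is an illustrative sketch, not the primitive's actual implementation: the `Rule` shape mirrors the fields in `patterns.json`, but the function and type names are assumptions.

```typescript
// Sketch of the redact/skip pass. Rule fields mirror patterns.json;
// applyRedaction and Result are illustrative names, not a shipped API.
type Rule = {
  label: string;
  action: "redact" | "skip";
  pattern: string;
  flags?: string;       // e.g. "i"; "g" is added internally
  placeholder?: string; // required for "redact" rules
};

type Result = { skipped: boolean; text: string; redactions: number };

function applyRedaction(text: string, rules: Rule[]): Result {
  let out = text;
  let redactions = 0;
  for (const rule of rules) {
    // Fresh RegExp per rule; global flag so every occurrence is handled.
    const re = new RegExp(rule.pattern, (rule.flags ?? "") + "g");
    if (rule.action === "skip") {
      // Partial masking is not enough: reject the whole content item.
      if (re.test(out)) return { skipped: true, text: "", redactions };
    } else {
      out = out.replace(re, () => {
        redactions++;
        return rule.placeholder ?? "[REDACTED]";
      });
    }
  }
  return { skipped: false, text: out, redactions };
}
```

A caller would run this after normalization and drop any `skipped` result before the embedding step, storing `redactions` in metadata when it is nonzero.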

## Common Patterns

### Redact In Place

Use redaction for API keys, bearer tokens, connection strings with embedded credentials, SSNs, reset links, and other exact strings that create blast radius if retrieved verbatim later.

### Skip Entire Content

Use skip rules for private key blocks and similar payloads where storing a partially masked version still creates too much risk or too little value.

## Step-by-Step Guide

1. Add `"requires_primitives": ["sensitive-data-redaction"]` to the contribution metadata.
2. Link this primitive in the contribution README prerequisites or ingestion section.
3. Apply the policy before embeddings and before database insert.
4. Default the redaction pass to on. If you expose an opt-out flag, make it explicit and clearly marked as not recommended.
5. Log what happened. At minimum, report redacted counts and skipped items so users can sanity-check imports.

## Expected Outcome

An ingestion contribution that uses this primitive keeps the useful meaning of imported content while masking exact secrets. Users can still search and retrieve context, but high-risk strings do not get embedded or stored verbatim by default. A dry run should make it obvious what would be redacted and what would be skipped.

## Troubleshooting

**Issue: The scanner flags a false positive**
Solution: Keep the rule set deterministic and conservative. If a specific importer needs an override flag, expose one explicitly and document the tradeoff.

**Issue: A recipe fails because `patterns.json` is missing**
Solution: The contribution depends on this primitive. Keep the repo structure intact, or copy the `primitives/sensitive-data-redaction/` folder alongside the recipe when running it standalone.

**Issue: Users complain that too much context is removed**
Solution: The rule set should bias toward placeholder replacement, not blanket deletion. If a rule is dropping useful content, change it from `skip` to `redact` or tighten the regex.
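
As a hypothetical illustration (this rule is not shipped in `patterns.json`), a rule that skipped any content mentioning a vendor token could be narrowed to a `redact` rule that only masks the exact assignment:

```json
{
  "label": "Vendor token assignment (hypothetical)",
  "action": "redact",
  "placeholder": "[REDACTED_SECRET]",
  "pattern": "vendor_token\\s*[:=]\\s*[A-Za-z0-9]{16,}",
  "flags": "i"
}
```

The surrounding sentence survives with a placeholder instead of the whole email being dropped.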

## Extensions That Use This

- Today the policy is already wired into [Email History Import](../../recipes/email-history-import/) and [Obsidian Vault Import](../../recipes/obsidian-vault-import/).
- Future ingestion-focused extensions should use this primitive as their default policy layer.

## Further Reading

- [Contributing Guide](../../CONTRIBUTING.md)
- [Email History Import](../../recipes/email-history-import/)
- [Obsidian Vault Import](../../recipes/obsidian-vault-import/)
20 changes: 20 additions & 0 deletions primitives/sensitive-data-redaction/metadata.json
@@ -0,0 +1,20 @@
{
"name": "Sensitive Data Redaction",
"description": "A standard pre-ingest redaction pass for secrets and high-risk identifiers before external text is embedded or stored in Open Brain.",
"category": "primitives",
"author": {
"name": "Nate B. Jones",
"github": "NateBJones"
},
"version": "1.0.0",
"requires": {
"open_brain": true,
"services": [],
"tools": []
},
"tags": ["security", "privacy", "redaction", "ingestion", "sensitive-data"],
"difficulty": "intermediate",
"estimated_time": "20 minutes",
"created": "2026-04-01",
"updated": "2026-04-01"
}
93 changes: 93 additions & 0 deletions primitives/sensitive-data-redaction/patterns.json
@@ -0,0 +1,93 @@
{
"version": "1.0.0",
"rules": [
{
"label": "Private key block",
"action": "skip",
"pattern": "-----BEGIN [A-Z ]+ PRIVATE KEY-----",
"flags": "i"
},
{
"label": "OpenAI or OpenRouter API key",
"action": "redact",
"placeholder": "[REDACTED_API_KEY]",
"pattern": "sk-(?:or-v1-|proj-|live-)?[A-Za-z0-9_-]{20,}"
},
{
"label": "Stripe secret key",
"action": "redact",
"placeholder": "[REDACTED_API_KEY]",
"pattern": "sk_(?:live|test)_[A-Za-z0-9]{16,}"
},
{
"label": "Google API key",
"action": "redact",
"placeholder": "[REDACTED_API_KEY]",
"pattern": "AIza[0-9A-Za-z\\-_]{35}"
},
{
"label": "JWT token",
"action": "redact",
"placeholder": "[REDACTED_JWT]",
"pattern": "eyJ[A-Za-z0-9_-]{10,}\\.[A-Za-z0-9._-]{10,}\\.[A-Za-z0-9._-]{10,}"
},
{
"label": "GitHub token",
"action": "redact",
"placeholder": "[REDACTED_GITHUB_TOKEN]",
"pattern": "gh(?:p|s|o|u|r)_[A-Za-z0-9]{20,}"
},
{
"label": "Slack token",
"action": "redact",
"placeholder": "[REDACTED_SLACK_TOKEN]",
"pattern": "xox(?:b|p|a|o|r|s)-[A-Za-z0-9-]{10,}"
},
{
"label": "AWS access key",
"action": "redact",
"placeholder": "[REDACTED_AWS_ACCESS_KEY]",
"pattern": "AKIA[0-9A-Z]{16}"
},
{
"label": "Supabase secret key",
"action": "redact",
"placeholder": "[REDACTED_SUPABASE_SECRET]",
"pattern": "sb_secret_[A-Za-z0-9]+"
},
{
"label": "Bearer token",
"action": "redact",
"placeholder": "[REDACTED_BEARER_TOKEN]",
"pattern": "Bearer\\s+[A-Za-z0-9._~+\\/-]{20,}",
"flags": "i"
},
{
"label": "Database connection string with credentials",
"action": "redact",
"placeholder": "[REDACTED_DB_CREDENTIALS]",
"pattern": "(?:postgres|postgresql|mysql|mongodb|redis):\\/\\/[^\\s:@/]+:[^\\s@/]+@",
"flags": "i"
},
{
"label": "Generic secret assignment",
"action": "redact",
"placeholder": "[REDACTED_SECRET]",
"pattern": "(?:password|passwd|secret|token|api[_-]?key|apikey|api[_-]?secret|access[_-]?token|auth[_-]?token)\\s*[:=]\\s*[\"']?[A-Za-z0-9_\\-./]{12,}",
"flags": "i"
},
{
"label": "URL token parameter",
"action": "redact",
"placeholder": "[REDACTED_URL_SECRET]",
"pattern": "https?:\\/\\/[^\\s]+[?&](?:token|code|access_token|refresh_token|api_key|apikey|auth|sig|signature)=[^\\s&#]+",
"flags": "i"
},
{
"label": "US social security number",
"action": "redact",
"placeholder": "[REDACTED_SSN]",
"pattern": "\\b\\d{3}-\\d{2}-\\d{4}\\b"
}
]
}
2 changes: 1 addition & 1 deletion recipes/README.md
@@ -2,7 +2,7 @@

https://github.com/user-attachments/assets/9454662f-2648-4928-8723-f7d52e94e9b8

Step-by-step builds that add a new capability to your Open Brain. Follow the instructions, run the code, get a new feature. Some recipes depend on canonical skill packs in [`skills/`](../skills/), and raw-text ingestion recipes may also depend on primitives in [`primitives/`](../primitives/) such as [Sensitive Data Redaction](../primitives/sensitive-data-redaction/).

| Recipe | What It Does |
| ------ | ------------ |
1 change: 1 addition & 0 deletions recipes/_template/README.md
@@ -11,6 +11,7 @@
- Working Open Brain setup ([guide](../../docs/01-getting-started.md))
- List any additional requirements (API keys, tools, services)
- If this recipe depends on a reusable skill from `skills/`, link it here and declare it in `metadata.json` via `requires_skills`
- If this recipe imports raw text for storage or embeddings, link the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive here and declare it in `metadata.json` via `requires_primitives`

## Credential Tracker

1 change: 1 addition & 0 deletions recipes/_template/metadata.json
@@ -12,6 +12,7 @@
"services": [],
"tools": []
},
"requires_primitives": [],
"requires_skills": [],
"tags": ["tag1", "tag2"],
"difficulty": "beginner",
27 changes: 20 additions & 7 deletions recipes/email-history-import/README.md
@@ -17,6 +17,7 @@ Pulls your Gmail history via the Gmail API and loads each email into Open Brain
- Google Cloud project with Gmail API enabled
- Gmail API OAuth credentials (Client ID + Client Secret)
- OpenRouter API key (same one from your Open Brain setup)
- [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive (required for the default pre-ingest masking pass)

## Credential Tracker

@@ -85,6 +86,7 @@ deno run --allow-net --allow-read --allow-write --allow-env pull-gmail.ts --list
| `--dry-run` | off | Preview without ingesting |
| `--list-labels` | off | List all Gmail labels and exit |
| `--ingest-endpoint` | off | Use `INGEST_URL`/`INGEST_KEY` instead of Supabase direct insert |
| `--no-redact` | off | Disable sensitive-data redaction before embedding/storage (not recommended) |

### Ingestion modes

@@ -96,11 +98,12 @@ deno run --allow-net --allow-read --allow-write --allow-env pull-gmail.ts --list

1. **Fetch** emails from Gmail API by label and time window
2. **Extract** body (base64 decode, HTML-to-text, strip quoted replies and signatures)
3. **Redact** sensitive strings via the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive
4. **Filter** out noise (no-reply senders, receipts, auto-generated, <10 words)
5. **Deduplicate** via sync-log (tracks Gmail message IDs already imported)
6. **Embed** content via OpenRouter (`text-embedding-3-small`)
7. **Classify** via LLM (topics, type, people, action items)
8. **Upsert** into Supabase with SHA-256 [content fingerprint dedup](../../recipes/content-fingerprint-dedup/) — re-running produces zero duplicates

### What gets filtered out

@@ -109,13 +112,21 @@ deno run --allow-net --allow-read --allow-write --allow-env pull-gmail.ts --list
- Emails with <10 words after cleanup
- Quoted replies and email signatures are stripped before ingestion

### What gets redacted or skipped

By default, the importer runs a pre-ingest redaction pass before embeddings and storage. It masks high-risk strings such as API keys, bearer tokens, connection strings with embedded credentials, SSNs, and similar values that create unnecessary blast radius if stored raw.

Some payloads are too risky to keep even with partial masking. If the importer detects a private key block, it skips that email entirely instead of storing a redacted version.

If you copied this recipe folder out of the repo, keep the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive folder with it. The script reads the canonical `patterns.json` from that primitive at runtime.
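
That lookup can be sketched as a small resolver that prefers the repo layout and falls back to a copy kept beside the recipe. The function name and candidate paths are assumptions for illustration, not the importer's actual code.

```typescript
// Hypothetical helper: find patterns.json, trying the repo layout first,
// then a sibling copy. `exists` is injected so the logic stays testable.
function resolvePatternsPath(exists: (p: string) => boolean): string | null {
  const candidates = [
    "../../primitives/sensitive-data-redaction/patterns.json", // repo layout
    "./sensitive-data-redaction/patterns.json",                // copied alongside
  ];
  for (const p of candidates) {
    if (exists(p)) return p;
  }
  // Fail loudly rather than silently ingesting unredacted text.
  return null;
}
```

In a Deno script the `exists` callback would wrap a filesystem check, and a `null` result should abort the import.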

## Expected Outcome

Each imported email becomes one row in the `thoughts` table:
- `content`: Email body with context prefix (`[Email from X | Subject: Y | Date: Z]`)
- `embedding`: 1536-dim vector for semantic search (truncated to 8K chars)
- `metadata`: LLM-extracted topics, type, people, action items, plus `source: "gmail"`, `gmail_id`, `gmail_labels`, `gmail_thread_id`, and redaction metadata when any replacements were applied
- `content_fingerprint`: Normalized SHA-256 hash for dedup (see [content fingerprint dedup recipe](../../recipes/content-fingerprint-dedup/))

## Troubleshooting

@@ -126,3 +137,5 @@
**Re-running imports the same emails:** The `sync-log.json` file tracks imported Gmail IDs. Delete it to re-import everything. Content fingerprints provide a second layer of dedup at the database level.

**Embedding/metadata errors:** Verify your `OPENROUTER_API_KEY` has credits. The script calls OpenRouter for both embedding generation and metadata extraction.

**Redaction policy file missing:** Keep the repo structure intact, or copy `primitives/sensitive-data-redaction/` alongside this recipe folder. The importer reads that primitive's `patterns.json` at runtime.
3 changes: 2 additions & 1 deletion recipes/email-history-import/metadata.json
@@ -12,9 +12,10 @@
"services": ["Gmail API"],
"tools": ["Deno"]
},
"requires_primitives": ["sensitive-data-redaction"],
"tags": ["email", "gmail", "import", "history"],
"difficulty": "intermediate",
"estimated_time": "30 minutes",
"created": "2026-03-10",
"updated": "2026-04-01"
}