diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 780f837f..3a9d1c0a 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -61,6 +61,27 @@ Your contribution's README must include these sections: 4. **Expected outcome** — What should the user see when it's working? Be specific. 5. **Troubleshooting** — At least 2-3 common issues and how to fix them. +### Sensitive Text Ingestion Rule + +If your contribution imports or captures raw text that will be embedded or stored in Open Brain, it must use the [Sensitive Data Redaction](primitives/sensitive-data-redaction/) primitive. + +This applies to: +- Email importers +- Chat export importers +- Social, blog, and document importers +- Bulk capture pipelines that ingest third-party text + +This does not apply to: +- Dashboards +- Schema-only contributions +- Metadata backfills or analytics jobs that do not ingest new raw text + +If this rule applies, your contribution must: +- Add `"requires_primitives": ["sensitive-data-redaction"]` to `metadata.json` +- Link the primitive in the README +- Apply redaction before embeddings and before database insert +- Default the redaction pass to on, even if you expose an explicit opt-out flag + ### Visual Formatting Requirements These patterns are required for **extensions** and strongly recommended for all other contributions. They match the [Getting Started guide](docs/01-getting-started.md) and make guides scannable, beginner-friendly, and consistent across the repo. 
@@ -237,6 +258,30 @@ Example for a recipe that depends on a reusable skill: } ``` +Example for a raw-text ingestion recipe that depends on the redaction primitive: + +```json +{ + "name": "Email History Import", + "description": "Import your Gmail history into Open Brain as searchable thoughts.", + "category": "recipes", + "author": { + "name": "Your Name", + "github": "your-github-username" + }, + "version": "1.0.0", + "requires": { + "open_brain": true, + "services": ["Gmail API"], + "tools": ["Deno"] + }, + "requires_primitives": ["sensitive-data-redaction"], + "tags": ["email", "gmail", "import"], + "difficulty": "intermediate", + "estimated_time": "30 minutes" +} +``` + ## PR Format **Title:** `[category] Short description` @@ -298,3 +343,5 @@ Every PR is checked against these rules. All must pass before human review. 13. **Internal links** — All relative links in READMEs resolve to existing files 14. **Remote MCP pattern** — Extensions and integrations must use remote MCP via Supabase Edge Functions. No `claude_desktop_config.json`, no local Node.js stdio servers. See the [Getting Started guide](docs/01-getting-started.md) for the correct pattern 15. **Tool audit link** — Extensions and integrations must link to the [MCP Tool Audit & Optimization Guide](docs/05-tool-audit.md) in their README. This ensures users are aware of tool surface area management as they add capabilities + +For ingestion contributions, human review will also check that the [Sensitive Data Redaction](primitives/sensitive-data-redaction/) primitive is declared and applied before embeddings/storage. diff --git a/primitives/README.md b/primitives/README.md index fd05485b..563081a6 100644 --- a/primitives/README.md +++ b/primitives/README.md @@ -11,6 +11,7 @@ Primitives are reusable concept guides that show up in multiple extensions. 
Lear | [Common Troubleshooting](troubleshooting/) | Solutions for connection, deployment, and database issues | All extensions | | [Row Level Security](rls/) | PostgreSQL policies for multi-user data isolation | Extensions 4, 5, 6 | | [Shared MCP Server](shared-mcp/) | Giving others scoped access to parts of your brain | Extension 4 | +| [Sensitive Data Redaction](sensitive-data-redaction/) | Pre-ingest masking and skipping of secrets before storage or embeddings | Email History Import, Obsidian Vault Import | ## How Primitives Work diff --git a/primitives/sensitive-data-redaction/README.md b/primitives/sensitive-data-redaction/README.md new file mode 100644 index 00000000..002104c2 --- /dev/null +++ b/primitives/sensitive-data-redaction/README.md @@ -0,0 +1,97 @@ +# Sensitive Data Redaction + +> A standard pre-ingest pass for masking or skipping sensitive strings before external text is embedded or stored in Open Brain. + +## What It Is + +Sensitive Data Redaction is the baseline safety layer for ingestion contributions. Its job is simple: preserve the useful context of imported text while removing exact strings that create unnecessary risk if they land in embeddings, stored content, logs, exports, or downstream AI retrieval. + +This primitive is not a full enterprise DLP system. It is a deterministic, maintainable default for a solo-operator stack. When a contribution imports raw external or user-authored text into Open Brain, it should run this pass before embedding and before database insert. + +## Why It Matters + +Most imported content is valuable because of its meaning, not because it contains exact credentials or high-risk identifiers. An email that says a client shared a production Stripe key is useful memory. The exact Stripe key is not useful memory. It is a liability. + +That distinction is the policy: + +- Keep semantic context. +- Remove exact secrets. +- Skip content entirely when the payload is too dangerous to keep, such as private key blocks. 
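The three-line policy above can be sketched directly. This is a minimal illustration, not the primitive's implementation; the two rule shapes are borrowed from entries in `patterns.json`:

```python
import re

# Illustrative rules only -- the canonical rule set lives in patterns.json.
RULES = [
    # Too dangerous to keep even partially: reject the whole document.
    {"label": "Private key block", "action": "skip",
     "pattern": re.compile(r"-----BEGIN [A-Z ]+ PRIVATE KEY-----")},
    # Exact secret strings get a placeholder; surrounding context survives.
    {"label": "Stripe secret key", "action": "redact",
     "placeholder": "[REDACTED_API_KEY]",
     "pattern": re.compile(r"sk_(?:live|test)_[A-Za-z0-9]{16,}")},
]


def apply_policy(text):
    """Return (text, skipped) after one redact/skip pass."""
    for rule in RULES:
        if rule["action"] == "skip":
            if rule["pattern"].search(text):
                return text, True
        else:
            text = rule["pattern"].sub(rule["placeholder"], text)
    return text, False


clean, skipped = apply_policy(
    "Client shared key sk_live_abcdefgh12345678 for the prod account."
)
# clean -> "Client shared key [REDACTED_API_KEY] for the prod account."
```

The sentence still carries its meaning after the pass; only the exact key is gone.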
+ +This protects the obvious high-risk cases without turning Open Brain into a sterile archive. Your AI still remembers what happened. It just does not keep live credentials around when a placeholder will do. + +## What Must Require This Primitive + +Any recipe, integration, or extension that imports, syncs, scrapes, forwards, summarizes, or bulk-captures raw text before storage or embedding must declare this primitive in `metadata.json` and link it in its README. + +That includes: + +- Email and inbox importers +- Chat export importers +- Social, blog, and document importers +- Automated capture pipelines that ingest raw third-party text + +That does not include: + +- Dashboards +- Schema-only contributions +- Analytics or metadata backfills that do not ingest new raw text + +## How It Works + +The primitive ships a canonical `patterns.json` file with deterministic regex rules and two actions: + +- `redact`: replace the exact sensitive string with a placeholder such as `[REDACTED_API_KEY]` +- `skip`: reject the content entirely because partial masking is not enough + +The intended pipeline is: + +1. Normalize and clean imported text. +2. Run sensitive-data redaction. +3. If the content is marked `skip`, do not embed or insert it. +4. If the content is redacted, embed and store the redacted version. +5. Record redaction labels/counts in metadata when helpful. + +## Common Patterns + +### Redact In Place + +Use redaction for API keys, bearer tokens, connection strings with embedded credentials, SSNs, reset links, and other exact strings that create blast radius if retrieved verbatim later. + +### Skip Entire Content + +Use skip rules for private key blocks and similar payloads where storing a partially masked version still creates too much risk or too little value. + +## Step-by-Step Guide + +1. Add `"requires_primitives": ["sensitive-data-redaction"]` to the contribution metadata. +2. 
Link this primitive in the contribution README prerequisites or ingestion section.
+3. Apply the policy before embeddings and before database insert.
+4. Default the redaction pass to on. If you expose an opt-out flag, make it explicit and clearly marked as not recommended.
+5. Log what happened. At minimum, report redacted counts and skipped items so users can sanity-check imports.
+
+## Expected Outcome
+
+An ingestion contribution that uses this primitive keeps the useful meaning of imported content while masking exact secrets. Users can still search and retrieve context, but high-risk strings do not get embedded or stored verbatim by default. A dry run should make it obvious what would be redacted and what would be skipped.
+
+## Troubleshooting
+
+**Issue: The scanner flags a false positive**
+Solution: Tighten the offending regex so it no longer matches the benign string, keeping the rule set deterministic and conservative. If a specific importer needs an override flag, expose one explicitly and document the tradeoff.
+
+**Issue: A recipe fails because `patterns.json` is missing**
+Solution: The contribution depends on this primitive. Keep the repo structure intact, or copy the `primitives/sensitive-data-redaction/` folder alongside the recipe when running it standalone.
+
+**Issue: Users complain that too much context is removed**
+Solution: The rule set should bias toward placeholder replacement, not blanket deletion. If a rule is dropping useful content, change it from `skip` to `redact` or tighten the regex.
+
+## Extensions That Use This
+
+- The policy is already wired into [Email History Import](../../recipes/email-history-import/) and [Obsidian Vault Import](../../recipes/obsidian-vault-import/).
+- Future ingestion-focused extensions should use this primitive as their default policy layer.
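As a compact reference, the intended pipeline from How It Works can be sketched end to end. `toy_redact` is a deliberately tiny stand-in for the real policy pass, and the embed/insert steps are left abstract:

```python
import re


def toy_redact(text):
    """Stand-in for the primitive's pass: mask `token=...` assignments.

    Returns (text, skipped, findings); the real pass is driven by
    patterns.json rather than this single hard-coded rule.
    """
    findings = re.findall(r"token=\S+", text)
    return re.sub(r"token=\S+", "token=[REDACTED_SECRET]", text), False, findings


def ingest(raw_text, redact):
    """Sketch of the intended pipeline; embedding and storage stay abstract."""
    text = " ".join(raw_text.split())        # 1. normalize and clean
    text, skipped, findings = redact(text)   # 2. run sensitive-data redaction
    if skipped:
        return None                          # 3. skip -> never embed or insert
    return {                                 # 4. embed/store the redacted version
        "content": text,
        "metadata": {"redaction_count": len(findings)},  # 5. record counts
    }


row = ingest("Password reset link  token=abc123 sent to client", toy_redact)
```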
+ +## Further Reading + +- [Contributing Guide](../../CONTRIBUTING.md) +- [Email History Import](../../recipes/email-history-import/) +- [Obsidian Vault Import](../../recipes/obsidian-vault-import/) diff --git a/primitives/sensitive-data-redaction/metadata.json b/primitives/sensitive-data-redaction/metadata.json new file mode 100644 index 00000000..159f8086 --- /dev/null +++ b/primitives/sensitive-data-redaction/metadata.json @@ -0,0 +1,20 @@ +{ + "name": "Sensitive Data Redaction", + "description": "A standard pre-ingest redaction pass for secrets and high-risk identifiers before external text is embedded or stored in Open Brain.", + "category": "primitives", + "author": { + "name": "Nate B. Jones", + "github": "NateBJones" + }, + "version": "1.0.0", + "requires": { + "open_brain": true, + "services": [], + "tools": [] + }, + "tags": ["security", "privacy", "redaction", "ingestion", "sensitive-data"], + "difficulty": "intermediate", + "estimated_time": "20 minutes", + "created": "2026-04-01", + "updated": "2026-04-01" +} diff --git a/primitives/sensitive-data-redaction/patterns.json b/primitives/sensitive-data-redaction/patterns.json new file mode 100644 index 00000000..7c3c4098 --- /dev/null +++ b/primitives/sensitive-data-redaction/patterns.json @@ -0,0 +1,93 @@ +{ + "version": "1.0.0", + "rules": [ + { + "label": "Private key block", + "action": "skip", + "pattern": "-----BEGIN [A-Z ]+ PRIVATE KEY-----", + "flags": "i" + }, + { + "label": "OpenAI or OpenRouter API key", + "action": "redact", + "placeholder": "[REDACTED_API_KEY]", + "pattern": "sk-(?:or-v1-|proj-|live-)?[A-Za-z0-9]{20,}" + }, + { + "label": "Stripe secret key", + "action": "redact", + "placeholder": "[REDACTED_API_KEY]", + "pattern": "sk_(?:live|test)_[A-Za-z0-9]{16,}" + }, + { + "label": "Google API key", + "action": "redact", + "placeholder": "[REDACTED_API_KEY]", + "pattern": "AIza[0-9A-Za-z\\-_]{35}" + }, + { + "label": "JWT token", + "action": "redact", + "placeholder": "[REDACTED_JWT]", + 
"pattern": "eyJ[A-Za-z0-9_-]{10,}\\.[A-Za-z0-9._-]{10,}\\.[A-Za-z0-9._-]{10,}" + }, + { + "label": "GitHub token", + "action": "redact", + "placeholder": "[REDACTED_GITHUB_TOKEN]", + "pattern": "gh(?:p|s|o|u|r)_[A-Za-z0-9]{20,}" + }, + { + "label": "Slack token", + "action": "redact", + "placeholder": "[REDACTED_SLACK_TOKEN]", + "pattern": "xox(?:b|p|a|o|r|s)-[A-Za-z0-9-]{10,}" + }, + { + "label": "AWS access key", + "action": "redact", + "placeholder": "[REDACTED_AWS_ACCESS_KEY]", + "pattern": "AKIA[0-9A-Z]{16}" + }, + { + "label": "Supabase secret key", + "action": "redact", + "placeholder": "[REDACTED_SUPABASE_SECRET]", + "pattern": "sb_secret_[A-Za-z0-9]+" + }, + { + "label": "Bearer token", + "action": "redact", + "placeholder": "[REDACTED_BEARER_TOKEN]", + "pattern": "Bearer\\s+[A-Za-z0-9._~+\\/-]{20,}", + "flags": "i" + }, + { + "label": "Database connection string with credentials", + "action": "redact", + "placeholder": "[REDACTED_DB_CREDENTIALS]", + "pattern": "(?:postgres|postgresql|mysql|mongodb|redis):\\/\\/[^\\s:@/]+:[^\\s@/]+@", + "flags": "i" + }, + { + "label": "Generic secret assignment", + "action": "redact", + "placeholder": "[REDACTED_SECRET]", + "pattern": "(?:password|passwd|secret|token|api[_-]?key|apikey|api[_-]?secret|access[_-]?token|auth[_-]?token)\\s*[:=]\\s*[\"']?[A-Za-z0-9_\\-./]{12,}", + "flags": "i" + }, + { + "label": "URL token parameter", + "action": "redact", + "placeholder": "[REDACTED_URL_SECRET]", + "pattern": "https?:\\/\\/[^\\s]+[?&](?:token|code|access_token|refresh_token|api_key|apikey|auth|sig|signature)=[^\\s&#]+", + "flags": "i" + }, + { + "label": "US social security number", + "action": "redact", + "placeholder": "[REDACTED_SSN]", + "pattern": "\\b\\d{3}-\\d{2}-\\d{4}\\b" + } + ] +} diff --git a/recipes/README.md b/recipes/README.md index c46c8bb2..6312d155 100644 --- a/recipes/README.md +++ b/recipes/README.md @@ -2,7 +2,7 @@ https://github.com/user-attachments/assets/9454662f-2648-4928-8723-f7d52e94e9b8 
-Step-by-step builds that add a new capability to your Open Brain. Follow the instructions, run the code, get a new feature. Some recipes depend on canonical skill packs in [`skills/`](../skills/); those recipes should install the skill first, then use the recipe for workflow and composition. +Step-by-step builds that add a new capability to your Open Brain. Follow the instructions, run the code, get a new feature. Some recipes depend on canonical skill packs in [`skills/`](../skills/), and raw-text ingestion recipes may also depend on primitives in [`primitives/`](../primitives/) such as [Sensitive Data Redaction](../primitives/sensitive-data-redaction/). | Recipe | What It Does | | ------ | ------------ | diff --git a/recipes/_template/README.md b/recipes/_template/README.md index b1d8deab..849e37a3 100644 --- a/recipes/_template/README.md +++ b/recipes/_template/README.md @@ -11,6 +11,7 @@ - Working Open Brain setup ([guide](../../docs/01-getting-started.md)) - List any additional requirements (API keys, tools, services) - If this recipe depends on a reusable skill from `skills/`, link it here and declare it in `metadata.json` via `requires_skills` +- If this recipe imports raw text for storage or embeddings, link the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive here and declare it in `metadata.json` via `requires_primitives` ## Credential Tracker diff --git a/recipes/_template/metadata.json b/recipes/_template/metadata.json index 78185104..dd515f5a 100644 --- a/recipes/_template/metadata.json +++ b/recipes/_template/metadata.json @@ -12,6 +12,7 @@ "services": [], "tools": [] }, + "requires_primitives": [], "requires_skills": [], "tags": ["tag1", "tag2"], "difficulty": "beginner", diff --git a/recipes/email-history-import/README.md b/recipes/email-history-import/README.md index 40142e42..3d3722a6 100644 --- a/recipes/email-history-import/README.md +++ b/recipes/email-history-import/README.md @@ -17,6 +17,7 @@ Pulls your Gmail 
history via the Gmail API and loads each email into Open Brain - Google Cloud project with Gmail API enabled - Gmail API OAuth credentials (Client ID + Client Secret) - OpenRouter API key (same one from your Open Brain setup) +- [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive (required for the default pre-ingest masking pass) ## Credential Tracker @@ -85,6 +86,7 @@ deno run --allow-net --allow-read --allow-write --allow-env pull-gmail.ts --list | `--dry-run` | off | Preview without ingesting | | `--list-labels` | off | List all Gmail labels and exit | | `--ingest-endpoint` | off | Use `INGEST_URL`/`INGEST_KEY` instead of Supabase direct insert | +| `--no-redact` | off | Disable sensitive-data redaction before embedding/storage (not recommended) | ### Ingestion modes @@ -96,11 +98,12 @@ deno run --allow-net --allow-read --allow-write --allow-env pull-gmail.ts --list 1. **Fetch** emails from Gmail API by label and time window 2. **Extract** body (base64 decode, HTML-to-text, strip quoted replies and signatures) -3. **Filter** out noise (no-reply senders, receipts, auto-generated, <10 words) -4. **Deduplicate** via sync-log (tracks Gmail message IDs already imported) -5. **Embed** content via OpenRouter (`text-embedding-3-small`) -6. **Classify** via LLM (topics, type, people, action items) -7. **Upsert** into Supabase with SHA-256 [content fingerprint](../../primitives/content-fingerprint-dedup/README.md) — re-running produces zero duplicates +3. **Redact** sensitive strings via the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive +4. **Filter** out noise (no-reply senders, receipts, auto-generated, <10 words) +5. **Deduplicate** via sync-log (tracks Gmail message IDs already imported) +6. **Embed** content via OpenRouter (`text-embedding-3-small`) +7. **Classify** via LLM (topics, type, people, action items) +8. 
**Upsert** into Supabase with SHA-256 [content fingerprint](../../primitives/content-fingerprint-dedup/README.md) — re-running produces zero duplicates

### What gets filtered out

@@ -109,13 +112,21 @@ deno run --allow-net --allow-read --allow-write --allow-env pull-gmail.ts --list
- Emails with <10 words after cleanup
- Quoted replies and email signatures are stripped before ingestion

+### What gets redacted or skipped
+
+By default, the importer runs a pre-ingest redaction pass before embeddings and storage. It masks high-risk strings such as API keys, bearer tokens, connection strings with embedded credentials, SSNs, and similar values that create unnecessary blast radius if stored raw.
+
+Some payloads are too risky to keep even with partial masking. If the importer detects a private key block, it skips that email entirely instead of storing a redacted version.
+
+If you copied this recipe folder out of the repo, keep the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive folder with it. The script reads the canonical `patterns.json` from that primitive at runtime.
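As a rough illustration of that masking pass, here is a language-neutral Python sketch using two of the shipped rule shapes (the actual importer is a Deno script that reads the full `patterns.json`):

```python
import re

# Two rule shapes mirroring entries in the primitive's patterns.json.
SSN = (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]")
BEARER = (
    re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]{20,}", re.IGNORECASE),
    "[REDACTED_BEARER_TOKEN]",
)


def mask(text):
    """Apply redact-style rules in place, keeping surrounding prose."""
    for pattern, placeholder in (SSN, BEARER):
        text = pattern.sub(placeholder, text)
    return text


body = "My SSN is 123-45-6789. Header was: Bearer abcdefghijklmnopqrstuv"
# mask(body) keeps the sentence but replaces both exact values.
```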
## Expected Outcome

Each imported email becomes one row in the `thoughts` table:

- `content`: Email body with context prefix (`[Email from X | Subject: Y | Date: Z]`)
- `embedding`: 1536-dim vector for semantic search (truncated to 8K chars)
-- `metadata`: LLM-extracted topics, type, people, action items, plus `source: "gmail"`, `gmail_id`, `gmail_labels`, `gmail_thread_id`
-- `content_fingerprint`: Normalized SHA-256 hash for dedup (see [content fingerprint primitive](../../primitives/content-fingerprint-dedup/README.md))
+- `metadata`: LLM-extracted topics, type, people, action items, plus `source: "gmail"`, `gmail_id`, `gmail_labels`, `gmail_thread_id`, and redaction metadata when any replacements were applied
+- `content_fingerprint`: Normalized SHA-256 hash for dedup (see [content fingerprint primitive](../../primitives/content-fingerprint-dedup/README.md))

## Troubleshooting

@@ -126,3 +137,5 @@ Each imported email becomes one row in the `thoughts` table:

**Re-running imports the same emails:** The `sync-log.json` file tracks imported Gmail IDs. Delete it to re-import everything. Content fingerprints provide a second layer of dedup at the database level.

**Embedding/metadata errors:** Verify your `OPENROUTER_API_KEY` has credits. The script calls OpenRouter for both embedding generation and metadata extraction.
+
+**Redaction policy file missing:** Keep the repo structure intact, or copy `primitives/sensitive-data-redaction/` alongside this recipe folder. The importer reads that primitive's `patterns.json` at runtime.
diff --git a/recipes/email-history-import/metadata.json b/recipes/email-history-import/metadata.json index b986e1d1..ae5ec7f0 100644 --- a/recipes/email-history-import/metadata.json +++ b/recipes/email-history-import/metadata.json @@ -12,9 +12,10 @@ "services": ["Gmail API"], "tools": ["Deno"] }, + "requires_primitives": ["sensitive-data-redaction"], "tags": ["email", "gmail", "import", "history"], "difficulty": "intermediate", "estimated_time": "30 minutes", "created": "2026-03-10", - "updated": "2026-03-10" + "updated": "2026-04-01" } diff --git a/recipes/email-history-import/pull-gmail.ts b/recipes/email-history-import/pull-gmail.ts index 42b72e76..2b6ad91e 100644 --- a/recipes/email-history-import/pull-gmail.ts +++ b/recipes/email-history-import/pull-gmail.ts @@ -22,6 +22,7 @@ * --limit=N Max emails to process (default: 50) * --list-labels List all Gmail labels and exit * --ingest-endpoint Use INGEST_URL/INGEST_KEY instead of Supabase direct + * --no-redact Disable sensitive-data redaction before ingest */ // ─── Configuration ─────────────────────────────────────────────────────────── @@ -30,6 +31,10 @@ const SCRIPT_DIR = new URL(".", import.meta.url).pathname; const CREDENTIALS_PATH = `${SCRIPT_DIR}credentials.json`; const TOKEN_PATH = `${SCRIPT_DIR}token.json`; const SYNC_LOG_PATH = `${SCRIPT_DIR}sync-log.json`; +const REDACTION_POLICY_URL = new URL( + "../../primitives/sensitive-data-redaction/patterns.json", + import.meta.url, +); const GMAIL_API = "https://gmail.googleapis.com/gmail/v1/users/me"; const SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]; @@ -75,6 +80,121 @@ async function sha256(text: string): Promise { .join(""); } +// ─── Sensitive Data Redaction ─────────────────────────────────────────────── + +interface SensitiveDataRule { + label: string; + action: "redact" | "skip"; + pattern: string; + placeholder?: string; + flags?: string; +} + +interface SensitiveDataPolicyFile { + version: string; + rules: SensitiveDataRule[]; +} + 
+interface RedactionFinding {
+  label: string;
+  action: "redact" | "skip";
+  count: number;
+}
+
+interface RedactionResult {
+  text: string;
+  skipped: boolean;
+  skipLabel?: string;
+  findings: RedactionFinding[];
+  totalRedactions: number;
+  policyVersion: string;
+}
+
+let sensitiveDataPolicyPromise: Promise<SensitiveDataPolicyFile> | null = null;
+
+async function loadSensitiveDataPolicy(): Promise<SensitiveDataPolicyFile> {
+  if (!sensitiveDataPolicyPromise) {
+    sensitiveDataPolicyPromise = (async () => {
+      try {
+        const text = await Deno.readTextFile(REDACTION_POLICY_URL);
+        const parsed = JSON.parse(text) as Partial<SensitiveDataPolicyFile>;
+        if (!parsed.rules || !Array.isArray(parsed.rules)) {
+          throw new Error("Invalid patterns.json format");
+        }
+        return {
+          version: parsed.version || "unknown",
+          rules: parsed.rules as SensitiveDataRule[],
+        };
+      } catch (err) {
+        const detail = err instanceof Error ? err.message : String(err);
+        throw new Error(
+          `Sensitive-data redaction requires ${REDACTION_POLICY_URL.pathname}. ` +
+            `Keep the repo structure intact or pass --no-redact to opt out. 
(${detail})`,
+        );
+      }
+    })();
+  }
+
+  return await sensitiveDataPolicyPromise;
+}
+
+function buildRuleRegex(rule: SensitiveDataRule, global = false): RegExp {
+  let flags = rule.flags || "";
+  if (global && !flags.includes("g")) flags += "g";
+  return new RegExp(rule.pattern, flags);
+}
+
+async function applySensitiveDataPolicy(text: string): Promise<RedactionResult> {
+  const policy = await loadSensitiveDataPolicy();
+  let current = text;
+  const findings: RedactionFinding[] = [];
+  let totalRedactions = 0;
+
+  for (const rule of policy.rules) {
+    if (rule.action === "skip") {
+      if (buildRuleRegex(rule).test(current)) {
+        return {
+          text: current,
+          skipped: true,
+          skipLabel: rule.label,
+          findings: [{ label: rule.label, action: "skip", count: 1 }],
+          totalRedactions: 0,
+          policyVersion: policy.version,
+        };
+      }
+      continue;
+    }
+
+    const regex = buildRuleRegex(rule, true);
+    const matches = current.match(regex);
+    if (!matches || matches.length === 0) continue;
+
+    current = current.replace(regex, rule.placeholder || "[REDACTED]");
+    findings.push({ label: rule.label, action: "redact", count: matches.length });
+    totalRedactions += matches.length;
+  }
+
+  return {
+    text: current,
+    skipped: false,
+    findings,
+    totalRedactions,
+    policyVersion: policy.version,
+  };
+}
+
+function formatRedactionFindings(findings: RedactionFinding[]): string {
+  const counts = new Map<string, number>();
+  for (const finding of findings) {
+    if (finding.action !== "redact") continue;
+    counts.set(finding.label, (counts.get(finding.label) || 0) + finding.count);
+  }
+
+  return [...counts.entries()]
+    .map(([label, count]) => (count > 1 ? 
`${label} x${count}` : label)) + .join(", "); +} + // ─── CLI Argument Parsing ──────────────────────────────────────────────────── interface CliArgs { @@ -84,6 +204,7 @@ interface CliArgs { limit: number; listLabels: boolean; ingestEndpoint: boolean; + noRedact: boolean; } function parseArgs(): CliArgs { @@ -94,6 +215,7 @@ function parseArgs(): CliArgs { limit: 50, listLabels: false, ingestEndpoint: false, + noRedact: false, }; for (const arg of Deno.args) { @@ -109,6 +231,8 @@ function parseArgs(): CliArgs { args.listLabels = true; } else if (arg === "--ingest-endpoint") { args.ingestEndpoint = true; + } else if (arg === "--no-redact") { + args.noRedact = true; } } @@ -790,6 +914,7 @@ function buildEmailContent( async function main() { const args = parseArgs(); + const redactionPolicyVersion = args.noRedact ? null : (await loadSensitiveDataPolicy()).version; const creds = await loadCredentials(); const accessToken = await authorize(creds); @@ -823,6 +948,11 @@ async function main() { console.log(` Window: ${args.window}${query ? ` (${query})` : ""}`); console.log(` Limit: ${args.limit}`); console.log(` Mode: ${ingestMode}`); + console.log( + ` Redaction: ${ + args.noRedact ? 
"disabled (--no-redact)" : `Sensitive Data Redaction ${redactionPolicyVersion}` + }`, + ); if (!args.dryRun) { if (useEndpoint) { @@ -856,6 +986,9 @@ async function main() { let ingested = 0; let errors = 0; let totalWords = 0; + let sensitiveSkipped = 0; + let redactedEmails = 0; + let totalRedactionHits = 0; for (const ref of messageRefs) { if (syncLog.ingested_ids[ref.id]) { @@ -871,6 +1004,42 @@ async function main() { continue; } + let redactionCount = 0; + let redactionLabels: string[] = []; + let redactionVersion: string | undefined; + let redactionSummary = ""; + + if (!args.noRedact) { + const [subjectResult, bodyResult] = await Promise.all([ + applySensitiveDataPolicy(email.subject), + applySensitiveDataPolicy(email.body), + ]); + + if (subjectResult.skipped || bodyResult.skipped) { + sensitiveSkipped++; + const skipLabel = subjectResult.skipLabel || bodyResult.skipLabel || "sensitive content"; + console.log(`SKIPPED (sensitive-data-redaction): ${email.subject || "(no subject)"}`); + console.log(` Reason: ${skipLabel}`); + console.log(); + continue; + } + + const findings = [...subjectResult.findings, ...bodyResult.findings]; + redactionCount = subjectResult.totalRedactions + bodyResult.totalRedactions; + redactionLabels = [...new Set(findings.map((finding) => finding.label))]; + redactionVersion = subjectResult.policyVersion; + redactionSummary = formatRedactionFindings(findings); + + email.subject = subjectResult.text; + email.body = bodyResult.text; + email.wordCount = wordCount(email.body); + + if (redactionCount > 0) { + redactedEmails++; + totalRedactionHits += redactionCount; + } + } + processed++; totalWords += email.wordCount; @@ -883,6 +1052,9 @@ async function main() { ` From: ${email.from} | ${email.wordCount} words | ${new Date(email.date).toLocaleDateString()}`, ); console.log(` Labels: ${readableLabels.join(", ")}`); + if (redactionCount > 0) { + console.log(` Redacted: ${redactionSummary}`); + } if (args.dryRun) { console.log(` 
"${email.body.slice(0, 120)}..."`); @@ -899,6 +1071,12 @@ async function main() { gmail_id: email.gmailId, gmail_thread_id: email.threadId, }; + if (redactionCount > 0) { + emailMeta.redaction_applied = true; + emailMeta.redaction_count = redactionCount; + emailMeta.redaction_labels = redactionLabels; + emailMeta.redaction_version = redactionVersion; + } const content = buildEmailContent(email.body, email.from, email.subject, email.date); const result = useEndpoint @@ -934,6 +1112,12 @@ async function main() { } console.log(` Processed: ${processed}`); console.log(` Skipped (noise): ${skipped}`); + if (sensitiveSkipped > 0) { + console.log(` Skipped (sensitive): ${sensitiveSkipped}`); + } + if (redactedEmails > 0) { + console.log(` Redacted emails: ${redactedEmails} (${totalRedactionHits} replacements)`); + } console.log(` Total words: ${totalWords.toLocaleString()}`); if (!args.dryRun) { console.log(` Ingested: ${ingested}`); diff --git a/recipes/obsidian-vault-import/README.md b/recipes/obsidian-vault-import/README.md index 15ae7202..2c33a335 100644 --- a/recipes/obsidian-vault-import/README.md +++ b/recipes/obsidian-vault-import/README.md @@ -37,6 +37,7 @@ No special configuration is needed for any of these — the script handles them - Python 3.10+ - Your Supabase project URL and API key - OpenRouter API key (for embeddings and optional LLM chunking) +- [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive (required for the default pre-ingest masking pass) - Recommended: add a `content_fingerprint` column and unique index for database-level dedup (see [Re-running and Deduplication](#re-running-and-deduplication)) ## Credential Tracker @@ -107,7 +108,7 @@ FILE LOCATION | `--after DATE` | Only import notes modified after this date (YYYY-MM-DD) | | `--no-llm` | Disable LLM chunking — heading splits only, zero API cost beyond embeddings | | `--no-embed` | Skip embedding generation (insert thoughts without vectors) | -| `--no-secret-scan` | 
Disable secret detection (not recommended) | +| `--no-redact` | Disable sensitive-data redaction and skip pass (not recommended) | | `--verbose` | Show detailed progress for each note | | `--report` | Generate an `import-report.md` summary file | @@ -137,18 +138,17 @@ The script automatically skips notes that wouldn't make useful thoughts. Run wit python import-obsidian.py /path/to/vault --skip-folders "Archive,Files,patterns" ``` -## Secret Detection +## Sensitive Data Redaction -The script scans each thought for potential secrets before embedding or inserting. Thoughts containing API keys, tokens, passwords, or connection strings are skipped and logged — they never reach your database. +This importer now uses the [Sensitive Data Redaction](../../primitives/sensitive-data-redaction/) primitive before embedding or inserting anything. -Detected patterns include: -- API keys (OpenAI, OpenRouter, AWS, GitHub, Supabase) -- JWT tokens -- Private key blocks -- Connection strings with embedded credentials -- Generic secret assignments (`password=`, `token=`, `api_key=`, etc.) +The policy has two actions: +- **Redact in place** for API keys, bearer tokens, connection strings with embedded credentials, SSNs, and similar exact strings that should not be stored raw +- **Skip entirely** for high-risk payloads like private key blocks -The dry run (`--dry-run`) also runs the scanner, so you can review what would be flagged before a live import. If the scanner flags a false positive, use `--no-secret-scan` to disable it. +That means useful notes keep their meaning, but the dangerous exact value is replaced with a placeholder such as `[REDACTED_API_KEY]`. + +The dry run (`--dry-run`) also runs this pass, so you can review what would be redacted or skipped before a live import. If the scanner flags a false positive or you intentionally want raw import behavior, use `--no-redact` to disable it. 
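The dry-run review described above can be sketched as a tally over a batch of notes. The two rules here are illustrative stand-ins; the importer loads the primitive's `patterns.json`:

```python
import re

# Illustrative stand-ins for two rule shapes from patterns.json.
GENERIC_SECRET = re.compile(
    r"(?:password|secret|token|api[_-]?key)\s*[:=]\s*[\"']?[A-Za-z0-9_\-./]{12,}",
    re.IGNORECASE,
)
PRIVATE_KEY = re.compile(r"-----BEGIN [A-Z ]+ PRIVATE KEY-----")


def dry_run(notes):
    """Report what a live import would do, without touching the database."""
    would_redact = would_skip = 0
    for note in notes:
        if PRIVATE_KEY.search(note):
            would_skip += 1       # skipped entirely, never embedded
        elif GENERIC_SECRET.search(note):
            would_redact += 1     # stored with placeholder replacement
    return would_redact, would_skip


notes = [
    "Deploy notes: api_key = abcd1234efgh5678ijkl",
    "-----BEGIN RSA PRIVATE KEY----- ...",
    "Plain meeting notes, nothing sensitive",
]
```

Reviewing the two counts before a live import is exactly what `--dry-run` is for.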
## How Chunking Works @@ -217,6 +217,8 @@ After a successful import, searching your Open Brain for topics from your vault } ``` +If redaction was applied, the thought metadata also includes redaction counts, labels, and the policy version used for that insert. + You can filter by source to find only Obsidian-imported thoughts: search with `{"source": "obsidian"}` as a metadata filter. ## Troubleshooting @@ -239,5 +241,8 @@ Solution: The sync log prevents duplicates on re-runs. For stronger protection, **Issue: Import aborts after "10 consecutive insert failures"** Solution: The script stops early if 10 inserts fail in a row to avoid wasting embedding credits. Check your Supabase connection, verify the `thoughts` table exists, and confirm your API key is correct. The preflight check catches most of these, but a connection drop mid-import can also trigger this. -**Issue: Notes flagged as containing secrets (false positive)** -Solution: Review the flagged content. If it's a false positive (e.g., a note discussing API key formats without containing real keys), re-run with `--no-secret-scan`. The scanner is intentionally conservative — it's better to flag and skip than to store a real secret in your database. +**Issue: Notes flagged by the redaction pass (false positive)** +Solution: Review the flagged content. If it's a false positive (for example, a note discussing API key formats without containing a real secret), re-run with `--no-redact`. The policy is intentionally conservative by default. + +**Issue: Redaction policy file missing** +Solution: Keep the repo structure intact, or copy `primitives/sensitive-data-redaction/` alongside this recipe folder. The importer reads that primitive's `patterns.json` at runtime. 
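The redaction metadata recorded on a thought can be sketched as a small summarizer. The `redaction_*` field names follow the email importer's metadata keys; treat the exact shape as illustrative for this recipe:

```python
def redaction_metadata(findings, policy_version):
    """Summarize redact findings into thought metadata.

    Skip findings never reach storage, so only redact actions are counted.
    """
    redacts = [f for f in findings if f["action"] == "redact"]
    if not redacts:
        return {}
    return {
        "redaction_applied": True,
        "redaction_count": sum(f["count"] for f in redacts),
        "redaction_labels": sorted({f["label"] for f in redacts}),
        "redaction_version": policy_version,
    }


meta = redaction_metadata(
    [
        {"label": "JWT token", "action": "redact", "count": 2},
        {"label": "AWS access key", "action": "redact", "count": 1},
    ],
    "1.0.0",
)
```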
diff --git a/recipes/obsidian-vault-import/import-obsidian.py b/recipes/obsidian-vault-import/import-obsidian.py index da55dbb3..e8010db5 100644 --- a/recipes/obsidian-vault-import/import-obsidian.py +++ b/recipes/obsidian-vault-import/import-obsidian.py @@ -62,33 +62,102 @@ MAX_RETRIES = 3 RETRY_BACKOFF = 2 # seconds, doubles each retry -# Secret detection patterns — (label, compiled regex) -SECRET_PATTERNS = [ - ("OpenAI/OpenRouter API key", re.compile(r'sk-(?:or-v1-|proj-|live-)?[a-zA-Z0-9]{20,}')), - ("JWT token", re.compile(r'eyJ[a-zA-Z0-9_-]{20,}\.[a-zA-Z0-9_-]{20,}')), - ("GitHub token", re.compile(r'gh[ps]_[a-zA-Z0-9]{36,}')), - ("GitHub OAuth token", re.compile(r'gho_[a-zA-Z0-9]{36,}')), - ("AWS access key", re.compile(r'AKIA[0-9A-Z]{16}')), - ("Supabase key", re.compile(r'sbp_[a-zA-Z0-9]{20,}')), - ("Private key block", re.compile(r'-----BEGIN [A-Z ]+ PRIVATE KEY-----')), - ("Generic secret assignment", re.compile( - r'(?:password|secret|token|api_key|apikey|api_secret|access_token|auth_token)' - r'\s*[=:]\s*["\']?[a-zA-Z0-9_\-/.]{16,}', - re.IGNORECASE, - )), - ("Connection string with credentials", re.compile( - r'(?:postgres|mysql|mongodb|redis)://[^:]+:[^@]+@', - re.IGNORECASE, - )), -] - - -def scan_for_secrets(text: str) -> str | None: - """Return the label of the first secret pattern found, or None if clean.""" - for label, pattern in SECRET_PATTERNS: - if pattern.search(text): - return label - return None +REDACTION_POLICY_FILE = ( + Path(__file__).resolve().parents[2] + / "primitives" + / "sensitive-data-redaction" + / "patterns.json" +) +SENSITIVE_DATA_POLICY = None + + +def _regex_flags(flags: str) -> int: + mask = 0 + if "i" in flags: + mask |= re.IGNORECASE + if "m" in flags: + mask |= re.MULTILINE + if "s" in flags: + mask |= re.DOTALL + return mask + + +def load_sensitive_data_policy() -> dict: + global SENSITIVE_DATA_POLICY + if SENSITIVE_DATA_POLICY is None: + try: + SENSITIVE_DATA_POLICY = json.loads(REDACTION_POLICY_FILE.read_text()) + 
except Exception as exc: + raise RuntimeError( + "Sensitive-data redaction requires " + f"{REDACTION_POLICY_FILE}. Keep the repo structure intact or pass " + "--no-redact to opt out." + ) from exc + + rules = SENSITIVE_DATA_POLICY.get("rules") + if not isinstance(rules, list): + raise RuntimeError( + f"Invalid sensitive-data redaction policy file: {REDACTION_POLICY_FILE}" + ) + + return SENSITIVE_DATA_POLICY + + +def apply_sensitive_data_policy(text: str) -> dict: + policy = load_sensitive_data_policy() + current = text + findings = [] + total_redactions = 0 + + for rule in policy["rules"]: + label = rule["label"] + action = rule["action"] + pattern = re.compile(rule["pattern"], _regex_flags(rule.get("flags", ""))) + + if action == "skip": + if pattern.search(current): + return { + "text": current, + "skipped": True, + "skip_label": label, + "findings": [{"label": label, "action": "skip", "count": 1}], + "total_redactions": 0, + "policy_version": policy.get("version", "unknown"), + } + continue + + matches = list(pattern.finditer(current)) + if not matches: + continue + + current = pattern.sub(rule.get("placeholder", "[REDACTED]"), current) + findings.append({"label": label, "action": "redact", "count": len(matches)}) + total_redactions += len(matches) + + return { + "text": current, + "skipped": False, + "skip_label": None, + "findings": findings, + "total_redactions": total_redactions, + "policy_version": policy.get("version", "unknown"), + } + + +def format_redaction_findings(findings: list[dict]) -> str: + counts = {} + for finding in findings: + if finding["action"] != "redact": + continue + counts[finding["label"]] = counts.get(finding["label"], 0) + finding["count"] + + parts = [] + for label, count in counts.items(): + if count > 1: + parts.append(f"{label} x{count}") + else: + parts.append(label) + return ", ".join(parts) # Summarization prompt for long sections SUMMARIZATION_PROMPT = """You are extracting atomic thoughts from an Obsidian note section. 
@@ -478,8 +547,9 @@ def main(): help="Disable LLM chunking (heading splits only, no API cost)") parser.add_argument("--no-embed", action="store_true", help="Skip embedding generation (insert thoughts without vectors)") - parser.add_argument("--no-secret-scan", action="store_true", - help="Disable secret detection (not recommended)") + parser.add_argument("--no-redact", "--no-secret-scan", dest="no_redact", + action="store_true", + help="Disable sensitive-data redaction and skip pass (not recommended)") parser.add_argument("--verbose", action="store_true", help="Show detailed progress") parser.add_argument("--report", action="store_true", @@ -494,6 +564,14 @@ def main(): print(f"Warning: {vault_root} doesn't have a .obsidian/ folder — " "are you sure this is an Obsidian vault?", file=sys.stderr) + redaction_policy_version = None + if not args.no_redact: + try: + redaction_policy_version = load_sensitive_data_policy().get("version", "unknown") + except RuntimeError as exc: + print(f"Error: {exc}", file=sys.stderr) + sys.exit(1) + # Load env vars env_file = Path(__file__).parent / ".env" if env_file.exists(): @@ -585,6 +663,10 @@ def main(): print(f"Chunking: hybrid (headings + LLM fallback)") else: print(f"Chunking: headings only (--no-llm)") + if args.no_redact: + print("Redaction: disabled (--no-redact)") + else: + print(f"Redaction: Sensitive Data Redaction {redaction_policy_version}") print() # ── Stage 1+2: Walk + Parse ────────────────────────────────────────────── @@ -719,27 +801,43 @@ def main(): # ── Dry run summary ────────────────────────────────────────────────────── if args.dry_run: - # Scan for secrets even in dry run so users know before committing - dry_secrets = 0 - if not args.no_secret_scan: + dry_sensitive_skips = 0 + dry_redacted = 0 + dry_redaction_hits = 0 + if not args.no_redact: for t in all_thoughts: - secret_match = scan_for_secrets(t['content']) - if secret_match: - dry_secrets += 1 - title = t['metadata'].get('title', '?') - section = 
t['metadata'].get('section', '') - location = f"{title} > {section}" if section else title - print(f" SECRET DETECTED: {location} — {secret_match}") + result = apply_sensitive_data_policy(t['content']) + title = t['metadata'].get('title', '?') + section = t['metadata'].get('section', '') + location = f"{title} > {section}" if section else title + + if result['skipped']: + dry_sensitive_skips += 1 + print(f" WOULD SKIP: {location} — {result['skip_label']}") + continue + + if result['total_redactions']: + dry_redacted += 1 + dry_redaction_hits += result['total_redactions'] + print(f" WOULD REDACT: {location} — {format_redaction_findings(result['findings'])}") print() print("=== DRY RUN COMPLETE ===") print(f"Would import {len(all_thoughts)} thoughts from {len(filtered)} notes") - if dry_secrets: - print(f"Would skip {dry_secrets} thoughts containing potential secrets") + if dry_sensitive_skips: + print(f"Would skip {dry_sensitive_skips} thoughts containing high-risk sensitive data") + if dry_redacted: + print(f"Would redact {dry_redacted} thoughts ({dry_redaction_hits} replacements)") if args.verbose: print("\nSample thoughts:") for t in all_thoughts[:5]: - preview = t['content'][:120] + "..." if len(t['content']) > 120 else t['content'] + preview_content = t['content'] + if not args.no_redact: + preview_result = apply_sensitive_data_policy(t['content']) + if preview_result['skipped']: + continue + preview_content = preview_result['text'] + preview = preview_content[:120] + "..." 
if len(preview_content) > 120 else preview_content print(f" [{t['metadata']['folder']}] {preview}") if args.report: _write_report(all_thoughts, filtered, vault_root, args, skip_reasons, dry_run=True) @@ -756,20 +854,40 @@ def main(): embed_failures = 0 insert_failures = 0 consecutive_failures = 0 - secrets_skipped = 0 + sensitive_skipped = 0 + redacted_thoughts = 0 + redaction_hits = 0 successful_paths = {} # note_path → first insert timestamp for i, thought in enumerate(all_thoughts): - # Scan for secrets before embedding or inserting - if not args.no_secret_scan: - secret_match = scan_for_secrets(thought['content']) - if secret_match: - secrets_skipped += 1 + if not args.no_redact: + result = apply_sensitive_data_policy(thought['content']) + if result['skipped']: + sensitive_skipped += 1 title = thought['metadata'].get('title', '?') section = thought['metadata'].get('section', '') location = f"{title} > {section}" if section else title - print(f" SKIPPED (secret detected): {location} — {secret_match}", flush=True) + print(f" SKIPPED (sensitive-data-redaction): {location} — {result['skip_label']}", + flush=True) continue + if result['total_redactions']: + redacted_thoughts += 1 + redaction_hits += result['total_redactions'] + thought = { + **thought, + 'content': result['text'], + 'fingerprint': content_fingerprint(result['text']), + 'metadata': { + **thought['metadata'], + 'redaction_applied': True, + 'redaction_count': result['total_redactions'], + 'redaction_labels': list(dict.fromkeys( + finding['label'] for finding in result['findings'] + if finding['action'] == 'redact' + )), + 'redaction_version': result['policy_version'], + }, + } # Generate embedding (skip if --no-embed) embedding = None @@ -829,8 +947,10 @@ def main(): print(f" Thoughts inserted: {inserted}") if duplicates: print(f" Duplicates skipped: {duplicates}") - if secrets_skipped: - print(f" Secrets skipped: {secrets_skipped}") + if sensitive_skipped: + print(f" Sensitive skipped: 
{sensitive_skipped}") + if redacted_thoughts: + print(f" Thoughts redacted: {redacted_thoughts} ({redaction_hits} replacements)") if embed_failures: print(f" Embed failures: {embed_failures}") if insert_failures: diff --git a/recipes/obsidian-vault-import/metadata.json b/recipes/obsidian-vault-import/metadata.json index a4664d33..115bb31c 100644 --- a/recipes/obsidian-vault-import/metadata.json +++ b/recipes/obsidian-vault-import/metadata.json @@ -12,9 +12,10 @@ "services": ["OpenRouter"], "tools": ["Python 3.10+"] }, + "requires_primitives": ["sensitive-data-redaction"], "tags": ["obsidian", "import", "migration", "vault", "markdown", "pkm"], "difficulty": "intermediate", "estimated_time": "20 minutes setup + ~16 min per 1000 thoughts", "created": "2026-03-13", - "updated": "2026-03-23" + "updated": "2026-04-01" }
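Review aid: the redact/skip behavior this patch adds can be distilled into a self-contained sketch. The inline policy is a hypothetical stand-in for the primitive's `patterns.json`, and `apply_policy` is a simplified analogue of the importer's `apply_sensitive_data_policy`:

```python
import re

# Hypothetical policy mirroring the shape the importer reads from patterns.json.
POLICY = {
    "version": "1.0.0",
    "rules": [
        # High-risk payloads are skipped outright, never stored.
        {"label": "Private key block", "action": "skip",
         "pattern": r"-----BEGIN [A-Z ]+ PRIVATE KEY-----"},
        # Exact secret strings are replaced in place with a placeholder.
        {"label": "API key", "action": "redact",
         "pattern": r"sk-[a-zA-Z0-9]{20,}", "placeholder": "[REDACTED_API_KEY]"},
    ],
}

def apply_policy(text: str) -> dict:
    """Apply rules in order: a matching skip rule short-circuits;
    redact rules substitute placeholders and count replacements."""
    current = text
    total = 0
    for rule in POLICY["rules"]:
        pattern = re.compile(rule["pattern"])
        if rule["action"] == "skip":
            if pattern.search(current):
                return {"text": current, "skipped": True, "label": rule["label"]}
            continue
        current, n = pattern.subn(rule.get("placeholder", "[REDACTED]"), current)
        total += n
    return {"text": current, "skipped": False, "total": total}
```

Because skip rules return before any redacted text is emitted, a note containing both a private key and an API key is dropped rather than partially redacted, matching the importer's behavior when skip rules are ordered first in the policy.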