
[integrations] Smart ingest edge function#100

Closed
alanshurafa wants to merge 6 commits into NateBJones-Projects:main from alanshurafa:contrib/alanshurafa/smart-ingest

Conversation

@alanshurafa
Contributor

Summary

  • Supabase Edge Function that extracts atomic thoughts from raw text via LLM
  • Three-stage pipeline: extract → deduplicate (fingerprint + semantic) → reconcile
  • Four reconciliation actions: add, skip, append_evidence, create_revision
  • Dry-run mode for previewing without mutations
  • Multi-provider support: Anthropic, OpenAI, OpenRouter with automatic fallback
  • All IDs are UUID/opaque strings throughout
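The dedup and reconcile stages can be sketched in isolation. This is an illustrative TypeScript sketch built only from the summary's stated actions and the thresholds quoted in the reviews (0.92 semantic-skip, 0.85 semantic-match); the names `fingerprint` and `decideAction` and the normalization rules are assumptions, not the PR's actual index.ts:

```typescript
type Action = "add" | "skip" | "append_evidence" | "create_revision";

// Stage 2a: fingerprint dedup. Normalize case, punctuation, and whitespace
// so trivially re-worded duplicates compare equal. (A real pipeline would
// hash this; the normalized string suffices to illustrate the comparison.)
function fingerprint(content: string): string {
  return content
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Stages 2b and 3: semantic dedup plus reconciliation.
function decideAction(
  similarity: number,       // cosine similarity to the nearest existing thought
  sameFingerprint: boolean, // exact fingerprint match found
  existingLen: number,      // length of the matched existing thought
  newLen: number,           // length of the newly extracted thought
): Action {
  if (sameFingerprint) return "skip";   // exact duplicate
  if (similarity > 0.92) return "skip"; // semantic near-duplicate
  if (similarity > 0.85) {
    // Close enough to reconcile rather than add a new thought.
    return existingLen >= newLen ? "append_evidence" : "create_revision";
  }
  return "add";                          // genuinely new
}
```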

Dependencies

Routes

  • POST /smart-ingest — Extract and reconcile (dry_run or immediate)
  • POST /smart-ingest/execute — Execute a previously dry-run job
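For illustration, the two routes might be exercised like this. The hostname is a placeholder, the x-brain-key header follows the auth pattern noted in the reviews, and the body fields (text, dry_run, job_id) are assumed shapes rather than the documented request schema:

```shell
# Dry-run: extract and reconcile without writes; response includes a job id
curl -X POST "https://YOUR-PROJECT.functions.supabase.co/smart-ingest" \
  -H "x-brain-key: $MCP_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Raw meeting notes ...", "dry_run": true}'

# Execute: commit a previously previewed dry-run job by id
curl -X POST "https://YOUR-PROJECT.functions.supabase.co/smart-ingest/execute" \
  -H "x-brain-key: $MCP_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "<uuid-from-dry-run-response>"}'
```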

Test plan

  • Dry-run extracts and reconciles without writes
  • Execute commits a dry-run job successfully
  • Exact duplicate → skip (fingerprint match)
  • Semantic near-duplicate (>0.92) → skip
  • Similar but richer existing → append_evidence
  • Similar but new has more info → create_revision
  • Job/item/thought references work with UUID IDs
  • Provider fallback chain works (Anthropic → OpenAI → OpenRouter)

Tested against a production instance with 75K+ thoughts.

🤖 Generated with Claude Code

alanshurafa and others added 3 commits March 21, 2026 22:09
LLM-powered document extraction with semantic deduplication, fingerprint
matching, and dry-run preview. Supports Anthropic, OpenAI, and OpenRouter
providers with automatic fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename "Steps" to "Step-by-step instructions" for OB1 review bot
- Replace relative links to schemas/ingestion-jobs (not in this branch)
  with plain text references to avoid broken link check failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OB1 review bot checks for lines starting with '1.' — convert bold
numbered steps to standard markdown numbered list format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@justfinethanku justfinethanku left a comment


Code Review: Smart Ingest Edge Function

Thank you for this contribution! This is a well-thought-out integration that adds valuable document ingestion capabilities to Open Brain. I've completed a thorough review against the OB1 contribution standards.

✅ What's Good

  1. Excellent documentation — The README is comprehensive with clear sections for prerequisites, step-by-step instructions, API reference, expected outcomes, and troubleshooting. The credential tracker is a nice touch.

  2. Clean code structure — Edge Function follows best practices with proper error handling, CORS headers, and environment variable usage (no hardcoded credentials).

  3. Multi-provider support — Automatic fallback chain (Anthropic → OpenAI → OpenRouter) provides flexibility and resilience.

  4. Dry-run workflow — The preview-before-commit pattern is user-friendly and prevents accidental data writes.

  5. Security — No dangerous SQL operations (DROP, TRUNCATE, unqualified DELETE), no hardcoded secrets, proper authentication via x-brain-key header.

  6. Remote MCP pattern — Correctly uses Supabase Edge Function deployment (not local Node.js server), complying with OB1 standards.

  7. Metadata valid — metadata.json has all required fields with correct types and values.

🔴 Blocking Issue: Missing Dependency

This PR depends on PR #98 (Ingestion Jobs schema), which is still OPEN.

The contribution references:

  • Tables: ingestion_jobs, ingestion_items
  • RPCs: append_thought_evidence (in addition to core upsert_thought and match_thoughts)

These are documented in Prerequisites and Step 1, and the troubleshooting section addresses the missing schema error. However, PR #98 must be merged first before this contribution can be tested or used by the community.

Recommendation:

  • Mark this PR as draft or blocked until #98 is merged
  • OR merge #98 first, then review and merge this PR

📋 Minor Suggestions (Non-Blocking)

  1. README improvements:

    • Consider adding a "What You'll Learn" or "Use Cases" section to help users understand when to use smart-ingest vs. other import methods
    • The API Reference section could include response schema examples
    • Step 3 says "Copy the contents of index.ts" — consider providing a direct curl/wget command to download the file (similar to other OB1 contributions)
  2. Code considerations:

    • The SEMANTIC_SKIP_THRESHOLD (0.92) and SEMANTIC_MATCH_THRESHOLD (0.85) are hardcoded. Consider documenting these as tunable parameters in a comment or README section
    • The MAX_THOUGHTS_PER_EXTRACTION (20) limit might be worth mentioning in the README's "What It Does" section
  3. Metadata:

    • The services field says "Anthropic API or OpenAI API or OpenRouter" — technically only one is required, but embeddings require OpenAI or OpenRouter (not Anthropic). This could be clearer, though the README does explain it correctly in Prerequisites.

✅ Verification Checklist

  • Folder structure correct (integrations/smart-ingest/)
  • Required files present (README.md, metadata.json, index.ts)
  • metadata.json valid and complete
  • No credentials or secrets
  • SQL safety (no dangerous operations)
  • README has Prerequisites, Step-by-step instructions, Expected Outcome, Troubleshooting
  • PR title format correct: [integrations] Smart ingest edge function
  • No binary files over 1MB
  • Remote MCP pattern (Edge Function, not local server)
  • All changes within contribution folder
  • Dependencies available — ❌ Depends on unmerged PR #98

Verdict: Significant changes needed

The contribution quality is excellent, but it cannot be merged until the dependency (PR #98 - Ingestion Jobs schema) is merged first. Once #98 lands, this PR will be ready to merge with only minor optional improvements.


Next steps:

  1. Merge PR #98 first
  2. Address the dependency blocker (either rebase or just wait)
  3. Optionally consider the minor suggestions above
  4. Re-request review

Great work overall! The implementation is solid, the documentation is thorough, and this will be a valuable addition to the OB1 ecosystem.


alanshurafa and others added 2 commits March 25, 2026 09:31
Add Use Cases section, document dedup thresholds with rationale,
clarify that embeddings require OpenAI/OpenRouter (not Anthropic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI requires tool audit guide link for integrations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@matthallett1
Collaborator

@claude review this PR


github-actions bot commented Mar 30, 2026

Claude finished @matthallett1's task in 1m 32s.


PR Review: Smart Ingest Edge Function

  • Read changed files (README, index.ts, metadata.json)
  • Review for alignment, clarity, safety, design patterns, scope
  • Post final review

Building on the existing review from @justfinethanku, which already covered the structural/mechanical checks well. I'll focus on design patterns, code quality, and issues requiring judgment.

Alignment

Strong fit. This is exactly the kind of capture pathway OB1 is built for — raw text in, atomic thoughts in the thoughts table via the established upsert_thought/match_thoughts RPCs. The dry-run/execute workflow is a genuine UX improvement for bulk ingestion that doesn't currently exist in the ecosystem.

Design Patterns

All the fundamentals are correct:

  • Remote Edge Function (not a local Node.js server) ✅
  • Writes via upsert_thought RPC, not direct SQL ✅
  • Semantic search via match_thoughts RPC ✅
  • JSONB metadata fields ✅
  • Environment variables for all credentials ✅
  • x-brain-key auth pattern ✅

Code Issues

🔴 Bug: content_fingerprint is lost during execute

In handleExecuteJob (line 742), content_fingerprint is set to "" when reconstructing the item from the database:

const fakeItem: IngestionItem = {
  content: item.extracted_content,
  content_fingerprint: "",  // ← empty!
  action: item.action,
  ...
};

When executeItem calls upsert_thought for an add action, it passes content_fingerprint: item.content_fingerprint — which will be the empty string. This means thoughts added via the execute route won't have a fingerprint, breaking future deduplication for those thoughts. The fingerprint computed during dry-run should be persisted in ingestion_items and restored here.
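A minimal sketch of one possible fix, assuming the stored item's extracted content is available at execute time. The names and the normalization are illustrative (the real fix should persist the dry-run fingerprint in ingestion_items and reuse it, with recomputation only as a fallback):

```typescript
interface StoredItem {
  extracted_content: string;
  content_fingerprint?: string | null; // present once dry-run persists it
  action: string;
}

// The normalization must match whatever index.ts actually uses;
// this one is illustrative only.
function computeFingerprint(content: string): string {
  return content.toLowerCase().replace(/\s+/g, " ").trim();
}

// When rebuilding an item for execute, prefer the persisted fingerprint
// and recompute as a fallback instead of passing "". Using || (rather
// than ??) also repairs an empty string left behind by the current bug.
function rebuildItem(item: StoredItem) {
  return {
    content: item.extracted_content,
    content_fingerprint:
      item.content_fingerprint || computeFingerprint(item.extracted_content),
    action: item.action,
  };
}
```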

🟡 Weak richness heuristic (line 466)

The create_revision vs append_evidence decision uses raw character count as a proxy for "richness":

if (existingContent.length >= thought.content.length) {
  return { ...base, action: "append_evidence", ... };
} else {
  return { ...base, action: "create_revision", ... };
}

This is fragile — a longer but vaguer existing thought will always win over a shorter but more precise new one. A word-count comparison would be minimally better, but a small LLM call to compare semantic richness would be ideal. This is worth calling out because it silently makes the wrong call in realistic cases (e.g., existing: "We discussed the API redesign at length and Sarah raised many valid points about the architecture" vs new: "Decision: switch to GraphQL, prototype by Friday").
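As a sketch of what a cheap improvement could look like, here is a richness score that counts specificity signals instead of characters. This is entirely hypothetical (arbitrary weights, not a proposed final heuristic); the small LLM comparison call suggested above would be more robust:

```typescript
// Hypothetical richness score: reward numbers/dates, capitalized terms,
// decision markers, and lexical variety rather than raw length.
function richnessScore(text: string): number {
  let score = 0;
  score += (text.match(/\b\d[\d:/.-]*\b/g) ?? []).length * 2;             // numbers and dates
  score += (text.match(/\b[A-Z][a-z]+\b/g) ?? []).length;                 // capitalized terms
  if (/\b(decision|deadline|todo|action item)\b/i.test(text)) score += 3; // decision markers
  score += new Set(text.toLowerCase().split(/\W+/).filter(Boolean)).size * 0.1; // lexical variety
  return score;
}
```

On the example above, the short "Decision: switch to GraphQL, prototype by Friday" scores higher than the longer but vaguer existing sentence, which is the ordering the character-count heuristic gets wrong.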

🟡 Auth is optional by default (line 799)

if (MCP_ACCESS_KEY && !isAuthorized(req)) {

If the user forgets to set MCP_ACCESS_KEY, the function is publicly accessible with no authentication. Combined with --no-verify-jwt in Step 5, the function would accept any request. The README should warn: "If you skip MCP_ACCESS_KEY, anyone with your function URL can trigger ingestion and read job results."
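A fail-closed variant, sketched below. checkAuth is a hypothetical name (the PR's code uses isAuthorized guarded by the MCP_ACCESS_KEY check); the point is to refuse service on a missing key rather than skip auth:

```typescript
// Fail-closed sketch: a missing or empty MCP_ACCESS_KEY is treated as a
// deployment misconfiguration (503), never as "auth disabled".
function checkAuth(
  accessKey: string | undefined, // MCP_ACCESS_KEY from the environment
  providedKey: string | null,    // x-brain-key request header
): { ok: boolean; status: number } {
  if (!accessKey) {
    return { ok: false, status: 503 }; // refuse to serve unauthenticated
  }
  if (providedKey !== accessKey) return { ok: false, status: 401 };
  return { ok: true, status: 200 };
}
```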

🟢 Minor: reprocess hash versioning uses LIKE

The version counter query (line 854) uses .like("input_hash", `${baseHash}%`). This works fine at personal scale but is worth a comment explaining why (it avoids a count query over potentially many rows).
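The versioning idea can be illustrated in isolation. This is a hypothetical in-memory helper; the real code does the counting with a Supabase .like() prefix query rather than a list:

```typescript
// Count prior jobs whose input_hash starts with the base hash, then
// suffix a version so a reprocess of the same input gets a distinct hash.
function nextVersionedHash(baseHash: string, existingHashes: string[]): string {
  const prior = existingHashes.filter((h) => h.startsWith(baseHash)).length;
  return prior === 0 ? baseHash : `${baseHash}:v${prior + 1}`;
}
```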

Dependency Blocker

Confirming what @justfinethanku flagged: this PR cannot be merged until PR #98 (ingestion-jobs schema) lands. The ingestion_jobs and ingestion_items tables, and the append_thought_evidence RPC, are all prerequisites. The README handles this correctly in Prerequisites and Troubleshooting, but the code won't work without #98 in place.

Summary

The implementation is solid and the documentation is excellent — especially the threshold table, use case list, and credential tracker. This is a genuinely useful addition that's well-aligned with OB1's patterns.

Blockers before merge:

  1. PR [schemas] Ingestion jobs schema for document ingest #98 must be merged first (existing blocker, already flagged)
  2. The content_fingerprint bug in handleExecuteJob should be fixed — it silently breaks deduplication for executed dry-run jobs

Worth addressing:

  • Document the auth-bypass risk when MCP_ACCESS_KEY is unset
  • Consider noting the richness heuristic's limitations in a code comment

Once #98 lands and the fingerprint bug is fixed, this is ready to merge.

Collaborator

@matthallett1 matthallett1 left a comment


Review — Smart Ingest Edge Function

This is the most ambitious contribution in the queue. 1,005 lines of TypeScript, multi-provider LLM support, semantic dedup with configurable thresholds, dry-run/execute workflow. The architecture is sound.

Issues to fix

1. Depends on #98 (ingestion-jobs schema) which has changes requested
This PR references ingestion_jobs and ingestion_items tables. PR #98 needs to be merged first. Please note this dependency in the PR description.

2. References RPCs that may not exist

  • append_thought_evidence RPC — where is this defined? Not in core OB1 schema.
  • match_thoughts RPC — needs pgvector similarity search function. Is this documented somewhere?

Users deploying this will hit errors if these RPCs don't exist. Either include the SQL definitions or document exactly which RPCs need to be in place.

3. Inconsistent extraction prompts across providers
callOpenAI wraps the output in {"thoughts": [...]} for JSON mode, but callAnthropic and callOpenRouter expect a raw JSON array. Different prompt formats across providers could produce different extraction results for the same input.

4. parseExtractedThoughts not shown in diff
The extraction functions call parseExtractedThoughts(raw) but I don't see this function defined in the visible portion of the diff. Is it in the truncated middle section?
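For reference, a tolerant parser along these lines would make the provider-format inconsistency in point 3 harmless regardless of which shape a provider emits. This is hypothetical code, not the PR's parseExtractedThoughts:

```typescript
// Accept either a bare JSON array or a {"thoughts": [...]} wrapper, so
// provider-specific output shapes yield identical extraction results.
function parseThoughts(raw: string): string[] {
  const parsed = JSON.parse(raw);
  const arr = Array.isArray(parsed) ? parsed : parsed?.thoughts;
  if (!Array.isArray(arr)) throw new Error("unexpected extraction shape");
  return arr.map((t) =>
    typeof t === "string" ? t : String((t as { content?: unknown })?.content ?? ""),
  );
}
```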

What's good

  • Four-way reconciliation (add/skip/append_evidence/create_revision) is well-designed
  • Content fingerprint normalization matches the dedup primitive
  • Dry-run with separate execute is the right pattern for destructive-ish operations
  • Multi-provider fallback chain (Anthropic → OpenAI → OpenRouter) is flexible
  • Idempotency check with versioned hashing for reprocess
  • Job persistence for auditability

Fix the dependency docs and missing RPC definitions, and this is a strong contribution.

Add Required RPCs table to README listing append_thought_evidence,
match_thoughts, and upsert_thought with their source PRs. Clarify
that PR NateBJones-Projects#98 (ingestion-jobs schema) must merge first.

Normalize extraction prompts across all three providers (Anthropic,
OpenAI, OpenRouter) to consistently request {"thoughts": [...]}
wrapper format. Parser already handles both formats as fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the integration Contribution: MCP extension or capture source label Apr 6, 2026
@alanshurafa
Contributor Author

Review feedback from @matthallett1 addressed in commit 50d9cbe:

  1. Dependency documented: Added explicit note that PR [schemas] Ingestion jobs schema for document ingest #98 (ingestion-jobs schema) must merge first. Updated prerequisites section.
  2. Required RPCs documented: Added table listing append_thought_evidence, match_thoughts, and upsert_thought with their source and purpose.
  3. Extraction prompts normalized: All three providers (Anthropic, OpenAI, OpenRouter) now use the same system prompt requesting {"thoughts": [...]} format. Provider-specific prompt suffixes removed. The parser already handles both bare arrays and wrapped objects as fallback.
  4. Re: parseExtractedThoughts — This function IS defined at line 305 of index.ts. It was not visible in the reviewer's truncated diff view but is present in the PR.

Ready for re-review once PR #98 merges.

@alanshurafa
Contributor Author

Closing this PR as part of the OB1 Alpha Milestone consolidation. This feature is being rebuilt as one of 12 clean, gate-compliant PRs that together form the alpha upgrade path. The consolidated PRs will be submitted once verified on the fork. See the full plan for details.

@alanshurafa alanshurafa closed this Apr 6, 2026
@alanshurafa alanshurafa reopened this Apr 6, 2026
@alanshurafa
Contributor Author

Closing as part of the OB1 Alpha Milestone consolidation. This feature is being rebuilt as one of 12 clean, gate-compliant PRs that together form the alpha upgrade path. The consolidated PRs will be submitted once verified on the fork.

@alanshurafa alanshurafa closed this Apr 6, 2026
