
[integrations] Smart ingest edge function#100

Closed
alanshurafa wants to merge 6 commits into NateBJones-Projects:main from alanshurafa:contrib/alanshurafa/smart-ingest

Conversation

@alanshurafa
Contributor

Summary

  • Supabase Edge Function that extracts atomic thoughts from raw text via LLM
  • Three-stage pipeline: extract → deduplicate (fingerprint + semantic) → reconcile
  • Four reconciliation actions: add, skip, append_evidence, create_revision
  • Dry-run mode for previewing without mutations
  • Multi-provider support: Anthropic, OpenAI, OpenRouter with automatic fallback
  • All IDs are UUID/opaque strings throughout
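The dedup and reconcile stages can be sketched in isolation. This is an illustrative TypeScript sketch built only from the summary's stated actions and the thresholds quoted in the reviews (0.92 semantic-skip, 0.85 semantic-match); the names `fingerprint` and `decideAction` and the normalization rules are assumptions, not the PR's actual index.ts:

```typescript
type Action = "add" | "skip" | "append_evidence" | "create_revision";

// Stage 2a: fingerprint dedup. Normalize case, punctuation, and whitespace
// so trivially re-worded duplicates compare equal. (A real pipeline would
// hash this; the normalized string suffices to illustrate the comparison.)
function fingerprint(content: string): string {
  return content
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Stages 2b and 3: semantic dedup plus reconciliation.
function decideAction(
  similarity: number,       // cosine similarity to the nearest existing thought
  sameFingerprint: boolean, // exact fingerprint match found
  existingLen: number,      // length of the matched existing thought
  newLen: number,           // length of the newly extracted thought
): Action {
  if (sameFingerprint) return "skip";   // exact duplicate
  if (similarity > 0.92) return "skip"; // semantic near-duplicate
  if (similarity > 0.85) {
    // Close enough to reconcile rather than add a new thought.
    return existingLen >= newLen ? "append_evidence" : "create_revision";
  }
  return "add";                          // genuinely new
}
```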

Dependencies

Routes

  • POST /smart-ingest — Extract and reconcile (dry_run or immediate)
  • POST /smart-ingest/execute — Execute a previously dry-run job
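For illustration, the two routes might be exercised like this. The hostname is a placeholder, the x-brain-key header follows the auth pattern noted in the reviews, and the body fields (text, dry_run, job_id) are assumed shapes rather than the documented request schema:

```shell
# Dry-run: extract and reconcile without writes; response includes a job id
curl -X POST "https://YOUR-PROJECT.functions.supabase.co/smart-ingest" \
  -H "x-brain-key: $MCP_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Raw meeting notes ...", "dry_run": true}'

# Execute: commit a previously previewed dry-run job by id
curl -X POST "https://YOUR-PROJECT.functions.supabase.co/smart-ingest/execute" \
  -H "x-brain-key: $MCP_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "<uuid-from-dry-run-response>"}'
```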

Test plan

  • Dry-run extracts and reconciles without writes
  • Execute commits a dry-run job successfully
  • Exact duplicate → skip (fingerprint match)
  • Semantic near-duplicate (>0.92) → skip
  • Similar but richer existing → append_evidence
  • Similar but new has more info → create_revision
  • Job/item/thought references work with UUID IDs
  • Provider fallback chain works (Anthropic → OpenAI → OpenRouter)

Tested against a production instance with 75K+ thoughts.

🤖 Generated with Claude Code

alanshurafa and others added 3 commits March 21, 2026 22:09
LLM-powered document extraction with semantic deduplication, fingerprint
matching, and dry-run preview. Supports Anthropic, OpenAI, and OpenRouter
providers with automatic fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename "Steps" to "Step-by-step instructions" for OB1 review bot
- Replace relative links to schemas/ingestion-jobs (not in this branch)
  with plain text references to avoid broken link check failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OB1 review bot checks for lines starting with '1.' — convert bold
numbered steps to standard markdown numbered list format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@justfinethanku justfinethanku left a comment


Code Review: Smart Ingest Edge Function

Thank you for this contribution! This is a well-thought-out integration that adds valuable document ingestion capabilities to Open Brain. I've completed a thorough review against the OB1 contribution standards.

✅ What's Good

  1. Excellent documentation — The README is comprehensive with clear sections for prerequisites, step-by-step instructions, API reference, expected outcomes, and troubleshooting. The credential tracker is a nice touch.

  2. Clean code structure — Edge Function follows best practices with proper error handling, CORS headers, and environment variable usage (no hardcoded credentials).

  3. Multi-provider support — Automatic fallback chain (Anthropic → OpenAI → OpenRouter) provides flexibility and resilience.

  4. Dry-run workflow — The preview-before-commit pattern is user-friendly and prevents accidental data writes.

  5. Security — No dangerous SQL operations (DROP, TRUNCATE, unqualified DELETE), no hardcoded secrets, proper authentication via x-brain-key header.

  6. Remote MCP pattern — Correctly uses Supabase Edge Function deployment (not local Node.js server), complying with OB1 standards.

  7. Metadata valid — metadata.json has all required fields with correct types and values.

🔴 Blocking Issue: Missing Dependency

This PR depends on PR #98 (Ingestion Jobs schema), which is still OPEN.

The contribution references:

  • Tables: ingestion_jobs, ingestion_items
  • RPCs: append_thought_evidence (in addition to core upsert_thought and match_thoughts)

These are documented in Prerequisites and Step 1, and the troubleshooting section addresses the missing schema error. However, PR #98 must be merged first before this contribution can be tested or used by the community.

Recommendation:

  • Mark this PR as draft or blocked until #98 is merged
  • OR merge #98 first, then review and merge this PR

📋 Minor Suggestions (Non-Blocking)

  1. README improvements:

    • Consider adding a "What You'll Learn" or "Use Cases" section to help users understand when to use smart-ingest vs. other import methods
    • The API Reference section could include response schema examples
    • Step 3 says "Copy the contents of index.ts" — consider providing a direct curl/wget command to download the file (similar to other OB1 contributions)
  2. Code considerations:

    • The SEMANTIC_SKIP_THRESHOLD (0.92) and SEMANTIC_MATCH_THRESHOLD (0.85) are hardcoded. Consider documenting these as tunable parameters in a comment or README section
    • The MAX_THOUGHTS_PER_EXTRACTION (20) limit might be worth mentioning in the README's "What It Does" section
  3. Metadata:

    • The services field says "Anthropic API or OpenAI API or OpenRouter" — technically only one is required, but embeddings require OpenAI or OpenRouter (not Anthropic). This could be clearer, though the README does explain it correctly in Prerequisites.

✅ Verification Checklist

  • Folder structure correct (integrations/smart-ingest/)
  • Required files present (README.md, metadata.json, index.ts)
  • metadata.json valid and complete
  • No credentials or secrets
  • SQL safety (no dangerous operations)
  • README has Prerequisites, Step-by-step instructions, Expected Outcome, Troubleshooting
  • PR title format correct: [integrations] Smart ingest edge function
  • No binary files over 1MB
  • Remote MCP pattern (Edge Function, not local server)
  • All changes within contribution folder
  • Dependencies available — ❌ Depends on unmerged PR #98

Verdict: Significant changes needed

The contribution quality is excellent, but it cannot be merged until the dependency (PR #98 - Ingestion Jobs schema) is merged first. Once #98 lands, this PR will be ready to merge with only minor optional improvements.


Next steps:

  1. Merge PR #98 first
  2. Address the dependency blocker (either rebase or just wait)
  3. Optionally consider the minor suggestions above
  4. Re-request review

Great work overall! The implementation is solid, the documentation is thorough, and this will be a valuable addition to the OB1 ecosystem.


alanshurafa and others added 2 commits March 25, 2026 09:31
Add Use Cases section, document dedup thresholds with rationale,
clarify that embeddings require OpenAI/OpenRouter (not Anthropic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI requires tool audit guide link for integrations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@matthallett1
Collaborator

@claude review this PR


github-actions bot commented Mar 30, 2026

Claude finished @matthallett1's task in 1m 32s.


PR Review: Smart Ingest Edge Function

  • Read changed files (README, index.ts, metadata.json)
  • Review for alignment, clarity, safety, design patterns, scope
  • Post final review

Building on the existing review from @justfinethanku, which already covered the structural/mechanical checks well. I'll focus on design patterns, code quality, and issues requiring judgment.

Alignment

Strong fit. This is exactly the kind of capture pathway OB1 is built for — raw text in, atomic thoughts in the thoughts table via the established upsert_thought/match_thoughts RPCs. The dry-run/execute workflow is a genuine UX improvement for bulk ingestion that doesn't currently exist in the ecosystem.

Design Patterns

All the fundamentals are correct:

  • Remote Edge Function (not a local Node.js server) ✅
  • Writes via upsert_thought RPC, not direct SQL ✅
  • Semantic search via match_thoughts RPC ✅
  • JSONB metadata fields ✅
  • Environment variables for all credentials ✅
  • x-brain-key auth pattern ✅

Code Issues

🔴 Bug: content_fingerprint is lost during execute

In handleExecuteJob (line 742), content_fingerprint is set to "" when reconstructing the item from the database:

const fakeItem: IngestionItem = {
  content: item.extracted_content,
  content_fingerprint: "",  // ← empty!
  action: item.action,
  ...
};

When executeItem calls upsert_thought for an add action, it passes content_fingerprint: item.content_fingerprint — which will be the empty string. This means thoughts added via the execute route won't have a fingerprint, breaking future deduplication for those thoughts. The fingerprint computed during dry-run should be persisted in ingestion_items and restored here.
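A minimal sketch of one possible fix, assuming the stored item's extracted content is available at execute time. The names and the normalization are illustrative (the real fix should persist the dry-run fingerprint in ingestion_items and reuse it, with recomputation only as a fallback):

```typescript
interface StoredItem {
  extracted_content: string;
  content_fingerprint?: string | null; // present once dry-run persists it
  action: string;
}

// The normalization must match whatever index.ts actually uses;
// this one is illustrative only.
function computeFingerprint(content: string): string {
  return content.toLowerCase().replace(/\s+/g, " ").trim();
}

// When rebuilding an item for execute, prefer the persisted fingerprint
// and recompute as a fallback instead of passing "". Using || (rather
// than ??) also repairs an empty string left behind by the current bug.
function rebuildItem(item: StoredItem) {
  return {
    content: item.extracted_content,
    content_fingerprint:
      item.content_fingerprint || computeFingerprint(item.extracted_content),
    action: item.action,
  };
}
```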

🟡 Weak richness heuristic (line 466)

The create_revision vs append_evidence decision uses raw character count as a proxy for "richness":

if (existingContent.length >= thought.content.length) {
  return { ...base, action: "append_evidence", ... };
} else {
  return { ...base, action: "create_revision", ... };
}

This is fragile — a longer but vaguer existing thought will always win over a shorter but more precise new one. A word-count comparison would be minimally better, but a small LLM call to compare semantic richness would be ideal. This is worth calling out because it silently makes the wrong call in realistic cases (e.g., existing: "We discussed the API redesign at length and Sarah raised many valid points about the architecture" vs new: "Decision: switch to GraphQL, prototype by Friday").
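As a sketch of what a cheap improvement could look like, here is a richness score that counts specificity signals instead of characters. This is entirely hypothetical (arbitrary weights, not a proposed final heuristic); the small LLM comparison call suggested above would be more robust:

```typescript
// Hypothetical richness score: reward numbers/dates, capitalized terms,
// decision markers, and lexical variety rather than raw length.
function richnessScore(text: string): number {
  let score = 0;
  score += (text.match(/\b\d[\d:/.-]*\b/g) ?? []).length * 2;             // numbers and dates
  score += (text.match(/\b[A-Z][a-z]+\b/g) ?? []).length;                 // capitalized terms
  if (/\b(decision|deadline|todo|action item)\b/i.test(text)) score += 3; // decision markers
  score += new Set(text.toLowerCase().split(/\W+/).filter(Boolean)).size * 0.1; // lexical variety
  return score;
}
```

On the example above, the short "Decision: switch to GraphQL, prototype by Friday" scores higher than the longer but vaguer existing sentence, which is the ordering the character-count heuristic gets wrong.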

🟡 Auth is optional by default (line 799)

if (MCP_ACCESS_KEY && !isAuthorized(req)) {

If the user forgets to set MCP_ACCESS_KEY, the function is publicly accessible with no authentication. Combined with --no-verify-jwt in Step 5, the function would accept any request. The README should warn: "If you skip MCP_ACCESS_KEY, anyone with your function URL can trigger ingestion and read job results."
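A fail-closed variant, sketched below. checkAuth is a hypothetical name (the PR's code uses isAuthorized guarded by the MCP_ACCESS_KEY check); the point is to refuse service on a missing key rather than skip auth:

```typescript
// Fail-closed sketch: a missing or empty MCP_ACCESS_KEY is treated as a
// deployment misconfiguration (503), never as "auth disabled".
function checkAuth(
  accessKey: string | undefined, // MCP_ACCESS_KEY from the environment
  providedKey: string | null,    // x-brain-key request header
): { ok: boolean; status: number } {
  if (!accessKey) {
    return { ok: false, status: 503 }; // refuse to serve unauthenticated
  }
  if (providedKey !== accessKey) return { ok: false, status: 401 };
  return { ok: true, status: 200 };
}
```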

🟢 Minor: reprocess hash versioning uses LIKE

The version counter query (line 854) uses .like("input_hash", `${baseHash}%`). This works fine at personal scale but is worth a comment explaining why (it avoids a count query over potentially many rows).
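The versioning idea can be illustrated in isolation. This is a hypothetical in-memory helper; the real code does the counting with a Supabase .like() prefix query rather than a list:

```typescript
// Count prior jobs whose input_hash starts with the base hash, then
// suffix a version so a reprocess of the same input gets a distinct hash.
function nextVersionedHash(baseHash: string, existingHashes: string[]): string {
  const prior = existingHashes.filter((h) => h.startsWith(baseHash)).length;
  return prior === 0 ? baseHash : `${baseHash}:v${prior + 1}`;
}
```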

Dependency Blocker

Confirming what @justfinethanku flagged: this PR cannot be merged until PR #98 (ingestion-jobs schema) lands. The ingestion_jobs and ingestion_items tables, and the append_thought_evidence RPC, are all prerequisites. The README handles this correctly in Prerequisites and Troubleshooting, but the code won't work without #98 in place.

Summary

The implementation is solid and the documentation is excellent — especially the threshold table, use case list, and credential tracker. This is a genuinely useful addition that's well-aligned with OB1's patterns.

Blockers before merge:

  1. PR [schemas] Ingestion jobs schema for document ingest #98 must be merged first (existing blocker, already flagged)
  2. The content_fingerprint bug in handleExecuteJob should be fixed — it silently breaks deduplication for executed dry-run jobs

Worth addressing:

  • Document the auth-bypass risk when MCP_ACCESS_KEY is unset
  • Consider noting the richness heuristic's limitations in a code comment

Once #98 lands and the fingerprint bug is fixed, this is ready to merge.

Collaborator

@matthallett1 matthallett1 left a comment


Review — Smart Ingest Edge Function

This is the most ambitious contribution in the queue. 1,005 lines of TypeScript, multi-provider LLM support, semantic dedup with configurable thresholds, dry-run/execute workflow. The architecture is sound.

Issues to fix

1. Depends on #98 (ingestion-jobs schema) which has changes requested
This PR references ingestion_jobs and ingestion_items tables. PR #98 needs to be merged first. Please note this dependency in the PR description.

2. References RPCs that may not exist

  • append_thought_evidence RPC — where is this defined? Not in core OB1 schema.
  • match_thoughts RPC — needs pgvector similarity search function. Is this documented somewhere?

Users deploying this will hit errors if these RPCs don't exist. Either include the SQL definitions or document exactly which RPCs need to be in place.

3. Inconsistent extraction prompts across providers
callOpenAI wraps the output in {"thoughts": [...]} for JSON mode, but callAnthropic and callOpenRouter expect a raw JSON array. Different prompt formats across providers could produce different extraction results for the same input.

4. parseExtractedThoughts not shown in diff
The extraction functions call parseExtractedThoughts(raw) but I don't see this function defined in the visible portion of the diff. Is it in the truncated middle section?
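For reference, a tolerant parser along these lines would make the provider-format inconsistency in point 3 harmless regardless of which shape a provider emits. This is hypothetical code, not the PR's parseExtractedThoughts:

```typescript
// Accept either a bare JSON array or a {"thoughts": [...]} wrapper, so
// provider-specific output shapes yield identical extraction results.
function parseThoughts(raw: string): string[] {
  const parsed = JSON.parse(raw);
  const arr = Array.isArray(parsed) ? parsed : parsed?.thoughts;
  if (!Array.isArray(arr)) throw new Error("unexpected extraction shape");
  return arr.map((t) =>
    typeof t === "string" ? t : String((t as { content?: unknown })?.content ?? ""),
  );
}
```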

What's good

  • Four-way reconciliation (add/skip/append_evidence/create_revision) is well-designed
  • Content fingerprint normalization matches the dedup primitive
  • Dry-run with separate execute is the right pattern for destructive-ish operations
  • Multi-provider fallback chain (Anthropic → OpenAI → OpenRouter) is flexible
  • Idempotency check with versioned hashing for reprocess
  • Job persistence for auditability

Fix the dependency docs and missing RPC definitions, and this is a strong contribution.

Add Required RPCs table to README listing append_thought_evidence,
match_thoughts, and upsert_thought with their source PRs. Clarify
that PR NateBJones-Projects#98 (ingestion-jobs schema) must merge first.

Normalize extraction prompts across all three providers (Anthropic,
OpenAI, OpenRouter) to consistently request {"thoughts": [...]}
wrapper format. Parser already handles both formats as fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the integration Contribution: MCP extension or capture source label Apr 6, 2026
@alanshurafa
Contributor Author

Review feedback from @matthallett1 addressed in commit 50d9cbe:

  1. Dependency documented: Added explicit note that PR [schemas] Ingestion jobs schema for document ingest #98 (ingestion-jobs schema) must merge first. Updated prerequisites section.
  2. Required RPCs documented: Added table listing append_thought_evidence, match_thoughts, and upsert_thought with their source and purpose.
  3. Extraction prompts normalized: All three providers (Anthropic, OpenAI, OpenRouter) now use the same system prompt requesting {"thoughts": [...]} format. Provider-specific prompt suffixes removed. The parser already handles both bare arrays and wrapped objects as fallback.
  4. Re: parseExtractedThoughts — This function IS defined at line 305 of index.ts. It was not visible in the reviewer's truncated diff view but is present in the PR.

Ready for re-review once PR #98 merges.

@alanshurafa
Contributor Author

Closing this PR as part of the OB1 Alpha Milestone consolidation. This feature is being rebuilt as one of 12 clean, gate-compliant PRs that together form the alpha upgrade path. The consolidated PRs will be submitted once verified on the fork. See the full plan for details.

@alanshurafa alanshurafa closed this Apr 6, 2026
@alanshurafa alanshurafa reopened this Apr 6, 2026
@alanshurafa
Contributor Author

Closing as part of the OB1 Alpha Milestone consolidation. This feature is being rebuilt as one of 12 clean, gate-compliant PRs that together form the alpha upgrade path. The consolidated PRs will be submitted once verified on the fork.

@alanshurafa alanshurafa closed this Apr 6, 2026
