[integrations] Entity extraction worker #11

Open
alanshurafa wants to merge 6 commits into main from contrib/alanshurafa/entity-extraction-worker

@alanshurafa
Owner

Summary

  • Async Supabase Edge Function that drains the entity_extraction_queue to build a knowledge graph
  • Ported from ExoCortex production entity-extraction-worker with OB1 adaptations
  • Depends on schemas/knowledge-graph (PR [schemas] Knowledge graph tables and extraction trigger #5) for entities, edges, thought_entities, and queue tables

What It Does

Processes pending items from the extraction queue in batches. For each thought, calls an LLM to extract named entities (person, project, topic, tool, organization, place) and relationships (works_on, uses, related_to, etc.), then upserts into the graph tables.
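
As a rough illustration of the extraction step described above, the sketch below validates an LLM response before anything is written to the graph. The entity types are the six listed in this description; the JSON shape (`entities`, `relations`, `source`, `target`, `relation`) and the helper name `parseExtraction` are assumptions made for the example, not the worker's actual contract.

```typescript
// Entity types as listed in the PR description.
const ENTITY_TYPES = ["person", "project", "topic", "tool", "organization", "place"] as const;
type EntityType = (typeof ENTITY_TYPES)[number];

// Assumed response shape; the real worker may use different field names.
interface ExtractedEntity { name: string; type: EntityType; }
interface ExtractedRelation { source: string; target: string; relation: string; }
interface Extraction { entities: ExtractedEntity[]; relations: ExtractedRelation[]; }

// Hypothetical helper: parse raw LLM text and keep only well-formed items.
function parseExtraction(raw: string): Extraction {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    // Malformed JSON yields an empty extraction instead of throwing.
    return { entities: [], relations: [] };
  }
  const obj = (typeof data === "object" && data !== null ? data : {}) as Record<string, unknown>;
  const rawEntities = Array.isArray(obj.entities) ? obj.entities : [];
  const rawRelations = Array.isArray(obj.relations) ? obj.relations : [];

  const entities: ExtractedEntity[] = [];
  for (const e of rawEntities) {
    if (typeof e !== "object" || e === null) continue;
    const { name, type } = e as Record<string, unknown>;
    if (typeof name !== "string" || name.trim() === "") continue;
    if (typeof type !== "string" || !(ENTITY_TYPES as readonly string[]).includes(type)) continue;
    entities.push({ name: name.trim(), type: type as EntityType });
  }

  // Keep a relation only if both endpoints survived entity validation.
  const known = new Set(entities.map((e) => e.name));
  const relations: ExtractedRelation[] = [];
  for (const r of rawRelations) {
    if (typeof r !== "object" || r === null) continue;
    const { source, target, relation } = r as Record<string, unknown>;
    if (typeof source !== "string" || typeof target !== "string" || typeof relation !== "string") continue;
    if (!known.has(source) || !known.has(target)) continue;
    relations.push({ source, target, relation });
  }
  return { entities, relations };
}
```

Dropping relations whose endpoints failed validation is one possible policy; it keeps the edges table from referencing entities that were never upserted.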

Key Features

  • Batch processing with atomic queue claiming (no duplicate work)
  • Retry/backoff — up to 5 attempts before permanent failure
  • Dry-run mode — preview extractions without writing
  • Symmetric edge dedup — canonical ordering for co_occurs_with/related_to
  • System-generated skip — ignores thoughts with metadata.generated_by
  • OpenRouter-first LLM provider order (OB1 standard)
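
The symmetric edge dedup can be sketched as follows. This is a minimal illustration, assuming edges are identified by a pair of entity IDs plus a relation name; the `Edge` shape and the helpers `canonicalizeEdge` and `edgeKey` are hypothetical and not taken from the PR's code. Only the two relation names come from the feature list above.

```typescript
// Relations treated as undirected, per the feature list above.
const SYMMETRIC_RELATIONS = new Set(["co_occurs_with", "related_to"]);

// Hypothetical edge shape for illustration.
interface Edge { fromId: string; toId: string; relation: string; }

// For symmetric relations, order the endpoints so that (A, B) and (B, A)
// map to the same row. Directed relations are returned unchanged.
function canonicalizeEdge(edge: Edge): Edge {
  if (SYMMETRIC_RELATIONS.has(edge.relation) && edge.fromId > edge.toId) {
    return { fromId: edge.toId, toId: edge.fromId, relation: edge.relation };
  }
  return edge;
}

// A dedup key that could back a unique constraint or an upsert conflict target.
function edgeKey(edge: Edge): string {
  const c = canonicalizeEdge(edge);
  return `${c.fromId}|${c.relation}|${c.toId}`;
}

const forward = edgeKey({ fromId: "a", toId: "b", relation: "related_to" });
const reverse = edgeKey({ fromId: "b", toId: "a", relation: "related_to" });
console.log(forward === reverse); // true: both directions collapse to one key
```

Directed relations such as works_on keep their orientation, so "a works_on b" and "b works_on a" remain distinct edges.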

Files

| File | Lines | Purpose |
| --- | --- | --- |
| index.ts | 533 | Worker with queue management, LLM extraction, graph upserts |
| _shared/helpers.ts | 770 | Shared utilities (from enhanced-mcp) |
| _shared/config.ts | 204 | Constants and types |
| README.md | 170 | Setup guide with backfill SQL, API ref, troubleshooting |
| metadata.json | 18 | OB1 contribution metadata |
| deno.json | 5 | Deno import map |

Test plan

  • Verify all gate checks pass
  • Validate metadata.json against schema
  • Confirm README has prerequisites, numbered steps, expected outcome
  • Confirm "05-tool-audit" string in README
  • Deploy to test project, enqueue thoughts, run worker, verify graph tables populated

🤖 Generated with Claude Code

Async Supabase Edge Function that drains the entity_extraction_queue,
calling an LLM to extract people, projects, topics, tools, orgs, and
places from thoughts, then building a knowledge graph via entities,
edges, and thought_entities tables.

Features: batch processing with atomic claiming, retry/backoff with
poison-item handling (max 5 attempts), dry-run mode, symmetric edge
dedup, system-generated thought skipping.

OB1 adaptations: OpenRouter-first LLM provider order, wildcard CORS,
model constants from _shared/config.ts, thoughts table (not
brain_thoughts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a72f6cb8e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const dryRun = url.searchParams.get("dry_run") === "true";

// Step 1: Claim queue items
const claimed = await claimQueueItems(limit);

P1: Avoid claiming queue items during dry runs

When dry_run=true, the handler still executes claimQueueItems(limit), which updates queue rows to processing; later the dry-run branch exits without calling markComplete or markError. This means a preview request mutates production queue state and can leave items stuck in processing, so subsequent real runs will skip them until a manual reset.
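
One way to picture the issue is with an in-memory stand-in for the queue. The sketch below is an assumption-laden illustration, not the worker's code: the real `claimQueueItems` is async and talks to the database, while here both functions operate on a plain array, and `selectBatch` is a hypothetical helper name.

```typescript
type QueueStatus = "pending" | "processing" | "complete" | "failed";
interface QueueItem { thought_id: string; status: QueueStatus; }

// Read-only preview: returns copies of pending rows and leaves status untouched.
function peekQueueItems(queue: QueueItem[], limit: number): QueueItem[] {
  return queue
    .filter((item) => item.status === "pending")
    .slice(0, limit)
    .map((item) => ({ ...item }));
}

// Mutating claim: flips pending rows to "processing" and returns what it flipped.
function claimQueueItems(queue: QueueItem[], limit: number): QueueItem[] {
  const claimed: QueueItem[] = [];
  for (const item of queue) {
    if (claimed.length >= limit) break;
    if (item.status === "pending") {
      item.status = "processing";
      claimed.push({ ...item });
    }
  }
  return claimed;
}

// Hypothetical helper: a dry run must never take the mutating path.
function selectBatch(queue: QueueItem[], limit: number, dryRun: boolean): QueueItem[] {
  return dryRun ? peekQueueItems(queue, limit) : claimQueueItems(queue, limit);
}

const queue: QueueItem[] = [
  { thought_id: "t1", status: "pending" },
  { thought_id: "t2", status: "pending" },
];
selectBatch(queue, 10, true);
console.log(queue.map((item) => item.status)); // still ["pending", "pending"] after the preview
```

With the branch in place, a preview request reads the same rows a real run would claim, but nothing is left in processing afterwards.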


return [];
}

return pending;

P1: Return only queue rows actually claimed

This function returns the originally selected pending rows even though the claim update is a separate statement; under concurrent workers, one worker can select rows, fail to update any of them because another worker claimed first, and still process those thoughts. That creates duplicate extraction work and can inflate graph edge support counts.
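
The race can be reproduced with a small in-memory simulation. In the sketch below, `updateIfPending` is a hypothetical stand-in for a conditional update (roughly `UPDATE ... WHERE id IN (...) AND status = 'pending' RETURNING *`); the row shape and the `worker` field are assumptions for illustration.

```typescript
interface QueueRow { id: number; status: "pending" | "processing"; worker?: string; }

// Stand-in for a conditional update that returns only the rows it changed.
function updateIfPending(table: QueueRow[], ids: number[], worker: string): QueueRow[] {
  const updated: QueueRow[] = [];
  for (const row of table) {
    if (ids.includes(row.id) && row.status === "pending") {
      row.status = "processing";
      row.worker = worker;
      updated.push({ ...row });
    }
  }
  return updated;
}

const table: QueueRow[] = [
  { id: 1, status: "pending" },
  { id: 2, status: "pending" },
];

// Both workers run their SELECT before either runs its UPDATE.
const candidatesA = table.filter((r) => r.status === "pending").map((r) => r.id);
const candidatesB = table.filter((r) => r.status === "pending").map((r) => r.id);

// Worker A updates first and wins both rows; worker B's update matches nothing.
const claimedByA = updateIfPending(table, candidatesA, "A");
const claimedByB = updateIfPending(table, candidatesB, "B");

console.log(claimedByA.length, claimedByB.length); // 2 0
```

If worker B processed `candidatesB` instead of `claimedByB`, it would repeat worker A's extraction, which is the duplicate work described above. Processing only the rows returned by the update avoids that.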



if (thoughtError || !thought?.content) {
  console.error(`Failed to fetch thought ${item.thought_id}:`, thoughtError);
  if (!dryRun) await markError(item.thought_id, thoughtError?.message ?? "Thought not found", 0);

P2: Preserve retry count on thought fetch failures

On thought lookup failure, markError is always called with attemptCount hardcoded to 0, so attempt_count is repeatedly reset to 1 instead of incrementing across retries. For missing/deleted thoughts this prevents reaching MAX_ATTEMPTS, causing perpetual requeueing instead of eventual terminal failed status.
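
The effect of the hardcoded 0 can be shown with a pure state-transition function. This is a sketch under stated assumptions: `nextErrorState` and the `last_error` field are hypothetical names, and treating the fifth failed attempt as terminal is one reading of "up to 5 attempts before permanent failure".

```typescript
// Limit taken from the PR description.
const MAX_ATTEMPTS = 5;

// Hypothetical update payload for a failed queue item.
interface ErrorUpdate {
  attempt_count: number;
  status: "pending" | "failed";
  last_error: string;
}

// Compute the next queue state from the attempt count stored on the row.
function nextErrorState(previousAttempts: number, message: string): ErrorUpdate {
  const attempt_count = previousAttempts + 1;
  return {
    attempt_count,
    status: attempt_count >= MAX_ATTEMPTS ? "failed" : "pending",
    last_error: message,
  };
}

// Passing a constant 0 always yields attempt_count 1, so the item never fails.
console.log(nextErrorState(0, "Thought not found").status); // "pending", on every retry

// Passing the stored count lets the item reach the terminal state.
let attempts = 0;
let status: "pending" | "failed" = "pending";
while (status === "pending") {
  const next = nextErrorState(attempts, "Thought not found");
  attempts = next.attempt_count;
  status = next.status;
}
console.log(attempts); // 5
```

In the handler, the equivalent change would be to pass the queue row's current attempt count to markError rather than the literal 0.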


alanshurafa and others added 2 commits April 6, 2026 13:33
Add blank lines around headings (MD022), fenced code blocks (MD031),
and between adjacent blockquotes (MD028). Fix broken link fragment
(MD051) and remove extra blank line (MD012). No content changes.

These errors were blocking CI on all open PRs since the lint check
runs repo-wide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each section's numbered list now restarts at 1 instead of continuing
the global count (3-14), satisfying markdownlint MD029/ol-prefix rule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the documentation and recipe labels Apr 6, 2026
alanshurafa and others added 3 commits April 6, 2026 13:53
- dry_run now uses peekQueueItems() (read-only SELECT) instead of
  claimQueueItems(), so items stay "pending" during preview runs
- claimQueueItems() returns only rows actually claimed via .select(),
  preventing race conditions where concurrent workers see stale results
- markError() clears started_at and worker_version when resetting to
  "pending" so retryable items don't appear stale in monitoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Labels

documentation (Improvements or additions to documentation), integration, recipe
