[integrations] Entity extraction worker by alanshurafa · Pull Request #11 · alanshurafa/OB1

alanshurafa · 2026-04-06T17:04:33Z

Summary

Async Supabase Edge Function that drains the entity_extraction_queue to build a knowledge graph
Ported from ExoCortex production entity-extraction-worker with OB1 adaptations
Depends on schemas/knowledge-graph (PR #5, "[schemas] Knowledge graph tables and extraction trigger") for the entities, edges, thought_entities, and queue tables

What It Does

Processes pending items from the extraction queue in batches. For each thought, calls an LLM to extract named entities (person, project, topic, tool, organization, place) and relationships (works_on, uses, related_to, etc.), then upserts into the graph tables.
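To make the extraction step concrete, here is a minimal sketch of validating an LLM reply before anything is upserted. The entity types come from the description above; the `Extraction` shape, the JSON field names, and `parseExtraction` are assumptions for illustration and are not taken from `index.ts`.

```typescript
// Entity types named in the PR description.
const ENTITY_TYPES = ["person", "project", "topic", "tool", "organization", "place"] as const;
type EntityType = (typeof ENTITY_TYPES)[number];

// Hypothetical shapes for illustration; the worker's real types may differ.
interface ExtractedEntity { name: string; type: EntityType }
interface ExtractedRelation { source: string; target: string; relation: string }
interface Extraction { entities: ExtractedEntity[]; relations: ExtractedRelation[] }

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function isEntityType(value: unknown): value is EntityType {
  return typeof value === "string" &&
    (ENTITY_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Parse the raw LLM reply, keeping only well-formed entities and the
// relations whose endpoints both survived validation.
function parseExtraction(raw: string): Extraction {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return { entities: [], relations: [] }; // unparseable reply: extract nothing
  }
  const root: Record<string, unknown> = isRecord(data) ? data : {};

  const entities: ExtractedEntity[] = [];
  const rawEntities: unknown[] = Array.isArray(root.entities) ? root.entities : [];
  for (const e of rawEntities) {
    if (!isRecord(e)) continue;
    const { name, type } = e;
    if (typeof name === "string" && name.trim() !== "" && isEntityType(type)) {
      entities.push({ name: name.trim(), type });
    }
  }

  const known = new Set(entities.map((e) => e.name));
  const relations: ExtractedRelation[] = [];
  const rawRelations: unknown[] = Array.isArray(root.relations) ? root.relations : [];
  for (const r of rawRelations) {
    if (!isRecord(r)) continue;
    const { source, target, relation } = r;
    if (
      typeof source === "string" && typeof target === "string" &&
      typeof relation === "string" && known.has(source) && known.has(target)
    ) {
      relations.push({ source, target, relation });
    }
  }
  return { entities, relations };
}
```

Dropping relations that point at rejected entities keeps the later upsert from creating edges with dangling endpoints; whether the worker does this is not stated in the PR.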

Key Features

Batch processing with atomic queue claiming (no duplicate work)
Retry/backoff — up to 5 attempts before permanent failure
Dry-run mode — preview extractions without writing
Symmetric edge dedup — canonical ordering for co_occurs_with/related_to
System-generated skip — ignores thoughts with metadata.generated_by
OpenRouter-first LLM provider order (OB1 standard)
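The symmetric edge dedup can be sketched as follows. This assumes numeric entity ids and an `Edge` shape invented for the example; only the relation names `co_occurs_with` and `related_to` come from the feature list.

```typescript
// Relations where (a, b) and (b, a) mean the same thing.
const SYMMETRIC_RELATIONS = new Set<string>(["co_occurs_with", "related_to"]);

// Hypothetical edge shape; the real table columns may be named differently.
interface Edge {
  fromId: number;
  toId: number;
  relation: string;
}

// For symmetric relations, always store the smaller id first so both
// orientations collapse onto one row. Directed relations are left untouched.
function canonicalizeEdge(edge: Edge): Edge {
  if (SYMMETRIC_RELATIONS.has(edge.relation) && edge.fromId > edge.toId) {
    return { ...edge, fromId: edge.toId, toId: edge.fromId };
  }
  return edge;
}

// A key suitable for in-memory dedup before upserting.
function edgeKey(edge: Edge): string {
  const c = canonicalizeEdge(edge);
  return `${c.fromId}:${c.relation}:${c.toId}`;
}
```

With this ordering, a unique constraint on the three columns is enough to prevent mirrored duplicates of symmetric edges, while directed relations such as `works_on` keep their direction.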

Files

File	Lines	Purpose
`index.ts`	533	Worker with queue management, LLM extraction, graph upserts
`_shared/helpers.ts`	770	Shared utilities (from enhanced-mcp)
`_shared/config.ts`	204	Constants and types
`README.md`	170	Setup guide with backfill SQL, API ref, troubleshooting
`metadata.json`	18	OB1 contribution metadata
`deno.json`	5	Deno import map

Test plan

Verify all gate checks pass
Validate metadata.json against schema
Confirm README has prerequisites, numbered steps, expected outcome
Confirm "05-tool-audit" string in README
Deploy to test project, enqueue thoughts, run worker, verify graph tables populated

🤖 Generated with Claude Code

Async Supabase Edge Function that drains the entity_extraction_queue, calling an LLM to extract people, projects, topics, tools, orgs, and places from thoughts, then building a knowledge graph via entities, edges, and thought_entities tables.

Features: batch processing with atomic claiming, retry/backoff with poison-item handling (max 5 attempts), dry-run mode, symmetric edge dedup, system-generated thought skipping.

OB1 adaptations: OpenRouter-first LLM provider order, wildcard CORS, model constants from _shared/config.ts, thoughts table (not brain_thoughts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
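The system-generated skip mentioned in this commit message could look roughly like the check below. The `Thought` interface is an assumption for the example, and treating an empty string as "not generated" is a guess rather than documented behaviour.

```typescript
// Hypothetical minimal view of a row in the thoughts table.
interface Thought {
  id: string;
  content: string;
  metadata?: Record<string, unknown> | null;
}

// A thought counts as system-generated when metadata.generated_by is set.
function isSystemGenerated(thought: Thought): boolean {
  const generatedBy = thought.metadata?.generated_by;
  return generatedBy !== undefined && generatedBy !== null && generatedBy !== "";
}

// Keep only thoughts that should go through entity extraction.
function filterExtractable(thoughts: Thought[]): Thought[] {
  return thoughts.filter((t) => !isSystemGenerated(t));
}
```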

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a72f6cb8e


chatgpt-codex-connector · 2026-04-06T17:08:05Z

integrations/entity-extraction-worker/index.ts

+  const dryRun = url.searchParams.get("dry_run") === "true";
+
+  // Step 1: Claim queue items
+  const claimed = await claimQueueItems(limit);


Avoid claiming queue items during dry runs

When dry_run=true, the handler still executes claimQueueItems(limit), which updates queue rows to processing; later the dry-run branch exits without calling markComplete or markError. This means a preview request mutates production queue state and can leave items stuck in processing, so subsequent real runs will skip them until a manual reset.
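One way to address this is to route dry runs through a read-only path. The sketch below models the queue as an in-memory array so the difference is visible without a database; `peekQueueItems` is the name used in the follow-up commit, while `selectItems` and the row shape are assumptions.

```typescript
type QueueStatus = "pending" | "processing" | "failed";

interface QueueItem {
  thought_id: string;
  status: QueueStatus;
}

// Read-only: returns copies of pending items and never changes queue state.
function peekQueueItems(queue: QueueItem[], limit: number): QueueItem[] {
  return queue
    .filter((item) => item.status === "pending")
    .slice(0, limit)
    .map((item) => ({ ...item }));
}

// Mutating: flips pending items to "processing" and returns what it claimed.
function claimQueueItems(queue: QueueItem[], limit: number): QueueItem[] {
  const claimed: QueueItem[] = [];
  for (const item of queue) {
    if (claimed.length >= limit) break;
    if (item.status === "pending") {
      item.status = "processing";
      claimed.push({ ...item });
    }
  }
  return claimed;
}

// Dry runs peek; real runs claim.
function selectItems(queue: QueueItem[], limit: number, dryRun: boolean): QueueItem[] {
  return dryRun ? peekQueueItems(queue, limit) : claimQueueItems(queue, limit);
}
```

Because the dry-run branch never writes, no cleanup call is needed on exit and a preview cannot strand items in `processing`.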


chatgpt-codex-connector · 2026-04-06T17:08:05Z

integrations/entity-extraction-worker/index.ts

+    return [];
+  }
+
+  return pending;


Return only queue rows actually claimed

This function returns the originally selected pending rows even though the claim update is a separate statement; under concurrent workers, one worker can select rows, fail to update any of them because another worker claimed first, and still process those thoughts. That creates duplicate extraction work and can inflate graph edge support counts.
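The race is easier to see in a small simulation. Below, two workers read the same pending ids, then each runs a conditional update that only touches rows still pending and returns exactly the rows it changed (the in-memory analogue of `UPDATE ... RETURNING`). All names here are illustrative and not taken from the worker.

```typescript
type Status = "pending" | "processing";

interface QueueRow {
  id: number;
  status: Status;
  claimedBy: string | null;
}

// Step 1: a plain read. Concurrent workers may get the same ids back.
function selectPendingIds(table: QueueRow[], limit: number): number[] {
  return table
    .filter((row) => row.status === "pending")
    .slice(0, limit)
    .map((row) => row.id);
}

// Step 2: conditional update. Only rows that are still pending are changed,
// and only those changed rows are returned to the caller.
function claimByIds(table: QueueRow[], ids: number[], worker: string): QueueRow[] {
  const claimed: QueueRow[] = [];
  for (const row of table) {
    if (ids.indexOf(row.id) !== -1 && row.status === "pending") {
      row.status = "processing";
      row.claimedBy = worker;
      claimed.push({ ...row });
    }
  }
  return claimed;
}

const table: QueueRow[] = [
  { id: 1, status: "pending", claimedBy: null },
  { id: 2, status: "pending", claimedBy: null },
];

const idsA = selectPendingIds(table, 2);
const idsB = selectPendingIds(table, 2); // same ids; stale once A has claimed
const claimedA = claimByIds(table, idsA, "worker-a"); // both rows
const claimedB = claimByIds(table, idsB, "worker-b"); // no rows

console.log(claimedA.length, claimedB.length);
```

A worker that processes `idsB` instead of `claimedB` would redo worker A's thoughts, which is the duplicate-work scenario described above.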


chatgpt-codex-connector · 2026-04-06T17:08:06Z

integrations/entity-extraction-worker/index.ts

+
+    if (thoughtError || !thought?.content) {
+      console.error(`Failed to fetch thought ${item.thought_id}:`, thoughtError);
+      if (!dryRun) await markError(item.thought_id, thoughtError?.message ?? "Thought not found", 0);


Preserve retry count on thought fetch failures

On thought lookup failure, markError is always called with attemptCount hardcoded to 0, so attempt_count is repeatedly reset to 1 instead of incrementing across retries. For missing/deleted thoughts this prevents reaching MAX_ATTEMPTS, causing perpetual requeueing instead of eventual terminal failed status.
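A sketch of the intended bookkeeping, written as a pure function so it can be checked in isolation. `MAX_ATTEMPTS`, `attempt_count`, `started_at`, and `worker_version` are names that appear in this PR; `last_error` and `applyError` are hypothetical, and the real `markError` writes to the database instead of returning a value.

```typescript
const MAX_ATTEMPTS = 5; // "up to 5 attempts" per the PR description

type QueueStatus = "pending" | "processing" | "failed";

interface QueueItem {
  thought_id: string;
  status: QueueStatus;
  attempt_count: number;
  last_error: string | null; // hypothetical column name
  started_at: string | null;
  worker_version: string | null;
}

// Derive the next queue state from the item's stored attempt count,
// never from a constant supplied at the call site.
function applyError(item: QueueItem, message: string): QueueItem {
  const attempts = item.attempt_count + 1;
  if (attempts >= MAX_ATTEMPTS) {
    // Terminal: stop requeueing this item.
    return { ...item, status: "failed", attempt_count: attempts, last_error: message };
  }
  // Retryable: back to pending, with claim metadata cleared.
  return {
    ...item,
    status: "pending",
    attempt_count: attempts,
    last_error: message,
    started_at: null,
    worker_version: null,
  };
}
```

At the call site in the diff above, this corresponds to passing the claimed item's current attempt count to `markError` rather than `0`, so a missing thought eventually reaches the failed state.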


Add blank lines around headings (MD022), fenced code blocks (MD031), and between adjacent blockquotes (MD028). Fix broken link fragment (MD051) and remove extra blank line (MD012). No content changes. These errors were blocking CI on all open PRs since the lint check runs repo-wide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Each section's numbered list now restarts at 1 instead of continuing the global count (3-14), satisfying markdownlint MD029/ol-prefix rule. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- dry_run now uses peekQueueItems() (read-only SELECT) instead of claimQueueItems(), so items stay "pending" during preview runs
- claimQueueItems() returns only rows actually claimed via .select(), preventing race conditions where concurrent workers see stale results
- markError() clears started_at and worker_version when resetting to "pending" so retryable items don't appear stale in monitoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot added the integration label Apr 6, 2026

chatgpt-codex-connector bot reviewed Apr 6, 2026


alanshurafa and others added 2 commits April 6, 2026 13:33

fix: renumber ordered lists in thought-enrichment README for MD029

6d73085


github-actions bot added the documentation and recipe labels Apr 6, 2026

alanshurafa and others added 3 commits April 6, 2026 13:53

[integrations] Add tool audit link to Slack capture README

93560c6

[recipes] Remove secret-like placeholders from thought enrichment README

b4a1ffc

alanshurafa wants to merge 6 commits into main from contrib/alanshurafa/entity-extraction-worker
