fix: enable publish dedup by default, add source-URL dedup, forward pre-selected taxonomies#823

Merged
chubes4 merged 1 commit into main from fix/wire-publish-dedup-and-taxonomy on Mar 16, 2026

chubes4 commented Mar 16, 2026

Summary

  • Publish dedup enabled by default: dedup_enabled now defaults to true, and a missing config value is treated as enabled. No more opt-in malarkey.
  • Source URL dedup: published posts store _datamachine_source_url meta. The datamachine/check-duplicate ability checks the source URL before title similarity, preventing same-source republishing even when the AI rewrites the title.
  • Pre-selected taxonomy forwarding: the WordPress publish handler now passes pre-selected taxonomy selections (e.g., fixed festival/location assignments) through to the publish ability. Previously only ai_decides selections were forwarded.
  • Source attribution: the handler now respects the link_handling config instead of always appending the source link.
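For reviewers who want the check order spelled out, here is a minimal Python sketch of the first two bullets. It is an illustration under stated assumptions, not the plugin's code: the post dictionaries, the `threshold` value, and the use of `difflib.SequenceMatcher` for title similarity are all hypothetical stand-ins. Only the `dedup_enabled` key and the `_datamachine_source_url` meta key come from this PR.

```python
from difflib import SequenceMatcher

# Meta key named in this PR; everything else below is a hypothetical stand-in.
SOURCE_URL_META_KEY = "_datamachine_source_url"


def dedup_enabled(handler_config):
    """Missing config counts as enabled; only an explicit falsy value opts out."""
    return bool(handler_config.get("dedup_enabled", True))


def check_duplicate(title, source_url, published_posts, threshold=0.85):
    """Return a dict describing the first match, or None if nothing matches."""
    # 1. Exact source URL match is checked first, so a rewritten title
    #    cannot hide a post that came from the same source.
    if source_url:
        for post in published_posts:
            if post["meta"].get(SOURCE_URL_META_KEY) == source_url:
                return {"match": "source_url", "post_id": post["id"]}

    # 2. Fall back to title similarity (assumed metric and threshold).
    for post in published_posts:
        ratio = SequenceMatcher(None, title.lower(), post["title"].lower()).ratio()
        if ratio >= threshold:
            return {"match": "title", "post_id": post["id"]}

    return None


posts = [
    {
        "id": 1,
        "title": "Festival lineup announced for summer",
        "meta": {SOURCE_URL_META_KEY: "https://example.com/r/thread/1"},
    }
]

# Same source, completely different headline: caught by the URL check.
by_url = check_duplicate("A totally rewritten headline", "https://example.com/r/thread/1", posts)

# No source URL available: the title fallback still applies.
by_title = check_duplicate("Festival lineup announced for summer!", None, posts)
```

The point of the ordering is that the URL comparison is exact and cheap, while title similarity is fuzzy and can be defeated by a rewrite, so it only serves as the fallback.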

Root cause

Wire was publishing 3 copies of the same Reddit thread because:

  1. Triple root jobs were spawned per scheduled run (fixed by the batch_state_missing fix in v0.42.0)
  2. Parallel fetches grabbed the same data before processed-items could mark it
  3. Publish dedup was disabled by default so nothing caught it downstream
  4. Pre-selected taxonomies were silently dropped, leaving ~300 Wire posts without festival/location
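The interaction between causes 1 to 3 can be reproduced with a small deterministic simulation. This is only a sketch: `run_scheduled_jobs`, the in-memory `processed` set, and the `published` list are hypothetical stand-ins for the real job runner, the processed-items store, and the posts table.

```python
def run_scheduled_jobs(feed, job_count, dedup_enabled):
    """Simulate several root jobs that all fetch before any of them marks items."""
    processed = set()  # stand-in for the processed-items store
    published = []     # source URLs of posts that were published

    # Phase 1: every job fetches while the processed set is still empty,
    # so each job sees the full feed (the parallel-fetch race).
    snapshots = [
        [url for url in feed if url not in processed]
        for _ in range(job_count)
    ]

    # Phase 2: each job marks and publishes its own snapshot.
    for snapshot in snapshots:
        for url in snapshot:
            processed.add(url)  # too late to help the other jobs
            if dedup_enabled and url in published:
                continue        # the publish-level check catches the repeat
            published.append(url)

    return published


feed = ["https://example.com/r/thread/1"]
without_dedup = run_scheduled_jobs(feed, job_count=3, dedup_enabled=False)
with_dedup = run_scheduled_jobs(feed, job_count=3, dedup_enabled=True)
```

With dedup off, three jobs yield three copies of the thread; with dedup on, the same race yields one. That is why the publish-level check acts as the last line of defense even after the triple-spawn bug is fixed.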

Testing

  • Added test_check_duplicate_finds_published_post_by_source_url to DuplicateCheckAbilityTest
  • Syntax-validated all changed files

…re-selected taxonomies

Three fixes for Wire duplicate publishing and missing taxonomy assignment:

1. Publish-level dedup is now enabled by default instead of opt-in.
   Flows without explicit dedup_enabled config are now protected.

2. Source URL dedup added to check-duplicate ability. Published posts
   now store _datamachine_source_url meta, and the duplicate checker
   queries it before falling back to title similarity. Prevents
   same-source republishing even when AI rewrites the title.

3. Pre-selected taxonomy selections are now forwarded through the
   WordPress publish handler to the ability. Previously only
   ai_decides selections were passed, causing flows with fixed
   festival/location assignments to silently drop them.
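A rough Python sketch of the forwarding change in item 3, for illustration only. The shape of the `selections` mapping and the `"skip"` value are assumptions made for this example; `ai_decides` is the only selection value named in this PR.

```python
AI_DECIDES = "ai_decides"  # value named in the PR
SKIP = "skip"              # hypothetical "do not assign" value


def forward_selections_before(selections):
    """Old behavior as described: only ai_decides selections reach the ability."""
    return {tax: sel for tax, sel in selections.items() if sel == AI_DECIDES}


def forward_selections_after(selections):
    """New behavior as described: fixed (pre-selected) terms are forwarded too."""
    return {tax: sel for tax, sel in selections.items() if sel and sel != SKIP}


# Hypothetical handler config: two fixed assignments, one AI choice, one skip.
selections = {
    "festival": "example-fest",
    "location": 42,
    "category": AI_DECIDES,
    "post_tag": SKIP,
}

before = forward_selections_before(selections)
after = forward_selections_after(selections)
```

In the old path the fixed `festival` and `location` entries never reach the publish ability, which matches the silent drop described above.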

Also respects link_handling config for source attribution instead
of always appending.
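As a hedged illustration of the attribution change: the sketch below assumes `link_handling` takes values such as `"append"` and `"none"` and defaults to `"append"`. Those values, the default, and the `Source:` format are guesses made for this example and are not confirmed by this PR.

```python
def apply_source_attribution(content, source_url, handler_config):
    """Append a source link only when the (assumed) link_handling mode asks for it."""
    mode = handler_config.get("link_handling", "append")  # assumed default
    if mode == "append" and source_url:
        return f"{content}\n\nSource: {source_url}"
    # Any other mode leaves the content untouched instead of always appending.
    return content


appended = apply_source_attribution(
    "Body text", "https://example.com/r/thread/1", {"link_handling": "append"}
)
untouched = apply_source_attribution(
    "Body text", "https://example.com/r/thread/1", {"link_handling": "none"}
)
```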

Closes #704
@chubes4 chubes4 merged commit 57bf98b into main Mar 16, 2026
2 of 3 checks passed
@chubes4 chubes4 deleted the fix/wire-publish-dedup-and-taxonomy branch March 16, 2026 23:00

github-actions bot commented Mar 16, 2026

Homeboy Results — data-machine

Lint

⚡ Scope: changed files only

lint (changed files only)

Test

Failure Digest

Test Failure Digest

Autofixability classification

  • Overall: human_needed
  • Autofix enabled: no
  • Autofix attempted this run: no
  • Human-needed failed commands:
    • test
  • Failed commands with available automated fixes:
    • test
  • Automated fixes are disabled for this step. Commands with available fix support in this run: test

Machine-readable artifacts

  • homeboy-lint-summary.json
  • homeboy-test-failures.json
  • homeboy-audit-summary.json
  • homeboy-autofixability.json

⚡ Scope: changed files only

test (changed files only)

Audit

⚡ Scope: changed files only

audit (changed files only)

Tooling versions
  • Homeboy CLI: homeboy 0.78.0+b6dd9e7b
  • Extension: wordpress from https://github.com/Extra-Chill/homeboy-extensions
  • Extension revision: unknown
  • Action: Extra-Chill/homeboy-action@v1

Homeboy Action v1
