feat(enrichment): near-verbatim duplication scan analyzer#1867

Merged
JSONbored merged 2 commits into
JSONbored:mainfrom
GildardoDev:feat/duplication-scan
Jun 30, 2026

Conversation

@GildardoDev

Contributor

What

A new REES analyzer that flags code added by a PR when it is a near-verbatim duplicate of a block that already exists elsewhere in the repo, that is, copy-paste rather than importing the existing helper. Each finding reports the head file and line, the source file and line it duplicates, and the matched line count.
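
As a rough illustration of the finding described above, the sketch below shows one possible shape for a finding and a one-line renderer. The interface name, field names, and output wording are assumptions made for this example; the actual definitions live in types.ts and render.ts and are not shown in this conversation.

```typescript
// Hypothetical finding shape: every name here is an assumption, not the PR's types.ts.
interface DuplicationFinding {
  headPath: string;     // file changed by the PR
  headLine: number;     // first line of the added block (1-based)
  sourcePath: string;   // existing file that already contains the block
  sourceLine: number;   // first line of the matching block (1-based)
  matchedLines: number; // number of significant lines that matched
}

// Render one finding from paths, line numbers, and a count only; no code content.
function renderFinding(f: DuplicationFinding): string {
  return `${f.headPath}:${f.headLine} duplicates ${f.sourcePath}:${f.sourceLine} (${f.matchedLines} significant lines)`;
}
```

Keeping the finding free of any source text is what lets a renderer like this stay public-safe by construction.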

Data source

The GitHub git tree at headSha (one recursive call) plus a bounded set of same-extension candidate blobs, fetched through the shared bounded JSON helper used by the other GitHub-backed analyzers. Everything after the fetch is pure compute, so there is no extra service dependency.
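
A minimal sketch of the data-source step, assuming the standard GitHub REST endpoint for recursive git trees. The helper names `treeUrl` and `parseTree`, the validation regexes, and the choice to return an empty list for a truncated tree are assumptions for illustration; the PR's shared bounded JSON helper is not reproduced here.

```typescript
interface TreeBlob { path: string; sha: string; size: number }

// Build the recursive tree URL; returns null on a bad repo name or SHA (assumed checks).
function treeUrl(repoFullName: string, headSha: string): string | null {
  if (!/^[\w.-]+\/[\w.-]+$/.test(repoFullName)) return null;
  if (!/^[0-9a-f]{7,64}$/i.test(headSha)) return null;
  return `https://api.github.com/repos/${repoFullName}/git/trees/${headSha}?recursive=1`;
}

// Parse a tree response defensively: anything unexpected yields an empty list.
function parseTree(body: unknown, maxBytes: number): TreeBlob[] {
  if (typeof body !== "object" || body === null) return [];
  const rec = body as { tree?: unknown; truncated?: unknown };
  // Assumption: a truncated tree is treated as "no candidates" rather than scanned partially.
  if (rec.truncated === true || !Array.isArray(rec.tree)) return [];
  const blobs: TreeBlob[] = [];
  for (const entry of rec.tree) {
    if (typeof entry !== "object" || entry === null) continue;
    const e = entry as Record<string, unknown>;
    if (e.type !== "blob") continue; // skip directories and submodules
    if (typeof e.path !== "string" || typeof e.sha !== "string") continue;
    if (typeof e.size !== "number" || e.size > maxBytes) continue; // per-file size bound
    blobs.push({ path: e.path, sha: e.sha, size: e.size });
  }
  return blobs;
}
```

Both functions are pure, so the only network work left is the single tree call and the bounded blob fetches.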

Behavior

Additive and fail-safe. The analyzer acts only on changed source files. It extracts the added hunks from the unified diff, normalizes lines (trim, collapse whitespace, drop trivial and boilerplate lines such as bare imports), and uses a rolling-hash window to look for a contiguous run of at least eight significant lines that reappears verbatim in a candidate file. It is deliberately conservative so that incidental overlap is not flagged.

Candidates are limited by proximity to the changed files, and the scan is bounded on candidate count, blob fetches, and per-file size. It returns no finding and never throws on a missing token or headSha, a bad repo name, a non-ok or malformed git response, or an oversized file.

Output is public-safe: file paths, line numbers, and a line count only, never any code content. The change follows the established analyzer pattern entirely within review-enrichment (finding type in types.ts, a pure analyzer in analyzers/duplication-scan.ts, a descriptor in analyzers/registry.ts, a public-safe block in render.ts), with the generated analyzer metadata regenerated to match.
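
The normalize-then-match idea can be sketched as follows. This is an illustration under stated assumptions, not the code in analyzers/duplication-scan.ts: the function names, the rules for what counts as trivial or boilerplate, and the hash constants are all invented here. Only the eight-line threshold comes from the description. Because the hash modulus is small, every hash hit is verified by direct comparison before it counts as a match.

```typescript
const MIN_RUN = 8;    // threshold stated in the PR description
const MOD = 1000003;  // small prime keeps products below 2^53 (assumed constant)
const BASE = 257;     // assumed polynomial base

interface SigLine { text: string; line: number } // normalized text + original 1-based line
interface Match { headLine: number; sourceLine: number; matchedLines: number }

// Trim, collapse whitespace, and drop trivial or boilerplate lines (assumed rules).
function significantLines(source: string, firstLine = 1): SigLine[] {
  const out: SigLine[] = [];
  source.split("\n").forEach((raw, i) => {
    const text = raw.trim().replace(/\s+/g, " ");
    const trivial = text.length < 3 || /^[{}()\[\];,]+$/.test(text);
    const bareImport = /^import\b/.test(text);
    if (!trivial && !bareImport) out.push({ text, line: firstLine + i });
  });
  return out;
}

function lineHash(text: string): number {
  let h = 0;
  for (let i = 0; i < text.length; i++) h = (h * 31 + text.charCodeAt(i)) % MOD;
  return h;
}

// Rolling polynomial hash of every MIN_RUN-line window; entry i covers lines i..i+MIN_RUN-1.
function windowHashes(lines: SigLine[]): number[] {
  if (lines.length < MIN_RUN) return [];
  const hs = lines.map((l) => lineHash(l.text));
  let top = 1; // BASE^(MIN_RUN-1) mod MOD
  for (let i = 1; i < MIN_RUN; i++) top = (top * BASE) % MOD;
  let h = 0;
  for (let i = 0; i < MIN_RUN; i++) h = (h * BASE + hs[i]) % MOD;
  const out = [h];
  for (let i = MIN_RUN; i < hs.length; i++) {
    h = (h - ((hs[i - MIN_RUN] * top) % MOD) + MOD) % MOD; // drop the oldest line
    h = (h * BASE + hs[i]) % MOD;                          // append the newest line
    out.push(h);
  }
  return out;
}

// Return the first verified run of at least MIN_RUN identical significant lines, or null.
function findDuplicate(added: SigLine[], candidate: SigLine[]): Match | null {
  const index: Record<number, number[]> = {};
  windowHashes(candidate).forEach((h, i) => { (index[h] = index[h] || []).push(i); });
  const addedHashes = windowHashes(added);
  for (let a = 0; a < addedHashes.length; a++) {
    for (const c of index[addedHashes[a]] || []) {
      let n = 0; // verify verbatim to rule out collisions, then extend the run
      while (a + n < added.length && c + n < candidate.length &&
             added[a + n].text === candidate[c + n].text) n++;
      if (n >= MIN_RUN) return { headLine: added[a].line, sourceLine: candidate[c].line, matchedLines: n };
    }
  }
  return null;
}
```

In this sketch a block shorter than eight significant lines produces no window hashes at all, which is one simple way to get the conservative behavior described above.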

Tests

25 node:test cases with a mocked fetch cover real near-verbatim detection with correct head and source line numbers, the conservative non-detection cases (below the run threshold and boilerplate-only), candidate selection (same-extension only, the changed file itself excluded), and the fail-safe paths (no token, no headSha, bad repo name, non-ok or malformed tree, a throwing or non-ok blob skipped, truncated tree, oversized file skipped, query bounding, and an already-aborted signal), plus the public-safe render. Full review-enrichment suite passes.
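
The fail-safe paths listed above lend themselves to small fetch doubles. The sketch below is a hypothetical illustration of that testing style, not the PR's test file: `FetchLike`, `fetchJsonOrNull`, and the mock factories are names invented here, and the minimal structural types stand in for the real `fetch` and `AbortSignal` so the example needs no extra typings.

```typescript
// Minimal stand-ins for fetch and AbortSignal (assumed shapes for this sketch).
type SignalLike = { aborted: boolean };
type FetchLike = (
  url: string,
  init?: { headers: Record<string, string>; signal?: SignalLike },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// Hypothetical fail-safe fetch: resolves to null instead of throwing.
async function fetchJsonOrNull(
  fetchImpl: FetchLike,
  url: string,
  token: string | undefined,
  signal?: SignalLike,
): Promise<unknown> {
  if (!token || signal?.aborted) return null; // no token or already aborted: do nothing
  try {
    const res = await fetchImpl(url, { headers: { Authorization: `Bearer ${token}` }, signal });
    if (!res.ok) return null;                 // non-ok response
    return await res.json();                  // may throw on a malformed body
  } catch {
    return null;                              // network error or bad JSON
  }
}

// Fetch doubles covering the success path and three failure paths.
const okFetch = (body: unknown): FetchLike => async () => ({ ok: true, json: async () => body });
const notOkFetch: FetchLike = async () => ({ ok: false, json: async () => ({}) });
const throwingFetch: FetchLike = async () => { throw new Error("network down"); };
const malformedFetch: FetchLike = async () => ({
  ok: true,
  json: async () => { throw new SyntaxError("bad json"); },
});
```

Injecting the fetch implementation keeps each failure mode a one-line double, so every fail-safe case can assert the same thing: a null result and no thrown error.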

Closes #1520

@GildardoDev GildardoDev requested a review from JSONbored as a code owner June 30, 2026 11:23
@dosubot dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Jun 30, 2026
@superagent-security superagent-security Bot removed the size:XL This PR changes 500-999 lines, ignoring generated files. label Jun 30, 2026
@gittensory-orb

gittensory-orb Bot commented Jun 30, 2026


Tip

🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩

✅ Gittensory review result - approve/merge recommended

Review updated: 2026-06-30 19:05:58 UTC

9 files · 1 AI reviewer · no blockers · readiness 55/100 · CI green · unknown

✅ Suggested Action - Approve/Merge

  • safe to merge

Review summary
The change adds a bounded GitHub-backed duplication analyzer, wires it into the REES registry, metadata, UI docs, rendering, and adds focused unit coverage for detection, fail-safe fetch behavior, and budget limits. The current diff addresses the cross-extension candidate starvation issue by bucketing candidates per extension and round-robin fetching under a global cap, so the visible implementation is coherent and safe enough to proceed. The main remaining concern is precision: same-file duplication is intentionally missed because the changed file is excluded from candidates while scanning the head tree.
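
The per-extension bucketing mentioned in the summary could look roughly like the sketch below. It is an assumption-laden illustration: the function name, the extension rule, and the iteration order are invented here, and the merged implementation may differ.

```typescript
// Group candidate paths by extension, then take one per bucket in turn until the cap.
function roundRobinByExtension(paths: string[], globalCap: number): string[] {
  const buckets: Record<string, string[]> = {};
  const order: string[] = []; // extensions in first-seen order
  for (const p of paths) {
    const dot = p.lastIndexOf(".");
    const ext = dot > p.lastIndexOf("/") ? p.slice(dot) : ""; // "" when there is no extension
    if (!buckets[ext]) { buckets[ext] = []; order.push(ext); }
    buckets[ext].push(p);
  }
  const picked: string[] = [];
  for (let round = 0; picked.length < globalCap; round++) {
    let tookAny = false;
    for (const ext of order) {
      const next = buckets[ext][round];
      if (next === undefined) continue; // this bucket is exhausted
      picked.push(next);
      tookAny = true;
      if (picked.length >= globalCap) break;
    }
    if (!tookAny) break; // every bucket is exhausted
  }
  return picked;
}
```

With a plain first-come cap, a large run of one extension could use up the whole budget before any file of another extension is fetched; taking one path per bucket per round avoids that starvation.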

Nits — 7 non-blocking
  • nit: review-enrichment/src/analyzers/duplication-scan.ts:424 excludes every changed path from candidates, so a newly added block copied from an unchanged block elsewhere in the same modified file will not be reported; document that limitation or add a same-file-safe comparison strategy.
  • nit: review-enrichment/src/analyzers/duplication-scan.ts:503 keeps scanning all added blocks after MAX_FINDINGS is already exceeded, so large matching PRs can do unnecessary CPU work even though the return value is capped.
  • nit: review-enrichment/test/duplication-scan.test.ts uses domain-flavored fixture identifiers that distract from the analyzer behavior; neutral fixture names would make the tests easier to maintain.
  • In review-enrichment/src/analyzers/duplication-scan.ts:424, either document that same-file duplicates are out of scope or compare added blocks against pre-added/source-side blocks from the same file so same-file copy-paste can be caught without self-matching the added hunk.
  • In review-enrichment/src/analyzers/duplication-scan.ts:503, stop once findings reaches MAX_FINDINGS if deterministic top-N ordering is not needed beyond the cap, or keep a bounded top-N heap if longest-match ordering matters.
  • Pull request duplicates other open work — Check for an existing pull request or issue covering this change and coordinate or consolidate before continuing.
  • Readiness score is below the configured threshold — Use the readiness panel as advisory maintainer context; the score does not block this PR.
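
The early-stop suggestion in the nits above can be sketched as a capped collection loop. The helper name, the generic signature, and the cap value of 5 are hypothetical; the real MAX_FINDINGS value is not shown in this conversation, and this variant assumes ordering beyond the cap does not matter.

```typescript
const MAX_FINDINGS = 5; // hypothetical value; the PR's actual cap is not shown

// Scan blocks in order and stop as soon as the cap is reached.
function collectCapped<B, F>(
  blocks: B[],
  scan: (block: B) => F | null,
  cap: number = MAX_FINDINGS,
): { findings: F[]; scanned: number } {
  const findings: F[] = [];
  let scanned = 0;
  for (const block of blocks) {
    if (findings.length >= cap) break; // cap reached: skip the remaining blocks
    scanned++;
    const hit = scan(block);
    if (hit !== null) findings.push(hit);
  }
  return { findings, scanned };
}
```

If longest-match ordering matters, this loop would not be enough on its own, since it keeps the first matches found rather than the best ones.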

| Signal | Result | Evidence |
| --- | --- | --- |
| Code review | ✅ No blockers | 1 reviewer |
| Linked issue | ✅ Linked | #1520 |
| Related work | ⚠️ 3 scoped overlaps | Top overlaps are listed below; lower-confidence bulk is hidden. |
| Change scope | ❌ 8/20 | High review scope from cached public metadata (size label size:XL; 1 linked issue). |
| Validation posture | ❌ 5/25 | Preflight is holding this PR; address the blocker before review. |
| Contributor workload | ✅ 10/10 | Author activity: 195 registered-repo PR(s), 144 merged, 1 issue(s). |
| Contributor context | ✅ Confirmed | Gittensor contributor GildardoDev; Gittensor profile; 195 PR(s), 1 issue(s). |
| Gate result | ✅ Passing | No configured blocker found. |

Review context
  • Author: GildardoDev
  • Role context: outside_contributor
  • Public audience mode: oss maintainer
  • Lane context: Repository registration is not available in the local Gittensory cache.
  • Public profile languages: not available
  • Official Gittensor activity: 195 PR(s), 1 issue(s).
  • Related work: Titles/paths share 6 meaningful terms. (PR #1882)
  • Related work: Titles/paths share 6 meaningful terms. (PR #1864)
  • Related work: Items reference the same linked issue feat(enrichment): Churn-hotspot + bug-density scorer #1513. (issue #1513, PR #1882)
  • Additional title-only matches omitted; title-only overlap does not block.
Contributor next steps
  • Review top overlaps.
  • Add a concise scope and risk note.
  • Fix the blocker.
  • Triage stale or unlinked PRs.
  • Refresh registry data or choose a registered active repo.
  • Check active issues and PRs before submitting.
Signal definitions
  • Related work = same linked issue, overlapping active PRs, or title/path similarity.
  • Change scope = cached public metadata such as size labels, draft state, and review-burden hints.
  • Validation posture = whether the PR provides enough public validation/test evidence for maintainer review.
  • Contributor workload = public contributor activity and cleanup pressure, not a repo-wide quality failure.
  • Contributor context = public GitHub/Gittensor identity context; non-Gittensor status is not a blocker.

🟩 Safe / merged · 🟦 Advisory · 🟨 Held for review · 🟥 Blocked / closed


Checked by Gittensory, a quiet PR intelligence layer for OSS maintainers.

@gittensory-orb gittensory-orb Bot added gittensor Gittensor contributor context gittensor:feature Gittensor-scored feature linked to a feature issue — scores a 1.25x multiplier. labels Jun 30, 2026
@dosubot dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Jun 30, 2026
@GildardoDev GildardoDev force-pushed the feat/duplication-scan branch from 39adaee to c356a52 Compare June 30, 2026 12:27
@GildardoDev GildardoDev force-pushed the feat/duplication-scan branch from 09ea19f to 8968734 Compare June 30, 2026 12:53
@GildardoDev GildardoDev force-pushed the feat/duplication-scan branch from 8968734 to c0e8da7 Compare June 30, 2026 13:03
@dosubot dosubot Bot added the lgtm Approved by a maintainer. label Jun 30, 2026
@JSONbored JSONbored merged commit 6859151 into JSONbored:main Jun 30, 2026
8 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in gittensory - v1 roadmap Jun 30, 2026
Labels

gittensor:feature Gittensor-scored feature linked to a feature issue — scores a 1.25x multiplier. gittensor Gittensor contributor context lgtm Approved by a maintainer. size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

feat(enrichment): Full-file / near-verbatim duplication scan

2 participants