feat(enrichment): near-verbatim duplication scan analyzer by GildardoDev · Pull Request #1867 · JSONbored/gittensory

GildardoDev · 2026-06-30T11:23:45Z

What

A new REES analyzer that flags code added by a PR that is a near-verbatim duplicate of a block already existing elsewhere in the repo, that is, copy-paste instead of importing the existing helper. Each finding is reported as the head file and line versus the source file and line it duplicates, plus the matched line count.
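
As a rough illustration of the finding shape described above, the sketch below models one finding and a renderer that emits only paths, line numbers, and a count. The interface name, field names, and output wording are hypothetical; the PR's actual type in types.ts and its block in render.ts may differ.

```typescript
// Hypothetical finding shape; not taken from the PR's types.ts.
interface DuplicationFinding {
  headFile: string;     // file changed by the PR
  headLine: number;     // first duplicated line in the head file
  sourceFile: string;   // existing file that already contains the block
  sourceLine: number;   // first line of the existing block
  matchedLines: number; // length of the duplicated run
}

// Renders paths, line numbers, and a count only; no code content is included.
function renderFinding(f: DuplicationFinding): string {
  return (
    `${f.headFile}:${f.headLine} duplicates ` +
    `${f.sourceFile}:${f.sourceLine} (${f.matchedLines} lines)`
  );
}
```

Because the finding carries no source text at all, the renderer cannot leak code even if a caller passes it straight into a public comment.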

Data source

The GitHub git tree at headSha (one recursive call) plus a bounded set of same-extension candidate blobs, fetched through the shared bounded JSON helper used by the other GitHub-backed analyzers. Everything after the fetch is pure compute, so there is no extra service dependency.
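
A minimal sketch of the single recursive tree call, assuming the public GitHub REST endpoint `GET /repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=1`. The function names, the injected `FetchLike` type, and the choice to treat a truncated tree as "no listing" are assumptions made for illustration; the PR itself goes through a shared bounded JSON helper that is not shown here.

```typescript
// Minimal fetch shape so the sketch does not depend on DOM or Node typings.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

interface TreeEntry { path: string; type: string; sha: string; size?: number }

// Returns null for a missing headSha or a repo name that is not "owner/name".
function treeUrl(repo: string, headSha: string): string | null {
  if (!headSha || !/^[\w.-]+\/[\w.-]+$/.test(repo)) return null;
  return `https://api.github.com/repos/${repo}/git/trees/${encodeURIComponent(headSha)}?recursive=1`;
}

// Defensive parse: any malformed body yields an empty list instead of throwing.
// Treating `truncated: true` as "no listing" is an assumption of this sketch.
function parseTree(body: unknown): TreeEntry[] {
  if (typeof body !== "object" || body === null) return [];
  const { tree, truncated } = body as { tree?: unknown; truncated?: unknown };
  if (!Array.isArray(tree) || truncated === true) return [];
  return tree.filter(
    (e: unknown): e is TreeEntry =>
      typeof e === "object" && e !== null &&
      (e as TreeEntry).type === "blob" &&
      typeof (e as TreeEntry).path === "string" &&
      typeof (e as TreeEntry).sha === "string",
  );
}

// Fail-safe wrapper: missing inputs, non-ok responses, and thrown errors all
// collapse to an empty result.
async function fetchTreeEntries(
  fetchImpl: FetchLike,
  repo: string,
  headSha: string,
  token: string,
): Promise<TreeEntry[]> {
  const url = treeUrl(repo, headSha);
  if (!url || !token) return [];
  try {
    const res = await fetchImpl(url, {
      headers: { Authorization: `Bearer ${token}`, Accept: "application/vnd.github+json" },
    });
    return res.ok ? parseTree(await res.json()) : [];
  } catch {
    return [];
  }
}
```

Injecting the fetch function keeps everything after the request pure and makes it straightforward to exercise with a mocked fetch.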

Behavior

Additive and fail-safe. It acts only on changed source files, extracts the added hunks from the unified diff, and normalizes lines (trim, collapse whitespace, drop trivial and boilerplate lines such as bare imports). Using a rolling-hash window, it then looks for a contiguous run of at least eight significant lines that reappears verbatim in a candidate file. It is deliberately conservative so incidental overlap is not flagged.

Candidates are limited by proximity to the changed files, and the scan is bounded on candidate count, blob fetches, and per-file size. It returns no finding and never throws on a missing token or headSha, a bad repo name, a non-ok or malformed git response, or an oversized file.

Output is public-safe: file paths, line numbers, and a line count only, never any code content. It follows the established analyzer pattern entirely within review-enrichment (finding type in types.ts, a pure analyzer in analyzers/duplication-scan.ts, a descriptor in analyzers/registry.ts, a public-safe block in render.ts), with the generated analyzer metadata regenerated to match.
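
The normalize-then-match step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in analyzers/duplication-scan.ts: the rule for what counts as a trivial line, the hash parameters, and all function names are hypothetical. Only the threshold of eight significant lines comes from the description above.

```typescript
const MIN_RUN = 8;     // run threshold stated in the PR description
const MOD = 1_000_003; // illustrative hash parameters (assumptions)
const BASE = 257;

interface SigLine { text: string; line: number } // normalized text + 1-based line

// Trim, collapse whitespace, and drop lines treated as trivial or boilerplate.
// The "trivial" rule here (very short lines and bare imports) is an assumption.
function significantLines(source: string, firstLine = 1): SigLine[] {
  const out: SigLine[] = [];
  source.split("\n").forEach((raw, i) => {
    const text = raw.trim().replace(/\s+/g, " ");
    if (text.length >= 3 && !/^import\b/.test(text)) out.push({ text, line: firstLine + i });
  });
  return out;
}

function hashLine(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) h = (h * BASE + s.charCodeAt(i)) % MOD;
  return h;
}

// Polynomial hash of every MIN_RUN-line window, updated incrementally.
function windowHashes(lines: SigLine[]): number[] {
  if (lines.length < MIN_RUN) return [];
  const hs = lines.map((l) => hashLine(l.text));
  let top = 1; // BASE^(MIN_RUN - 1) mod MOD, weight of the outgoing line
  for (let i = 1; i < MIN_RUN; i++) top = (top * BASE) % MOD;
  let h = 0;
  for (let i = 0; i < MIN_RUN; i++) h = (h * BASE + hs[i]) % MOD;
  const out = [h];
  for (let i = MIN_RUN; i < hs.length; i++) {
    h = (h - ((hs[i - MIN_RUN] * top) % MOD) + MOD) % MOD; // drop oldest line
    h = (h * BASE + hs[i]) % MOD;                          // append newest line
    out.push(h);
  }
  return out;
}

// First run of >= MIN_RUN significant lines shared by both inputs, or null.
function findDuplicateRun(added: SigLine[], candidate: SigLine[]) {
  const index = new Map<number, number[]>();
  windowHashes(candidate).forEach((h, i) => {
    const bucket = index.get(h);
    if (bucket) bucket.push(i);
    else index.set(h, [i]);
  });
  const addedHashes = windowHashes(added);
  for (let a = 0; a < addedHashes.length; a++) {
    for (const c of index.get(addedHashes[a]) ?? []) {
      // Compare text directly to rule out hash collisions, then extend the run.
      let n = 0;
      while (
        a + n < added.length && c + n < candidate.length &&
        added[a + n].text === candidate[c + n].text
      ) n++;
      if (n >= MIN_RUN) {
        return { headLine: added[a].line, sourceLine: candidate[c].line, matchedLines: n };
      }
    }
  }
  return null;
}
```

The hash only nominates window positions; the direct text comparison decides, so a collision costs a little work but cannot produce a false finding. In this sketch `matchedLines` counts significant lines rather than raw file lines.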

Tests

25 node:test cases with a mocked fetch cover real near-verbatim detection with correct head and source line numbers, the conservative non-detection cases (below the run threshold and boilerplate-only), candidate selection (same-extension only, the changed file itself excluded), and the fail-safe paths (no token, no headSha, bad repo name, non-ok or malformed tree, a throwing or non-ok blob skipped, truncated tree, oversized file skipped, query bounding, and an already-aborted signal), plus the public-safe render. Full review-enrichment suite passes.

Closes #1520

gittensory-orb · 2026-06-30T11:42:10Z

Tip

🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩

✅ Gittensory review result - approve/merge recommended

Review updated: 2026-06-30 19:05:58 UTC

9 files · 1 AI reviewer · no blockers · readiness 55/100 · CI green · unknown

✅ Suggested Action - Approve/Merge

safe to merge

Review summary
The change adds a bounded GitHub-backed duplication analyzer, wires it into the REES registry, metadata, UI docs, rendering, and adds focused unit coverage for detection, fail-safe fetch behavior, and budget limits. The current diff addresses the cross-extension candidate starvation issue by bucketing candidates per extension and round-robin fetching under a global cap, so the visible implementation is coherent and safe enough to proceed. The main remaining concern is precision: same-file duplication is intentionally missed because the changed file is excluded from candidates while scanning the head tree.
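
The per-extension bucketing and round-robin selection mentioned in the summary could look something like the sketch below. The function names and the cap parameter are hypothetical, and proximity ordering within each bucket is omitted; the point is only that one extension with many files cannot use up the whole global budget.

```typescript
// File extension including the dot, lowercased; "" when there is none.
function extOf(path: string): string {
  const base = path.slice(path.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// Picks up to `cap` candidate paths, taking one per extension bucket per round.
function pickCandidates(changed: string[], treePaths: string[], cap: number): string[] {
  const changedSet = new Set(changed);
  const buckets = new Map<string, string[]>();
  // One bucket per extension that appears among the changed files.
  for (const ext of changed.map(extOf)) {
    if (ext && !buckets.has(ext)) buckets.set(ext, []);
  }
  // Same-extension files only; the changed files themselves are excluded.
  for (const p of treePaths) {
    const bucket = buckets.get(extOf(p));
    if (bucket && !changedSet.has(p)) bucket.push(p);
  }
  const queues = Array.from(buckets.values());
  const picked: string[] = [];
  for (let round = 0; picked.length < cap; round++) {
    let took = false;
    for (const q of queues) {
      if (round < q.length && picked.length < cap) {
        picked.push(q[round]);
        took = true;
      }
    }
    if (!took) break; // every bucket is exhausted
  }
  return picked;
}
```

With changed files in `.ts` and `.py` and a cap of 3, the first round takes one file from each bucket before any bucket gets a second pick.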

Nits — 7 non-blocking

nit: review-enrichment/src/analyzers/duplication-scan.ts:424 excludes every changed path from candidates, so a newly added block copied from an unchanged block elsewhere in the same modified file will not be reported; document that limitation or add a same-file-safe comparison strategy.
nit: review-enrichment/src/analyzers/duplication-scan.ts:503 keeps scanning all added blocks after MAX_FINDINGS is already exceeded, so large matching PRs can do unnecessary CPU work even though the return value is capped.
nit: review-enrichment/test/duplication-scan.test.ts uses domain-flavored fixture identifiers that distract from the analyzer behavior; neutral fixture names would make the tests easier to maintain.
In review-enrichment/src/analyzers/duplication-scan.ts:424, either document that same-file duplicates are out of scope or compare added blocks against pre-added/source-side blocks from the same file so same-file copy-paste can be caught without self-matching the added hunk.
In review-enrichment/src/analyzers/duplication-scan.ts:503, stop once findings reaches MAX_FINDINGS if deterministic top-N ordering is not needed beyond the cap, or keep a bounded top-N heap if longest-match ordering matters.
Pull request duplicates other open work — Check for an existing pull request or issue covering this change and coordinate or consolidate before continuing.
Readiness score is below the configured threshold — Use the readiness panel as advisory maintainer context; the score does not block this PR.
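
The MAX_FINDINGS nit above suggests stopping early when ordering beyond the cap does not matter. A minimal sketch of that option follows, assuming a hypothetical `scan` callback that returns the findings for one added block; the constant's value here is illustrative, since the PR page does not state it.

```typescript
const MAX_FINDINGS = 3; // illustrative value; the real constant is not stated here

// Collects findings block by block and returns as soon as the cap is reached,
// so later blocks are never scanned.
function collectFindings<B, F>(
  blocks: B[],
  scan: (block: B) => F[],
  max: number = MAX_FINDINGS,
): F[] {
  const findings: F[] = [];
  for (const block of blocks) {
    for (const f of scan(block)) {
      findings.push(f);
      if (findings.length >= max) return findings;
    }
  }
  return findings;
}
```

This keeps whichever findings appear first. If the longest matches should win instead, the alternative named in the nit (a bounded top-N structure) would be needed, at the cost of scanning every block.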

Signal	Result	Evidence
Code review	✅ No blockers	1 reviewer
Linked issue	✅ Linked	#1520
Related work	⚠️ 3 scoped overlaps	Top overlaps are listed below; lower-confidence bulk is hidden.
Change scope	❌ 8/20	High review scope from cached public metadata (size label size:XL; 1 linked issue).
Validation posture	❌ 5/25	Preflight is holding this PR; address the blocker before review.
Contributor workload	✅ 10/10	Author activity: 195 registered-repo PR(s), 144 merged, 1 issue(s).
Contributor context	✅ Confirmed Gittensor contributor	GildardoDev; Gittensor profile; 195 PR(s), 1 issue(s).
Gate result	✅ Passing	No configured blocker found.

Review context

Author: GildardoDev
Role context: outside_contributor
Public audience mode: oss maintainer
Lane context: Repository registration is not available in the local Gittensory cache.
Public profile languages: not available
Official Gittensor activity: 195 PR(s), 1 issue(s).
Related work: Titles/paths share 6 meaningful terms. (PR #1882)
Related work: Titles/paths share 6 meaningful terms. (PR #1864)
Related work: Items reference the same linked issue feat(enrichment): Churn-hotspot + bug-density scorer #1513. (issue #1513, PR #1882)
Additional title-only matches omitted; title-only overlap does not block.

Contributor next steps

Review top overlaps.
Add a concise scope and risk note.
Fix the blocker.
Triage stale or unlinked PRs.
Refresh registry data or choose a registered active repo.
Check active issues and PRs before submitting.

Signal definitions

Related work = same linked issue, overlapping active PRs, or title/path similarity.
Change scope = cached public metadata such as size labels, draft state, and review-burden hints.
Validation posture = whether the PR provides enough public validation/test evidence for maintainer review.
Contributor workload = public contributor activity and cleanup pressure, not a repo-wide quality failure.
Contributor context = public GitHub/Gittensor identity context; non-Gittensor status is not a blocker.

🟩 Safe / merged · 🟦 Advisory · 🟨 Held for review · 🟥 Blocked / closed


Checked by Gittensory, a quiet PR intelligence layer for OSS maintainers.


GildardoDev requested a review from JSONbored as a code owner June 30, 2026 11:23

dosubot Bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files.) Jun 30, 2026

superagent-security Bot removed the size:XL label (This PR changes 500-999 lines, ignoring generated files.) Jun 30, 2026

github-actions Bot deployed to preview/pr-1867 June 30, 2026 11:24 View deployment

JSONbored assigned GildardoDev Jun 30, 2026

JSONbored added this to gittensory - v1 roadmap Jun 30, 2026

github-project-automation Bot moved this to Todo in gittensory - v1 roadmap Jun 30, 2026

JSONbored added this to the M3.5 — Agent Layer Phase 2 (maintainer auto-maintain) milestone Jun 30, 2026

gittensory-orb Bot added the gittensor (Gittensor contributor context) and gittensor:feature (Gittensor-scored feature linked to a feature issue; scores a 1.25x multiplier) labels Jun 30, 2026

gittensory-orb Bot mentioned this pull request Jun 30, 2026

feat(enrichment): register docCommentDrift in engine REES_ANALYZER_NAMES #1861

Merged

2 tasks

dosubot Bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files.) Jun 30, 2026

gittensory-orb Bot mentioned this pull request Jun 30, 2026

feat(enrichment): derive linked issue from PR body when unset #1864

Merged

2 tasks

github-actions Bot deployed to preview/pr-1867 June 30, 2026 12:15 View deployment

GildardoDev force-pushed the feat/duplication-scan branch from 39adaee to c356a52 Compare June 30, 2026 12:27

github-actions Bot deployed to preview/pr-1867 June 30, 2026 12:29 View deployment

github-actions Bot deployed to preview/pr-1867 June 30, 2026 12:43 View deployment

GildardoDev force-pushed the feat/duplication-scan branch from 09ea19f to 8968734 Compare June 30, 2026 12:53

github-actions Bot deployed to preview/pr-1867 June 30, 2026 12:55 View deployment

GildardoDev force-pushed the feat/duplication-scan branch from 8968734 to c0e8da7 Compare June 30, 2026 13:03

github-actions Bot deployed to preview/pr-1867 June 30, 2026 13:04 View deployment

gittensory-orb Bot mentioned this pull request Jun 30, 2026

feat(enrichment): churn-hotspot analyzer #1882

Merged

github-actions Bot deployed to preview/pr-1867 June 30, 2026 17:52 View deployment

This was referenced Jun 30, 2026

feat(rees): standardize sentry tags and fingerprints #1878

Merged

feat(github): memoize live review facts per request #1872

Merged

feat(signals): classify Elixir, Swift, and Gradle lockfiles #1885

Merged

feat(enrichment): add near-verbatim duplication scan analyzer for rev…

5304a68

…iew briefs

gittensory-orb Bot mentioned this pull request Jun 30, 2026

feat(selfhost): cache stable GitHub GraphQL reads #1880

Open

12 tasks

GildardoDev force-pushed the feat/duplication-scan branch from 702b10e to 5304a68 Compare June 30, 2026 18:17

github-actions Bot deployed to preview/pr-1867 June 30, 2026 18:18 View deployment

gittensory-orb Bot mentioned this pull request Jun 30, 2026

feat(selfhost): expand Sentry observability context for self-host runtime #1881

Closed

14 tasks

Merge branch 'main' into feat/duplication-scan

cdcff93

github-actions Bot deployed to preview/pr-1867 June 30, 2026 18:32 View deployment

JSONbored approved these changes Jun 30, 2026


dosubot Bot added the lgtm label (Approved by a maintainer.) Jun 30, 2026

JSONbored merged commit 6859151 into JSONbored:main Jun 30, 2026
8 checks passed

github-project-automation Bot moved this from Todo to Done in gittensory - v1 roadmap Jun 30, 2026

JSONbored merged 2 commits into JSONbored:main from GildardoDev:feat/duplication-scan
