feat(enrichment): near-verbatim duplication scan analyzer #1867
✅ Gittensory review result: approve/merge recommended. Review updated: 2026-06-30 19:05:58 UTC
Review summary: 7 non-blocking nits
What
A new REES analyzer that flags code added by a PR when it is a near-verbatim duplicate of a block that already exists elsewhere in the repo, i.e. copy-paste instead of importing the existing helper. Each finding reports the head file and line, the source file and line it duplicates, and the matched line count.
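As a rough illustration of that finding shape, here is a minimal sketch. The names `DuplicationFinding` and `describeFinding` and the field names are hypothetical and are not taken from types.ts or render.ts. Only the fields listed above are modeled.

```typescript
// Hypothetical shape: the real finding type lives in types.ts and may differ.
interface DuplicationFinding {
  headFile: string;     // file in the PR head that contains the added code
  headLine: number;     // first line of the duplicated run in the head file
  sourceFile: string;   // existing file the run was copied from
  sourceLine: number;   // first line of the matching run in the source file
  matchedLines: number; // length of the matched run
}

// One-line summary built only from paths and numbers, never code content.
function describeFinding(f: DuplicationFinding): string {
  return `${f.headFile}:${f.headLine} duplicates ${f.sourceFile}:${f.sourceLine} (${f.matchedLines} lines)`;
}

const example: DuplicationFinding = {
  headFile: "src/new/feature.ts",
  headLine: 40,
  sourceFile: "src/shared/helpers.ts",
  sourceLine: 12,
  matchedLines: 10,
};
console.log(describeFinding(example));
```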
Data source
The GitHub git tree at headSha (one recursive call) plus a bounded set of same-extension candidate blobs, fetched through the shared bounded JSON helper used by the other GitHub-backed analyzers. Everything after the fetch is pure compute, so there is no extra service dependency.
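To make the candidate step concrete, here is a hedged sketch of picking same-extension blobs from an already-fetched recursive tree response. The `tree`, `truncated`, `path`, `type`, `sha`, and `size` fields follow the GitHub "get a tree" response. The names `selectCandidates`, `Limits`, and `sharedDirDepth`, the shared-directory proximity score, and treating a truncated tree as a no-op are all assumptions rather than the PR's actual logic.

```typescript
// Subset of one entry in the GitHub git tree response.
interface TreeEntry { path: string; type: string; sha: string; size?: number }

// Hypothetical bounds; the PR's real limits are not stated here.
interface Limits { maxCandidates: number; maxBytes: number }

// Extension of the file name only, so dots in directory names are ignored.
function extensionOf(path: string): string {
  const base = path.slice(path.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot <= 0 ? "" : base.slice(dot);
}

// Number of leading directory segments two paths share; higher means closer.
function sharedDirDepth(a: string, b: string): number {
  const da = a.split("/").slice(0, -1);
  const db = b.split("/").slice(0, -1);
  let n = 0;
  while (n < da.length && n < db.length && da[n] === db[n]) n++;
  return n;
}

function selectCandidates(treeJson: unknown, changed: string[], limits: Limits): TreeEntry[] {
  // Fail safe: a malformed or truncated tree (assumption) yields no candidates.
  const body = treeJson as { tree?: unknown; truncated?: unknown } | null;
  if (!body || !Array.isArray(body.tree) || body.truncated === true) return [];
  if (changed.length === 0) return [];

  const changedSet = new Set(changed);
  const exts = new Set(changed.map(extensionOf));
  const blobs = (body.tree as unknown[]).filter((e): e is TreeEntry => {
    const t = e as Partial<TreeEntry> | null;
    return !!t && t.type === "blob" && typeof t.path === "string" && typeof t.sha === "string";
  });

  return blobs
    .filter((e) => !changedSet.has(e.path))          // never compare a file with itself
    .filter((e) => exts.has(extensionOf(e.path)))    // same extension only
    .filter((e) => (e.size ?? 0) <= limits.maxBytes) // skip oversized files
    .map((e) => ({ e, score: Math.max(...changed.map((c) => sharedDirDepth(c, e.path))) }))
    .sort((x, y) => y.score - x.score || x.e.path.localeCompare(y.e.path))
    .slice(0, limits.maxCandidates)                  // bound the number of blob fetches
    .map((x) => x.e);
}
```

Keeping this step a pure function of the tree response means the only network work left is fetching the selected blobs, which is where the bounds apply.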
Behavior
Additive and fail-safe. The analyzer acts only on changed source files. It extracts the added hunks from the unified diff, normalizes lines (trim, collapse whitespace, drop trivial and boilerplate lines such as bare imports), and uses a rolling-hash window to look for a contiguous run of at least eight significant lines that reappears verbatim in a candidate file. It is deliberately conservative so that incidental overlap is not flagged.
Candidates are limited by proximity to the changed files, and the scan is bounded on candidate count, blob fetches, and per-file size. The analyzer returns no finding and never throws on a missing token or headSha, a bad repo name, a non-ok or malformed git response, or an oversized file.
Output is public-safe: file paths, line numbers, and a line count only, never any code content.
The change follows the established analyzer pattern entirely within review-enrichment: the finding type in types.ts, a pure analyzer in analyzers/duplication-scan.ts, a descriptor in analyzers/registry.ts, and a public-safe block in render.ts. The generated analyzer metadata is regenerated to match.
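The normalize-then-match step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in analyzers/duplication-scan.ts: every name is hypothetical, the "trivial line" rules are a guess at what the PR drops, and the hash parameters are arbitrary. Only the eight-line threshold comes from the description above.

```typescript
const MIN_RUN = 8;     // threshold from the PR description
const BASE = 257;      // arbitrary hash base (assumption)
const MOD = 1_000_003; // small prime so intermediate products stay far below 2^53

interface SigLine { line: number; text: string } // original line number + normalized text
interface Match { headLine: number; sourceLine: number; matchedLines: number }

// Trim, collapse whitespace, and drop trivial or boilerplate lines.
function significantLines(source: string, firstLine = 1): SigLine[] {
  const out: SigLine[] = [];
  source.split("\n").forEach((raw, i) => {
    const text = raw.trim().replace(/\s+/g, " ");
    const trivial =
      /^[{}()\[\];,]*$/.test(text) ||  // blank lines and lone punctuation
      /^import\b/.test(text) ||        // bare imports
      /^(\/\/|\/\*|\*)/.test(text);    // comment-only lines
    if (!trivial) out.push({ line: firstLine + i, text });
  });
  return out;
}

// Hash one normalized line to an integer in [0, MOD).
function lineHash(text: string): number {
  let h = 0;
  for (let i = 0; i < text.length; i++) h = (h * BASE + text.charCodeAt(i)) % MOD;
  return h;
}

// Rolling hash of every window of k consecutive lines, updated in O(1) per step.
function windowHashes(lines: SigLine[], k: number): number[] {
  if (lines.length < k) return [];
  const hs = lines.map((l) => lineHash(l.text));
  let top = 1; // BASE^(k-1) mod MOD: the weight of the line leaving the window
  for (let i = 1; i < k; i++) top = (top * BASE) % MOD;
  let h = 0;
  for (let i = 0; i < k; i++) h = (h * BASE + hs[i]) % MOD;
  const out = [h];
  for (let i = k; i < hs.length; i++) {
    h = (h - ((hs[i - k] * top) % MOD) + MOD) % MOD; // remove the oldest line
    h = (h * BASE + hs[i]) % MOD;                    // append the newest line
    out.push(h);
  }
  return out;
}

// First run of at least k significant lines from `added` that reappears in `candidate`.
function findDuplicateRun(added: SigLine[], candidate: SigLine[], k = MIN_RUN): Match | null {
  const index = new Map<number, number[]>(); // window hash -> candidate start offsets
  windowHashes(candidate, k).forEach((h, i) => {
    const bucket = index.get(h);
    if (bucket) { bucket.push(i); } else { index.set(h, [i]); }
  });
  const addedHashes = windowHashes(added, k);
  for (let a = 0; a < addedHashes.length; a++) {
    for (const c of index.get(addedHashes[a]) ?? []) {
      // Confirm line by line so a hash collision cannot produce a finding,
      // then extend the run for as long as it keeps matching.
      let n = 0;
      while (a + n < added.length && c + n < candidate.length &&
             added[a + n].text === candidate[c + n].text) n++;
      if (n >= k) return { headLine: added[a].line, sourceLine: candidate[c].line, matchedLines: n };
    }
  }
  return null;
}
```

In this sketch `significantLines(hunkText, hunkStartLine)` would be called once per added hunk and once per candidate blob. The result carries only line numbers and a count, which is consistent with the public-safe output described above.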
Tests
25 node:test cases with a mocked fetch cover: real near-verbatim detection with correct head and source line numbers; the conservative non-detection cases (below the run threshold and boilerplate-only); candidate selection (same-extension only, the changed file itself excluded); the fail-safe paths (no token, no headSha, bad repo name, non-ok or malformed tree, a throwing or non-ok blob skipped, truncated tree, oversized file skipped, query bounding, and an already-aborted signal); and the public-safe render. The full review-enrichment suite passes.
Closes #1520