Pull request overview
This PR improves dataset quality by (1) extracting/mending missing PR author metadata and (2) adding "quality" filters (exclude self-authored PRs, require non-empty reviews) to both the Streamlit dashboard and the Rust API service.
Changes:
- Extend PR metadata extraction to derive title/author/created-at from `IssueCommentEvent` payloads when needed.
- Add an async backfill job + CLI command to populate `prs.pr_author` from the GitHub REST API and persist raw PR API responses.
- Add "Exclude self-authored" + "Require reviews" filters end-to-end (DB query → API snapshot → UI controls).
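The two quality filters described above can be sketched as a single predicate. This is an illustrative helper, not the PR's actual code; the field names `self_authored` and `has_reviews` mirror the additions listed for `model.rs`/`compute.rs`, but the record shape here is assumed:

```python
def passes_quality_filters(record, exclude_self_authored, require_reviews):
    """Sketch of the end-to-end quality filters: drop self-authored PRs
    and/or PRs without any review, depending on the active toggles."""
    if exclude_self_authored and record.get("self_authored"):
        return False
    if require_reviews and not record.get("has_reviews"):
        return False
    return True
```

Both the Streamlit dashboard and the Rust `record_matches` path would apply the same logic, so the UI checkboxes and API query params stay consistent.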
Reviewed changes
Copilot reviewed 15 out of 21 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| online/etl/pipeline/discover.py | Populate PR metadata from IssueCommentEvent payloads. |
| online/etl/pipeline/assemble.py | Same metadata fallback during assembly. |
| online/etl/pipeline/backfill_pr_author.py | New async backfill worker pool to fetch pr_author and store pr_api_raw. |
| online/etl/main.py | Wire new backfill-pr-author subcommand. |
| online/etl/db/schema.py | Add pr_api_raw column migration. |
| online/etl/dashboard/data.py | Include pr_author + has_reviews in analyses query. |
| online/etl/dashboard/app.py | Add Streamlit quality filter controls and apply them. |
| online/etl/config.py | Add an additional default chatbot username. |
| online/api_service/static/index.html | Add quality filter checkboxes and propagate query params. |
| online/api_service/src/model.rs | Add self_authored/has_reviews fields + filter params. |
| online/api_service/src/handlers.rs | Parse new query params into FilterParams. |
| online/api_service/src/db.rs | Load pr_author + has_reviews; compute self_authored. |
| online/api_service/src/compute.rs | Enforce new filters in record_matches. |
| offline/results/openai_gpt-5.2/candidates.json | Update offline results data (new model key/entries). |
| offline/results/anthropic_claude-opus-4-5-20251101/candidates.json | Update offline results data (new model key/entries). |
| .gitignore | Ignore additional local/editor files. |
```python
cfg = DBConfig(verbose=args.verbose)
if args.database_url:
    cfg.database_url = args.database_url

db = DBAdapter(cfg.database_url)
```
cmd_backfill_pr_author creates DBConfig but never builds/validates the GitHub token list (unlike cmd_enrich). If env vars are missing, this can run with an empty token and then write incorrect placeholders. Reuse the token-selection + validation logic from cmd_enrich before calling the backfill.
```python
tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token]
pool = TokenPool(tokens)
n_tokens = pool.size
n_workers = n_tokens * 10
logger.info(f"Using {n_tokens} GitHub token(s), {n_workers} workers")
```
tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token] will produce [""] when neither GITHUB_TOKENS nor GITHUB_TOKEN is set (DBConfig defaults to empty string). That leads to unauthenticated API calls and, with the current logic, can mass-update rows incorrectly. Validate tokens here (non-empty strings) and fail fast if none are configured.
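The fail-fast validation asked for above might look like the following. This is a hypothetical helper mirroring the token-selection logic the review suggests reusing from `cmd_enrich`; the `DBConfig` fields are as shown in the snippet, but the helper itself is a sketch:

```python
def select_tokens(github_tokens, github_token):
    """Return a validated, non-empty list of GitHub tokens or fail fast.

    Prefer the multi-token list, fall back to the single token, and
    reject empty/blank strings so we never run unauthenticated.
    """
    candidates = github_tokens if github_tokens else [github_token]
    tokens = [t.strip() for t in candidates if t and t.strip()]
    if not tokens:
        raise SystemExit(
            "No GitHub token configured: set GITHUB_TOKENS or GITHUB_TOKEN"
        )
    return tokens
```

With this in place, `TokenPool(select_tokens(cfg.github_tokens, cfg.github_token))` can never be constructed around `[""]`.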
```python
# API returned but no user — mark as empty string
# so we don't re-fetch
await db.execute(
    *db._translate_params(
        "UPDATE prs SET pr_author = '' WHERE id = $1",
```
The “no author” path updates pr_author to an empty string to avoid re-fetching. However _fetch_pr_author returns None for transient errors and bad credentials too (because rest_get returns None in those cases), so this can permanently mask recoverable failures. Prefer leaving pr_author as NULL (or marking status/error) unless you can distinguish a non-retriable condition (e.g., 404).
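One way to separate the non-retriable case from transient failures is to propagate the HTTP status alongside the payload. This is a sketch under that assumption (the actual `_fetch_pr_author`/`rest_get` signatures may differ):

```python
SENTINEL_NO_AUTHOR = ""  # stored to suppress re-fetching

def classify_fetch(status, payload):
    """Return the value to store in prs.pr_author, or None to leave NULL.

    Only definitive outcomes (200 with/without a user, 404) get persisted;
    transient or auth failures leave the row NULL for a later retry.
    """
    if status == 200:
        user = (payload or {}).get("user") or {}
        login = user.get("login")
        # API returned but no user: mark with sentinel so we don't re-fetch
        return login if login else SENTINEL_NO_AUTHOR
    if status == 404:
        # PR is gone for good: also safe to mark as non-retriable
        return SENTINEL_NO_AUTHOR
    # 401/403/5xx/network errors: leave NULL so a later run retries
    return None
```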
```python
try:
    pr_author, raw_data = await _fetch_pr_author(
        gh, owner, repo, pr_number
    )
    if pr_author is not None:
```
Inside the worker, only RateLimitExhaustedError is handled. Any other exception (DB errors, unexpected response, JSON decode) will bubble out and likely abort the whole batch, while batch_errors remains unused. Add a broader per-item exception handler that increments batch_errors (and optionally marks the row as error) so one bad PR doesn’t stop the run.
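The per-item error handling suggested above can be sketched as follows. The worker loop and `RateLimitExhaustedError` stand-in are hypothetical; only the structure (re-raise rate-limit errors, swallow and count everything else) is the point:

```python
class RateLimitExhaustedError(Exception):
    """Stand-in for the ETL's rate-limit error (assumed name)."""

def process_batch(items, fetch_one):
    """Process a batch item by item; one bad PR must not abort the run."""
    batch_errors = 0
    processed = 0
    for item in items:
        try:
            fetch_one(item)
            processed += 1
        except RateLimitExhaustedError:
            # Rate limits affect every remaining item: stop the batch
            raise
        except Exception:
            # DB errors, bad JSON, unexpected responses: count and move on
            batch_errors += 1
    return processed, batch_errors
```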
```python
    meta["pr_created_at"] = pr_obj.get("created_at")
elif event["type"] == "IssueCommentEvent":
    issue_obj = payload.get("issue", {})
    pr_obj = issue_obj.get("pull_request", {})
```
In the IssueCommentEvent branch, pr_obj = issue_obj.get('pull_request', {}) is assigned but never used. Either remove it or use it to guard that this IssueCommentEvent is actually tied to a PR (e.g., skip if issue_obj.get('pull_request') is missing).
Suggested change:

```diff
-    pr_obj = issue_obj.get("pull_request", {})
+    pr_obj = issue_obj.get("pull_request")
+    # Only treat IssueCommentEvent as PR-related if the issue has a pull_request field
+    if not pr_obj:
+        continue
```
online/etl/main.py (Outdated)
```python
p_bfa = sub.add_parser("backfill-pr-author", help="Backfill pr_author from GitHub API")
p_bfa.add_argument("--database-url")
p_bfa.add_argument("--batch-size", type=int, default=5000)
p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
```
The backfill-pr-author CLI doesn’t accept --github-token/--github-tokens, and the command implementation doesn’t validate that any token is configured. This makes the command hard to use and risks running unauthenticated. Consider adding token flags (matching enrich) and failing fast if no usable token(s) are present.
Suggested change:

```diff
     p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
+    p_bfa.add_argument(
+        "--github-token",
+        help="GitHub token to use for API requests (single token; may be specified multiple times)",
+        action="append",
+        dest="github_tokens",
+    )
+    p_bfa.add_argument(
+        "--github-tokens",
+        help="Comma-separated list of GitHub tokens to use for API requests",
+    )
```
Git commit author names don't reliably match GitHub usernames (e.g. "Copilot" vs "copilot[bot]"), so the commits fallback is now behind a --use-commits flag (off by default). Made-with: Cursor
Previously 301 Moved Permanently was treated as an error, causing all API calls for renamed repos to fail. Now follows the Location header, which GitHub provides with the stable repo ID URL. Made-with: Cursor
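The redirect handling described in this commit can be sketched as a small helper. This is illustrative only, not the PR's actual code; the response shape (status + headers dict) is an assumption:

```python
def next_request_url(status, headers):
    """On a 301 from the GitHub API (renamed repo), return the URL to
    retry with from the Location header; otherwise return None."""
    if status == 301:
        location = headers.get("Location") or headers.get("location")
        if location:
            # GitHub redirects to the stable /repositories/{id}/... URL
            return location
    return None
```

The caller would re-issue the request against the returned URL instead of treating the 301 as a failed call.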
- ON CONFLICT now merges new BQ events with existing ones (deduplicated by event_id) instead of discarding them. This means subsequent discover runs will capture merge/close events that weren't in the original scan.
- Add repo_id BIGINT column extracted from BQ repo.id (stable across renames/transfers). Future: change unique constraint to use repo_id.
- Add backfill-metadata command to backfill pr_merged from pr_api_raw and repo_id from both bq_events and pr_api_raw.

Made-with: Cursor
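The merge-on-conflict semantics amount to a dedup by event id, sketched here as a hypothetical helper (the real merge presumably happens inside the SQL upsert, and the event shape is assumed):

```python
def merge_events(existing, incoming):
    """Combine stored events with newly scanned BQ events, deduplicated
    by event id, keeping the first-seen copy of each event."""
    seen = set()
    merged = []
    for event in list(existing) + list(incoming):
        event_id = event.get("id")
        if event_id in seen:
            continue
        seen.add(event_id)
        merged.append(event)
    return merged
```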
- _extract_pr_metadata: True always wins over False for pr_merged (previously a close-without-merge event could mask a later merge)
- backfill-metadata now also:
  - Corrects pr_merged=False rows where API says merged=True
  - Patches assembled JSON to match corrected pr_merged values
  - Reports remaining gaps after backfill
- Removed redundant Phase 2 (bq_events won't have repo_id yet)
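The "True always wins" rule for pr_merged amounts to a small fold over observed values, sketched here as a hypothetical helper:

```python
def combine_pr_merged(current, observed):
    """Fold one observed pr_merged value into the current one.

    A merge event (True) is definitive and can never be overwritten by a
    later close-without-merge (False); None means "no signal yet".
    """
    if current is True or observed is True:
        return True
    if observed is None:
        return current
    return observed
```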
New command fetches pr_api_raw from GitHub API for analyzed PRs missing it, and sets pr_merged + repo_id from the response. Uses the same async worker pool with rate limit handling. Also adds progress logging every 10k rows to all backfill-metadata phases. Made-with: Cursor
Made-with: Cursor
…ch backfill phases

- Only analyze merged PRs: open/closed-without-merge PRs stay in 'assembled' until lifecycle completes. Rust API also filters.
- Reset PR status to 'pending' when discover merges new BQ events, triggering re-enrich/assemble/analyze with the updated timeline.
- Batch Phase 2 and Phase 3 of backfill-metadata (5k rows at a time) to avoid OOM on large datasets with big assembled/pr_api_raw blobs.

Made-with: Cursor