Pull request overview
This PR improves dataset quality by (1) extracting/mending missing PR author metadata and (2) adding "quality" filters (exclude self-authored PRs, require non-empty reviews) to both the Streamlit dashboard and the Rust API service.
Changes:
- Extend PR metadata extraction to derive title/author/created-at from `IssueCommentEvent` payloads when needed.
- Add an async backfill job + CLI command to populate `prs.pr_author` from the GitHub REST API and persist raw PR API responses.
- Add "Exclude self-authored" + "Require reviews" filters end-to-end (DB query → API snapshot → UI controls).
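The two quality filters described above can be sketched as a single predicate. This is an illustrative helper, not the PR's actual code; the field names `self_authored` and `has_reviews` mirror the additions listed for `model.rs`/`compute.rs`, but the record shape here is assumed:

```python
def passes_quality_filters(record, exclude_self_authored, require_reviews):
    """Sketch of the end-to-end quality filters: drop self-authored PRs
    and/or PRs without any review, depending on the active toggles."""
    if exclude_self_authored and record.get("self_authored"):
        return False
    if require_reviews and not record.get("has_reviews"):
        return False
    return True
```

Both the Streamlit dashboard and the Rust `record_matches` path would apply the same logic, so the UI checkboxes and API query params stay consistent.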
Reviewed changes
Copilot reviewed 15 out of 21 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| online/etl/pipeline/discover.py | Populate PR metadata from IssueCommentEvent payloads. |
| online/etl/pipeline/assemble.py | Same metadata fallback during assembly. |
| online/etl/pipeline/backfill_pr_author.py | New async backfill worker pool to fetch pr_author and store pr_api_raw. |
| online/etl/main.py | Wire new backfill-pr-author subcommand. |
| online/etl/db/schema.py | Add pr_api_raw column migration. |
| online/etl/dashboard/data.py | Include pr_author + has_reviews in analyses query. |
| online/etl/dashboard/app.py | Add Streamlit quality filter controls and apply them. |
| online/etl/config.py | Add an additional default chatbot username. |
| online/api_service/static/index.html | Add quality filter checkboxes and propagate query params. |
| online/api_service/src/model.rs | Add self_authored/has_reviews fields + filter params. |
| online/api_service/src/handlers.rs | Parse new query params into FilterParams. |
| online/api_service/src/db.rs | Load pr_author + has_reviews; compute self_authored. |
| online/api_service/src/compute.rs | Enforce new filters in record_matches. |
| offline/results/openai_gpt-5.2/candidates.json | Update offline results data (new model key/entries). |
| offline/results/anthropic_claude-opus-4-5-20251101/candidates.json | Update offline results data (new model key/entries). |
| .gitignore | Ignore additional local/editor files. |
```python
cfg = DBConfig(verbose=args.verbose)
if args.database_url:
    cfg.database_url = args.database_url

db = DBAdapter(cfg.database_url)
```
cmd_backfill_pr_author creates DBConfig but never builds/validates the GitHub token list (unlike cmd_enrich). If env vars are missing, this can run with an empty token and then write incorrect placeholders. Reuse the token-selection + validation logic from cmd_enrich before calling the backfill.
```python
tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token]
pool = TokenPool(tokens)
n_tokens = pool.size
n_workers = n_tokens * 10
logger.info(f"Using {n_tokens} GitHub token(s), {n_workers} workers")
```
tokens = cfg.github_tokens if cfg.github_tokens else [cfg.github_token] will produce [""] when neither GITHUB_TOKENS nor GITHUB_TOKEN is set (DBConfig defaults to empty string). That leads to unauthenticated API calls and, with the current logic, can mass-update rows incorrectly. Validate tokens here (non-empty strings) and fail fast if none are configured.
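The fail-fast validation asked for above might look like the following. This is a hypothetical helper mirroring the token-selection logic the review suggests reusing from `cmd_enrich`; the `DBConfig` fields are as shown in the snippet, but the helper itself is a sketch:

```python
def select_tokens(github_tokens, github_token):
    """Return a validated, non-empty list of GitHub tokens or fail fast.

    Prefer the multi-token list, fall back to the single token, and
    reject empty/blank strings so we never run unauthenticated.
    """
    candidates = github_tokens if github_tokens else [github_token]
    tokens = [t.strip() for t in candidates if t and t.strip()]
    if not tokens:
        raise SystemExit(
            "No GitHub token configured: set GITHUB_TOKENS or GITHUB_TOKEN"
        )
    return tokens
```

With this in place, `TokenPool(select_tokens(cfg.github_tokens, cfg.github_token))` can never be constructed around `[""]`.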
```python
# API returned but no user — mark as empty string
# so we don't re-fetch
await db.execute(
    *db._translate_params(
        "UPDATE prs SET pr_author = '' WHERE id = $1",
```
The “no author” path updates pr_author to an empty string to avoid re-fetching. However _fetch_pr_author returns None for transient errors and bad credentials too (because rest_get returns None in those cases), so this can permanently mask recoverable failures. Prefer leaving pr_author as NULL (or marking status/error) unless you can distinguish a non-retriable condition (e.g., 404).
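One way to separate the non-retriable case from transient failures is to propagate the HTTP status alongside the payload. This is a sketch under that assumption (the actual `_fetch_pr_author`/`rest_get` signatures may differ):

```python
SENTINEL_NO_AUTHOR = ""  # stored to suppress re-fetching

def classify_fetch(status, payload):
    """Return the value to store in prs.pr_author, or None to leave NULL.

    Only definitive outcomes (200 with/without a user, 404) get persisted;
    transient or auth failures leave the row NULL for a later retry.
    """
    if status == 200:
        user = (payload or {}).get("user") or {}
        login = user.get("login")
        # API returned but no user: mark with sentinel so we don't re-fetch
        return login if login else SENTINEL_NO_AUTHOR
    if status == 404:
        # PR is gone for good: also safe to mark as non-retriable
        return SENTINEL_NO_AUTHOR
    # 401/403/5xx/network errors: leave NULL so a later run retries
    return None
```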
```python
try:
    pr_author, raw_data = await _fetch_pr_author(
        gh, owner, repo, pr_number
    )
    if pr_author is not None:
```
Inside the worker, only RateLimitExhaustedError is handled. Any other exception (DB errors, unexpected response, JSON decode) will bubble out and likely abort the whole batch, while batch_errors remains unused. Add a broader per-item exception handler that increments batch_errors (and optionally marks the row as error) so one bad PR doesn’t stop the run.
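The per-item error handling suggested above can be sketched as follows. The worker loop and `RateLimitExhaustedError` stand-in are hypothetical; only the structure (re-raise rate-limit errors, swallow and count everything else) is the point:

```python
class RateLimitExhaustedError(Exception):
    """Stand-in for the ETL's rate-limit error (assumed name)."""

def process_batch(items, fetch_one):
    """Process a batch item by item; one bad PR must not abort the run."""
    batch_errors = 0
    processed = 0
    for item in items:
        try:
            fetch_one(item)
            processed += 1
        except RateLimitExhaustedError:
            # Rate limits affect every remaining item: stop the batch
            raise
        except Exception:
            # DB errors, bad JSON, unexpected responses: count and move on
            batch_errors += 1
    return processed, batch_errors
```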
```python
    meta["pr_created_at"] = pr_obj.get("created_at")
elif event["type"] == "IssueCommentEvent":
    issue_obj = payload.get("issue", {})
    pr_obj = issue_obj.get("pull_request", {})
```
In the IssueCommentEvent branch, pr_obj = issue_obj.get('pull_request', {}) is assigned but never used. Either remove it or use it to guard that this IssueCommentEvent is actually tied to a PR (e.g., skip if issue_obj.get('pull_request') is missing).
Suggested change:

```diff
-    pr_obj = issue_obj.get("pull_request", {})
+    pr_obj = issue_obj.get("pull_request")
+    # Only treat IssueCommentEvent as PR-related if the issue has a pull_request field
+    if not pr_obj:
+        continue
```
online/etl/main.py (Outdated)
```python
p_bfa = sub.add_parser("backfill-pr-author", help="Backfill pr_author from GitHub API")
p_bfa.add_argument("--database-url")
p_bfa.add_argument("--batch-size", type=int, default=5000)
p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
```
The backfill-pr-author CLI doesn’t accept --github-token/--github-tokens, and the command implementation doesn’t validate that any token is configured. This makes the command hard to use and risks running unauthenticated. Consider adding token flags (matching enrich) and failing fast if no usable token(s) are present.
Suggested change:

```diff
     p_bfa.add_argument("--max-prs", type=int, default=None, help="Max PRs to process (default: all)")
+    p_bfa.add_argument(
+        "--github-token",
+        help="GitHub token to use for API requests (single token; may be specified multiple times)",
+        action="append",
+        dest="github_tokens",
+    )
+    p_bfa.add_argument(
+        "--github-tokens",
+        help="Comma-separated list of GitHub tokens to use for API requests",
+    )
```
Git commit author names don't reliably match GitHub usernames (e.g. "Copilot" vs "copilot[bot]"), so the commits fallback is now behind a --use-commits flag (off by default). Made-with: Cursor
Previously 301 Moved Permanently was treated as an error, causing all API calls for renamed repos to fail. Now follows the Location header, which GitHub provides with the stable repo ID URL. Made-with: Cursor
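The redirect handling described in this commit can be sketched as a small helper. This is illustrative only, not the PR's actual code; the response shape (status + headers dict) is an assumption:

```python
def next_request_url(status, headers):
    """On a 301 from the GitHub API (renamed repo), return the URL to
    retry with from the Location header; otherwise return None."""
    if status == 301:
        location = headers.get("Location") or headers.get("location")
        if location:
            # GitHub redirects to the stable /repositories/{id}/... URL
            return location
    return None
```

The caller would re-issue the request against the returned URL instead of treating the 301 as a failed call.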
- ON CONFLICT now merges new BQ events with existing ones (deduplicated by event_id) instead of discarding them. This means subsequent discover runs will capture merge/close events that weren't in the original scan.
- Add repo_id BIGINT column extracted from BQ repo.id (stable across renames/transfers). Future: change unique constraint to use repo_id.
- Add backfill-metadata command to backfill pr_merged from pr_api_raw and repo_id from both bq_events and pr_api_raw.

Made-with: Cursor
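The merge-on-conflict semantics amount to a dedup by event id, sketched here as a hypothetical helper (the real merge presumably happens inside the SQL upsert, and the event shape is assumed):

```python
def merge_events(existing, incoming):
    """Combine stored events with newly scanned BQ events, deduplicated
    by event id, keeping the first-seen copy of each event."""
    seen = set()
    merged = []
    for event in list(existing) + list(incoming):
        event_id = event.get("id")
        if event_id in seen:
            continue
        seen.add(event_id)
        merged.append(event)
    return merged
```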
- _extract_pr_metadata: True always wins over False for pr_merged (previously a close-without-merge event could mask a later merge)
- backfill-metadata now also:
  - Corrects pr_merged=False rows where API says merged=True
  - Patches assembled JSON to match corrected pr_merged values
  - Reports remaining gaps after backfill
- Removed redundant Phase 2 (bq_events won't have repo_id yet)
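The "True always wins" rule for pr_merged amounts to a small fold over observed values, sketched here as a hypothetical helper:

```python
def combine_pr_merged(current, observed):
    """Fold one observed pr_merged value into the current one.

    A merge event (True) is definitive and can never be overwritten by a
    later close-without-merge (False); None means "no signal yet".
    """
    if current is True or observed is True:
        return True
    if observed is None:
        return current
    return observed
```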
New command fetches pr_api_raw from GitHub API for analyzed PRs missing it, and sets pr_merged + repo_id from the response. Uses the same async worker pool with rate limit handling. Also adds progress logging every 10k rows to all backfill-metadata phases. Made-with: Cursor
Made-with: Cursor
…ch backfill phases

- Only analyze merged PRs: open/closed-without-merge PRs stay in 'assembled' until lifecycle completes. Rust API also filters.
- Reset PR status to 'pending' when discover merges new BQ events, triggering re-enrich/assemble/analyze with the updated timeline.
- Batch Phase 2 and Phase 3 of backfill-metadata (5k rows at a time) to avoid OOM on large datasets with big assembled/pr_api_raw blobs.

Made-with: Cursor