add Phase 0 MVP contract for onboarding first non-Transfermarkt source by dcaribou · Pull Request #345 · dcaribou/transfermarkt-datasets

dcaribou · 2026-02-16T16:54:34Z

No description provided.

Add raw_data_path() utility to centralize the data/raw/<source>/<season>/<asset>.<ext> convention. Update both acquirer scripts to use it instead of hard-coded paths. Update data/raw/.gitignore to wildcard-ignore all source directories. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
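The path convention above can be sketched as follows. This is a minimal illustration, not the PR's actual helper: the signature, the `RAW_ROOT` constant, the default extension, and the validation rules are all assumptions; only the `data/raw/<source>/<season>/<asset>.<ext>` layout comes from the commit message.

```python
from pathlib import Path

# Assumed root; the commit only states the data/raw/<source>/<season>/<asset>.<ext> layout.
RAW_ROOT = Path("data/raw")


def raw_data_path(source: str, season: str, asset: str, ext: str = "json") -> Path:
    """Build data/raw/<source>/<season>/<asset>.<ext> (hypothetical signature)."""
    for name, value in (("source", source), ("season", season), ("asset", asset)):
        # Reject empty or path-like components so callers cannot escape the layout.
        if not value or "/" in value or value in (".", ".."):
            raise ValueError(f"invalid {name}: {value!r}")
    # Accept both "csv" and ".csv" for the extension.
    return RAW_ROOT / source / season / f"{asset}.{ext.lstrip('.')}"
```

An acquirer would then call something like `raw_data_path("openfootball", "2024-25", "en.1")` instead of assembling the path by hand.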

Add scripts/acquiring/openfootball.py that downloads match data from the openfootball/football.json GitHub repo into data/raw/openfootball/<season>/. Supports season ranges and configurable competition codes. Add just recipe acquire_openfootball for local execution. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
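A rough sketch of how season ranges and competition codes might expand into download targets. The function names are hypothetical, and the URL layout (`<season>/<code>.json` on the `master` branch of openfootball/football.json) is an assumption about that repository, not something the commit message states.

```python
from urllib.request import urlopen

# Assumption: files are served as <season>/<code>.json from the repo's master branch.
BASE_URL = "https://raw.githubusercontent.com/openfootball/football.json/master"


def season_labels(first_start_year: int, last_start_year: int) -> list[str]:
    """Expand a range of season start years into labels such as '2023-24'."""
    if last_start_year < first_start_year:
        raise ValueError("season range is reversed")
    return [f"{y}-{(y + 1) % 100:02d}" for y in range(first_start_year, last_start_year + 1)]


def asset_urls(seasons: list[str], competitions: list[str]) -> list[tuple[str, str, str]]:
    """Return (season, competition_code, url) for every season/competition pair."""
    return [(s, c, f"{BASE_URL}/{s}/{c}.json") for s in seasons for c in competitions]


def download(url: str) -> bytes:
    """Fetch one file; network errors propagate to the caller."""
    with urlopen(url, timeout=30) as response:
        return response.read()
```

Keeping URL construction separate from the network call makes the season and competition expansion easy to check without hitting GitHub.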

Add data/raw/openfootball.dvc for DVC tracking of acquired OpenFootball data. Add .github/workflows/acquire-openfootball.yml with weekly scheduled runs and manual dispatch. Workflow validates non-empty downloads before DVC push. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
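The "validate non-empty downloads" gate could look roughly like the sketch below. The function name and messages are hypothetical, and the JSON parse is an extra check that the commit message does not mention.

```python
import json
from pathlib import Path


def validate_downloads(root: Path) -> list[str]:
    """Return a list of problems; an empty list means the download looks usable."""
    files = sorted(root.rglob("*.json"))
    if not files:
        return [f"no JSON files found under {root}"]
    problems = []
    for path in files:
        if path.stat().st_size == 0:
            problems.append(f"{path}: empty file")
            continue
        try:
            # Extra check (assumption): make sure the payload is parseable JSON.
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path}: invalid JSON ({exc.msg})")
    return problems
```

A workflow step would exit non-zero when the returned list is non-empty, so the DVC push never runs on a bad download.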

Create dbt source definition and three staging models:
- stg_openfootball_competitions: distinct competitions from JSON files
- stg_openfootball_clubs: distinct clubs from match team names
- stg_openfootball_games: unnested match data with deterministic game IDs

All models build successfully with PK uniqueness and not-null tests. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
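One way to obtain a deterministic game ID is to hash a match's natural key. The sketch below is in Python for illustration only; the staging model itself is SQL, and the choice of key fields, the normalisation, and the use of MD5 are all assumptions rather than the PR's actual definition.

```python
import hashlib


def game_id(competition: str, season: str, match_date: str, home: str, away: str) -> str:
    """Derive a stable ID from a match's natural key (field choice is an assumption)."""
    # Normalise case and whitespace so cosmetic differences do not change the ID.
    parts = [" ".join(p.lower().split()) for p in (competition, season, match_date, home, away)]
    # A separator keeps ("ab", "c") and ("a", "bc") from producing the same key.
    key = "|".join(parts)
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Because the ID depends only on the input fields, re-running the pipeline on the same JSON yields the same IDs, which is what lets a primary-key uniqueness test stay meaningful across builds.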

…es (#337) Add map_competition_ids, map_club_ids, and map_match_ids models that reconcile source-native IDs into canonical IDs. Competition mapping uses a manually maintained lookup. Club mapping uses exact and substring name matching within the same competition. Match mapping uses date + canonical club IDs. Each mapping includes confidence score and resolution method to flag ambiguous mappings. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
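The club-mapping rule described above (exact match first, then substring match within the same competition, with a confidence score and resolution method) might be sketched like this. The class, function names, method labels, and the confidence values 1.0 and 0.8 are illustrative assumptions, not the values used in the PR's dbt models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClubMapping:
    source_name: str
    canonical_id: Optional[int]
    confidence: float
    method: str


def _norm(name: str) -> str:
    """Lowercase and collapse whitespace before comparing names."""
    return " ".join(name.lower().split())


def map_club(source_name: str, canonical: dict[int, str]) -> ClubMapping:
    """Map one club name against the canonical clubs of the same competition."""
    target = _norm(source_name)
    exact = [cid for cid, name in canonical.items() if _norm(name) == target]
    if len(exact) == 1:
        return ClubMapping(source_name, exact[0], 1.0, "exact")
    partial = [
        cid for cid, name in canonical.items()
        if target in _norm(name) or _norm(name) in target
    ]
    if len(partial) == 1:
        return ClubMapping(source_name, partial[0], 0.8, "substring")
    if partial:
        # Several candidates: leave unresolved rather than guess.
        return ClubMapping(source_name, None, 0.0, "ambiguous")
    return ClubMapping(source_name, None, 0.0, "unmatched")
```

Restricting `canonical` to clubs from one competition keeps substring matching from pairing clubs with similar names in different leagues.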

Update competitions, clubs, and games curated models to union Transfermarkt (primary) and OpenFootball (unmatched only) data via canonical mappings. Add source_system, source_record_id, and ingested_at provenance columns. Widen test row-count thresholds and manager_name missing tolerances to accommodate OpenFootball rows. Existing core columns preserved unchanged. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
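The "Transfermarkt primary, OpenFootball unmatched only" union can be illustrated with plain dictionaries. This is a sketch of the rule, not the curated dbt SQL: the row shape, the `id` key, the `mapping` structure, and the source labels are assumptions; only the three provenance column names come from the commit message.

```python
from datetime import datetime, timezone


def union_curated(tm_rows, of_rows, mapping, ingested_at=None):
    """Union Transfermarkt rows with OpenFootball rows that have no canonical match.

    mapping: OpenFootball record ID -> canonical ID, or None when unmatched.
    """
    ingested_at = ingested_at or datetime.now(timezone.utc).isoformat()
    curated = [
        {**row, "source_system": "transfermarkt",
         "source_record_id": str(row["id"]), "ingested_at": ingested_at}
        for row in tm_rows
    ]
    for row in of_rows:
        if mapping.get(row["id"]) is not None:
            continue  # Already represented by the primary source.
        curated.append({**row, "source_system": "openfootball",
                        "source_record_id": str(row["id"]), "ingested_at": ingested_at})
    return curated
```

Since matched OpenFootball rows are dropped, every existing Transfermarkt row passes through with its core columns untouched, and the secondary source only adds rows.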

Add docs/field-level-provenance.md specifying source precedence and derivation rules for curated fields. Add dbt/field_provenance.yml as a machine-readable artifact documenting which source field(s) contribute to each canonical field across competitions, clubs, and games models. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb

Add scripts/profiling/profile_models.py that computes row counts, null ratios, cardinality, and value ranges per source/model. Compares against stored baselines and flags drift exceeding configurable thresholds (10% row decrease, 50% row increase, 5pp null ratio increase, 10% cardinality decrease). Add just recipes for running checks and updating baselines. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
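The drift thresholds listed above translate into a comparison like the one sketched here. The `Profile` shape, the flag names, and the strict (exclusive) comparisons are assumptions; the four threshold values are taken from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    row_count: int
    null_ratio: float   # fraction of nulls in a column, 0.0 to 1.0
    cardinality: int    # distinct values in a column


def drift_flags(baseline: Profile, current: Profile) -> list[str]:
    """Compare a current profile to its baseline and name every threshold exceeded."""
    flags = []
    if baseline.row_count > 0:
        change = (current.row_count - baseline.row_count) / baseline.row_count
        if change < -0.10:          # more than a 10% row decrease
            flags.append("row_count_decrease")
        elif change > 0.50:         # more than a 50% row increase
            flags.append("row_count_increase")
    # Null ratio is compared in percentage points, not relative change.
    if current.null_ratio - baseline.null_ratio > 0.05:
        flags.append("null_ratio_increase")
    if baseline.cardinality > 0:
        if (current.cardinality - baseline.cardinality) / baseline.cardinality < -0.10:
            flags.append("cardinality_decrease")
    return flags
```

The asymmetric row-count limits reflect that losing rows is usually more alarming than gaining them: a 10% drop is flagged, while growth is tolerated up to 50%.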

Add provenance column validation step that checks source_system, source_record_id, and ingested_at exist and are populated in curated competitions, clubs, and games. Add profiling drift check step that runs before data commit/push. Both checks block the build on failure. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
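A provenance check of this kind could be as small as the sketch below. It works on rows already loaded as dictionaries, which is an assumption made to keep the example dependency-free; the actual step presumably queries the built database. The function name and message format are hypothetical.

```python
PROVENANCE_COLUMNS = ("source_system", "source_record_id", "ingested_at")


def provenance_errors(model: str, rows: list[dict]) -> list[str]:
    """List provenance problems for one curated model; empty means it passes."""
    errors = []
    for column in PROVENANCE_COLUMNS:
        # A column can fail in two ways: absent entirely, or present but unpopulated.
        missing = sum(1 for row in rows if column not in row)
        empty = sum(1 for row in rows if column in row and row[column] in (None, ""))
        if missing:
            errors.append(f"{model}.{column}: column missing in {missing} row(s)")
        if empty:
            errors.append(f"{model}.{column}: {empty} unpopulated row(s)")
    return errors
```

To block the build, the calling step would run this for competitions, clubs, and games and exit with a non-zero status if any list is non-empty.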

Rewrite README to reflect canonical multi-source identity. Add architecture diagram showing source -> staging -> canonical -> curated pipeline. Document source lineage columns, OpenFootball as second source, quality/profiling framework, and updated acquirer/workflow tables. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb

Add docs/plans/phase0-exit-report.md evaluating all five Phase 0 success criteria (functional completeness, contract correctness, quality gates, backward compatibility, operability) as ACHIEVED. Document deliverables, known gaps, and prioritized Phase 1 follow-ups. Recommendation: GO. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb

Adds scripts/provenance/trace_field_lineage.py which parses compiled dbt SQL from the manifest and uses sqlglot's column-level lineage API to trace every curated model field back to its base/staging model sources. The script writes a _field_provenance table to DuckDB and creates source() and sources() table macros so users can query provenance directly in SQL:

    SELECT * FROM source('games', 'home_club_name');
    SELECT * FROM sources('games');

Removes the manually maintained docs/field-level-provenance.md and dbt/field_provenance.yml in favor of this automated approach. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb

dcaribou and others added 14 commits February 16, 2026 17:09

add Phase 0 MVP contract for onboarding first non-Transfermarkt source

a04c47f

refactor competition ID mappings and update tests for consistency checks

e5be911

dcaribou wants to merge 14 commits into master from claude/phase-0-tasks-4kFsn

2 participants
