add Phase 0 MVP contract for onboarding first non-Transfermarkt source #345
Open
Add raw_data_path() utility to centralize the data/raw/<source>/<season>/<asset>.<ext> convention. Update both acquirer scripts to use it instead of hard-coded paths. Update data/raw/.gitignore to wildcard-ignore all source directories. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
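A minimal sketch of what a helper like raw_data_path() might look like, assuming it only joins the path segments named in the convention above. The signature, the RAW_ROOT constant, and the default extension are assumptions for illustration, not the code in this PR.

```python
from pathlib import Path

# Assumed repository-relative root for raw data (hypothetical constant).
RAW_ROOT = Path("data/raw")


def raw_data_path(source: str, season: str, asset: str, ext: str = "json") -> Path:
    """Build data/raw/<source>/<season>/<asset>.<ext> without touching the disk."""
    for name, value in (("source", source), ("season", season), ("asset", asset)):
        # Reject empty segments and anything that could escape the raw root.
        if not value or "/" in value or value in {".", ".."}:
            raise ValueError(f"invalid {name}: {value!r}")
    return RAW_ROOT / source / season / f"{asset}.{ext.lstrip('.')}"
```

Acquirer scripts could then call raw_data_path("openfootball", "2023-24", "en.1") instead of formatting the path by hand.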
Add scripts/acquiring/openfootball.py that downloads match data from the openfootball/football.json GitHub repo into data/raw/openfootball/<season>/. Supports season ranges and configurable competition codes. Add just recipe acquire_openfootball for local execution. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
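The season-range and competition-code handling could be sketched as below. The function names, the "YYYY-YY" season label format, and the raw-file URL layout (BASE_URL plus <season>/<code>.json) are assumptions for illustration; the actual script may differ. The sketch only builds URLs and performs no network access.

```python
# Assumed base URL for raw files in the openfootball/football.json repo.
BASE_URL = "https://raw.githubusercontent.com/openfootball/football.json/master"


def season_labels(first_start_year: int, last_start_year: int) -> list[str]:
    """Expand an inclusive range of start years into labels such as '2022-23'."""
    if last_start_year < first_start_year:
        raise ValueError("last_start_year must not precede first_start_year")
    return [
        f"{year}-{(year + 1) % 100:02d}"
        for year in range(first_start_year, last_start_year + 1)
    ]


def build_urls(seasons: list[str], competitions: list[str]) -> dict[tuple[str, str], str]:
    """Map each (season, competition code) pair to the file URL to download."""
    return {
        (season, code): f"{BASE_URL}/{season}/{code}.json"
        for season in seasons
        for code in competitions
    }
```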
Add data/raw/openfootball.dvc for DVC tracking of acquired OpenFootball data. Add .github/workflows/acquire-openfootball.yml with weekly scheduled runs and manual dispatch. Workflow validates non-empty downloads before DVC push. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
Create dbt source definition and three staging models:
- stg_openfootball_competitions: distinct competitions from JSON files
- stg_openfootball_clubs: distinct clubs from match team names
- stg_openfootball_games: unnested match data with deterministic game IDs

All models build successfully with PK uniqueness and not-null tests. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
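One way to get deterministic game IDs is to hash a normalized natural key. The sketch below shows the idea in Python; the choice of key fields, the normalization, and the use of a truncated SHA-256 digest are assumptions, and the staging model itself is dbt SQL that may build its key differently.

```python
import hashlib


def game_id(competition_code: str, season: str, match_date: str,
            home_team: str, away_team: str) -> str:
    """Derive a stable ID from a match's natural key (hypothetical key fields)."""
    parts = (competition_code, season, match_date, home_team, away_team)
    # Normalize case and whitespace so cosmetic differences do not change the ID.
    key = "|".join(" ".join(part.lower().split()) for part in parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because the ID depends only on the input fields, re-running the staging build over the same raw files yields the same IDs, which is what makes the PK uniqueness test meaningful across runs.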
…es (#337) Add map_competition_ids, map_club_ids, and map_match_ids models that reconcile source-native IDs into canonical IDs. Competition mapping uses a manually maintained lookup. Club mapping uses exact and substring name matching within the same competition. Match mapping uses date + canonical club IDs. Each mapping includes confidence score and resolution method to flag ambiguous mappings. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
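The club-mapping rule (exact match first, then substring match within the same competition, with a confidence score and resolution method) can be illustrated as follows. The ClubMapping type, the normalization, and the confidence values 1.0 and 0.8 are assumptions chosen for the example, not values taken from the models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClubMapping:
    source_name: str
    canonical_id: Optional[int]
    confidence: float
    method: str


def _normalize(name: str) -> str:
    return " ".join(name.lower().split())


def map_club(source_name: str, canonical: dict) -> ClubMapping:
    """Resolve a source club name against the canonical clubs of ONE competition.

    `canonical` maps canonical club name -> canonical club ID.
    """
    target = _normalize(source_name)
    by_name = {_normalize(name): club_id for name, club_id in canonical.items()}
    if target in by_name:
        return ClubMapping(source_name, by_name[target], 1.0, "exact")
    # Substring match in either direction, e.g. "Arsenal" vs "Arsenal FC".
    hits = sorted({cid for name, cid in by_name.items() if target in name or name in target})
    if len(hits) == 1:
        return ClubMapping(source_name, hits[0], 0.8, "substring")
    # Several candidates: leave unresolved and flag it rather than guess.
    method = "ambiguous" if hits else "unmatched"
    return ClubMapping(source_name, None, 0.0, method)
```

Keeping the method alongside the score lets downstream models filter on how a mapping was resolved, not only on how confident it is.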
Update competitions, clubs, and games curated models to union Transfermarkt (primary) and OpenFootball (unmatched only) data via canonical mappings. Add source_system, source_record_id, and ingested_at provenance columns. Widen test row-count thresholds and manager_name missing tolerances to accommodate OpenFootball rows. Existing core columns preserved unchanged. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
Add docs/field-level-provenance.md specifying source precedence and derivation rules for curated fields. Add dbt/field_provenance.yml as a machine-readable artifact documenting which source field(s) contribute to each canonical field across competitions, clubs, and games models. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
Add scripts/profiling/profile_models.py that computes row counts, null ratios, cardinality, and value ranges per source/model. Compares against stored baselines and flags drift exceeding configurable thresholds (10% row decrease, 50% row increase, 5pp null ratio increase, 10% cardinality decrease). Add just recipes for running checks and updating baselines. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
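The drift rules listed above could be sketched like this. The Profile type, the threshold key names, and the strict "greater than" comparisons are assumptions; only the threshold values come from the commit message. For brevity the sketch tracks a single null ratio and cardinality rather than one per column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    row_count: int
    null_ratio: float  # fraction between 0 and 1
    cardinality: int


# Threshold values from the commit message; the key names are hypothetical.
THRESHOLDS = {
    "row_decrease": 0.10,           # relative
    "row_increase": 0.50,           # relative
    "null_ratio_increase_pp": 5.0,  # percentage points
    "cardinality_decrease": 0.10,   # relative
}


def detect_drift(baseline: Profile, current: Profile, thresholds: dict = THRESHOLDS) -> list:
    """Return the names of all drift checks that `current` fails against `baseline`."""
    flags = []
    if baseline.row_count > 0:
        change = (current.row_count - baseline.row_count) / baseline.row_count
        if change < -thresholds["row_decrease"]:
            flags.append("row_decrease")
        elif change > thresholds["row_increase"]:
            flags.append("row_increase")
    # Null ratio is compared in absolute percentage points, not relative terms.
    if (current.null_ratio - baseline.null_ratio) * 100 > thresholds["null_ratio_increase_pp"]:
        flags.append("null_ratio_increase")
    if baseline.cardinality > 0:
        change = (current.cardinality - baseline.cardinality) / baseline.cardinality
        if change < -thresholds["cardinality_decrease"]:
            flags.append("cardinality_decrease")
    return flags
```

The asymmetric row thresholds reflect that a shrinking table is usually more alarming than a growing one when a new source is being added.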
Add provenance column validation step that checks source_system, source_record_id, and ingested_at exist and are populated in curated competitions, clubs, and games. Add profiling drift check step that runs before data commit/push. Both checks block the build on failure. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
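A minimal sketch of the provenance check, assuming rows are available as dictionaries (for example, fetched from the curated tables). The function name and error format are hypothetical; a build step would exit non-zero when the returned list is not empty.

```python
PROVENANCE_COLUMNS = ("source_system", "source_record_id", "ingested_at")


def validate_provenance(model: str, rows: list) -> list:
    """Return one error string per provenance column that is absent or unpopulated."""
    errors = []
    for column in PROVENANCE_COLUMNS:
        if any(column not in row for row in rows):
            errors.append(f"{model}.{column}: column missing")
            continue
        # Treat NULLs and empty strings as unpopulated.
        empty = sum(1 for row in rows if row[column] in (None, ""))
        if empty:
            errors.append(f"{model}.{column}: {empty} unpopulated row(s)")
    return errors
```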
Rewrite README to reflect canonical multi-source identity. Add architecture diagram showing source -> staging -> canonical -> curated pipeline. Document source lineage columns, OpenFootball as second source, quality/profiling framework, and updated acquirer/workflow tables. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
Add docs/plans/phase0-exit-report.md evaluating all five Phase 0 success criteria (functional completeness, contract correctness, quality gates, backward compatibility, operability) as ACHIEVED. Document deliverables, known gaps, and prioritized Phase 1 follow-ups. Recommendation: GO. https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
Adds scripts/provenance/trace_field_lineage.py which parses compiled dbt
SQL from the manifest and uses sqlglot's column-level lineage API to
trace every curated model field back to its base/staging model sources.
The script writes a _field_provenance table to DuckDB and creates
source() and sources() table macros so users can query provenance
directly in SQL:
SELECT * FROM source('games', 'home_club_name');
SELECT * FROM sources('games');
Removes the manually maintained docs/field-level-provenance.md and
dbt/field_provenance.yml in favor of this automated approach.
https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
No description provided.