add Phase 0 MVP contract for onboarding first non-Transfermarkt source #345

Open
dcaribou wants to merge 14 commits into master from claude/phase-0-tasks-4kFsn

@dcaribou
Owner

No description provided.

dcaribou and others added 14 commits February 16, 2026 17:09
Add raw_data_path() utility to centralize the data/raw/<source>/<season>/<asset>.<ext>
convention. Update both acquirer scripts to use it instead of hard-coded paths.
Update data/raw/.gitignore to wildcard-ignore all source directories.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
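
A minimal sketch of what a helper like raw_data_path() could look like, assuming it only joins the path segments named in the convention above. The signature, the default extension, and the validation are assumptions for illustration; the PR's actual implementation is not shown here.

```python
from pathlib import Path

# Root taken from the data/raw/<source>/<season>/<asset>.<ext> convention.
RAW_DATA_ROOT = Path("data/raw")


def raw_data_path(source: str, season: str, asset: str, ext: str = "json") -> Path:
    """Build data/raw/<source>/<season>/<asset>.<ext> (hypothetical signature)."""
    for name, value in (("source", source), ("season", season), ("asset", asset)):
        # Reject empty segments and anything that could escape the layout.
        if not value or "/" in value or value in {".", ".."}:
            raise ValueError(f"invalid {name}: {value!r}")
    # Accept both "csv" and ".csv" for the extension.
    return RAW_DATA_ROOT / source / season / f"{asset}.{ext.lstrip('.')}"
```

Keeping the convention in one function means an acquirer script never formats the path itself, so a layout change touches a single place.
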
Add scripts/acquiring/openfootball.py that downloads match data from the
openfootball/football.json GitHub repo into data/raw/openfootball/<season>/.
Supports season ranges and configurable competition codes. Add just recipe
acquire_openfootball for local execution.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
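
The season-range and competition-code handling described above might be sketched as follows. The base URL, the "2023-24" folder naming, and codes such as "en.1" are assumptions about the upstream repository layout, and the function names are hypothetical. The sketch only plans the downloads and performs no network access.

```python
# Assumed raw-content base URL for the openfootball/football.json repo.
BASE_URL = "https://raw.githubusercontent.com/openfootball/football.json/master"


def season_labels(first_start_year: int, last_start_year: int) -> list[str]:
    """Expand an inclusive range of start years into labels like '2023-24'."""
    if last_start_year < first_start_year:
        raise ValueError("last_start_year must not precede first_start_year")
    return [
        f"{year}-{(year + 1) % 100:02d}"
        for year in range(first_start_year, last_start_year + 1)
    ]


def download_plan(
    first_start_year: int, last_start_year: int, competitions: list[str]
) -> list[tuple[str, str]]:
    """Return (remote URL, local path) pairs for every season and competition."""
    return [
        (
            f"{BASE_URL}/{season}/{code}.json",
            f"data/raw/openfootball/{season}/{code}.json",
        )
        for season in season_labels(first_start_year, last_start_year)
        for code in competitions
    ]
```
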
Add data/raw/openfootball.dvc for DVC tracking of acquired OpenFootball data.
Add .github/workflows/acquire-openfootball.yml with weekly scheduled runs
and manual dispatch. Workflow validates non-empty downloads before DVC push.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
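
One way the "non-empty downloads" gate could be expressed, as a sketch only. The function name and the *.json pattern are hypothetical; the workflow in this PR may well do the check in shell instead.

```python
from pathlib import Path


def validate_downloads(root: Path, pattern: str = "*.json") -> None:
    """Exit with an error if nothing was downloaded or any file is empty."""
    files = [path for path in root.rglob(pattern) if path.is_file()]
    if not files:
        raise SystemExit(f"no files matching {pattern} under {root}")
    empty = sorted(path for path in files if path.stat().st_size == 0)
    if empty:
        raise SystemExit("empty downloads: " + ", ".join(str(path) for path in empty))
```

Raising SystemExit with a message makes the process exit with a non-zero status, which is enough to stop a CI job before the DVC push step.
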
Create dbt source definition and three staging models:
- stg_openfootball_competitions: distinct competitions from JSON files
- stg_openfootball_clubs: distinct clubs from match team names
- stg_openfootball_games: unnested match data with deterministic game IDs

All models build successfully with PK uniqueness and not-null tests.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
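
To illustrate what "deterministic game IDs" can mean, here is a sketch that hashes a normalized natural key. The choice of key fields and of MD5 is an assumption; the commit does not say how stg_openfootball_games derives its IDs, and the real model does this in SQL.

```python
import hashlib


def deterministic_game_id(
    competition_code: str, season: str, match_date: str, home_team: str, away_team: str
) -> str:
    """Derive a stable ID from the match's natural key (assumed key fields)."""
    parts = [competition_code, season, match_date, home_team, away_team]
    # Normalize so cosmetic differences do not change the ID.
    key = "|".join(part.strip().lower() for part in parts)
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Because the ID depends only on the row's content, rebuilding the model from the same JSON files yields the same IDs, which keeps downstream mappings stable across runs.
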
…es (#337)

Add map_competition_ids, map_club_ids, and map_match_ids models that reconcile
source-native IDs into canonical IDs. Competition mapping uses a manually
maintained lookup. Club mapping uses exact and substring name matching within
the same competition. Match mapping uses date + canonical club IDs. Each mapping
includes confidence score and resolution method to flag ambiguous mappings.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
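
The club matching rule described above (exact name first, then substring, within one competition) could be sketched like this. The confidence values 1.0 and 0.8, the method labels, and all names are illustrative assumptions, not the values used by map_club_ids.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClubMapping:
    source_name: str
    canonical_id: Optional[int]
    method: str        # "exact", "substring", "ambiguous" or "unmatched"
    confidence: float  # illustrative scale, not taken from the PR


def map_club(source_name: str, canonical_clubs: dict[int, str]) -> ClubMapping:
    """Match one source club against the canonical clubs of a single competition."""
    needle = source_name.strip().lower()
    if not needle:
        return ClubMapping(source_name, None, "unmatched", 0.0)
    names = {cid: name.strip().lower() for cid, name in canonical_clubs.items() if name.strip()}

    exact = [cid for cid, name in names.items() if name == needle]
    if len(exact) == 1:
        return ClubMapping(source_name, exact[0], "exact", 1.0)

    # Substring match in either direction, e.g. "Arsenal" vs "Arsenal FC".
    partial = [cid for cid, name in names.items() if needle in name or name in needle]
    if len(partial) == 1:
        return ClubMapping(source_name, partial[0], "substring", 0.8)

    # Several candidates: refuse to guess and flag the row for review.
    method = "ambiguous" if partial else "unmatched"
    return ClubMapping(source_name, None, method, 0.0)
```

Recording the method next to the score lets a reviewer filter for substring and ambiguous rows without re-deriving how each mapping was made.
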
Update competitions, clubs, and games curated models to union Transfermarkt
(primary) and OpenFootball (unmatched only) data via canonical mappings.
Add source_system, source_record_id, and ingested_at provenance columns.
Widen test row-count thresholds and manager_name missing tolerances to
accommodate OpenFootball rows. Existing core columns preserved unchanged.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
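
The union rule (Transfermarkt rows always, OpenFootball rows only when unmatched) can be sketched in plain Python. The dict-based rows, the of_game_id key, and reusing the source-native ID as game_id for unmatched rows are assumptions for illustration; the curated models themselves are dbt SQL.

```python
from typing import Optional


def curate_games(
    tm_games: list[dict],
    of_games: list[dict],
    match_map: dict[str, Optional[int]],
    ingested_at: str,
) -> list[dict]:
    """Union primary rows with unmatched secondary rows, adding provenance columns."""
    curated = [
        {
            **game,
            "source_system": "transfermarkt",
            "source_record_id": str(game["game_id"]),
            "ingested_at": ingested_at,
        }
        for game in tm_games
    ]
    for game in of_games:
        source_id = game["of_game_id"]
        if match_map.get(source_id) is not None:
            continue  # already represented by the primary source
        row = {key: value for key, value in game.items() if key != "of_game_id"}
        row.update(
            game_id=source_id,  # assumption: unmatched rows keep their source ID
            source_system="openfootball",
            source_record_id=source_id,
            ingested_at=ingested_at,
        )
        curated.append(row)
    return curated
```
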
Add docs/field-level-provenance.md specifying source precedence and derivation
rules for curated fields. Add dbt/field_provenance.yml as a machine-readable
artifact documenting which source field(s) contribute to each canonical field
across competitions, clubs, and games models.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
Add scripts/profiling/profile_models.py that computes row counts, null ratios,
cardinality, and value ranges per source/model. Compares against stored baselines
and flags drift exceeding configurable thresholds (10% row decrease, 50% row
increase, 5pp null ratio increase, 10% cardinality decrease). Add just recipes
for running checks and updating baselines.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
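
The four thresholds listed above translate into a small comparison routine. This is a sketch under assumed data shapes: the profile dict layout, the field names, and the function names are hypothetical, and only the threshold values come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_row_decrease: float = 0.10          # 10% fewer rows
    max_row_increase: float = 0.50          # 50% more rows
    max_null_ratio_increase: float = 0.05   # 5 percentage points
    max_cardinality_decrease: float = 0.10  # 10% fewer distinct values


def relative_change(baseline: float, current: float) -> float:
    """Signed relative change; 0.0 when there is no baseline to compare against."""
    return (current - baseline) / baseline if baseline else 0.0


def detect_drift(baseline: dict, current: dict, t: Thresholds = Thresholds()) -> list[str]:
    """Compare two profiles and return one message per threshold breach."""
    issues = []
    rows = relative_change(baseline["row_count"], current["row_count"])
    if rows < -t.max_row_decrease:
        issues.append(f"row count decreased by {-rows:.0%}")
    elif rows > t.max_row_increase:
        issues.append(f"row count increased by {rows:.0%}")

    for column, base in baseline["columns"].items():
        cur = current["columns"].get(column)
        if cur is None:
            issues.append(f"{column}: missing from current profile")
            continue
        # Null ratio is compared in absolute percentage points, not relatively.
        if cur["null_ratio"] - base["null_ratio"] > t.max_null_ratio_increase:
            issues.append(f"{column}: null ratio increased")
        if relative_change(base["cardinality"], cur["cardinality"]) < -t.max_cardinality_decrease:
            issues.append(f"{column}: cardinality decreased")
    return issues
```

The thresholds are asymmetric on purpose in this reading: losing rows is treated as more suspicious than gaining them, which fits a dataset that grows every season.
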
Add provenance column validation step that checks source_system,
source_record_id, and ingested_at exist and are populated in curated
competitions, clubs, and games. Add profiling drift check step that
runs before data commit/push. Both checks block the build on failure.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
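
A sketch of the provenance check, assuming rows are available as dicts. The column names come from the commit message; the function name, the error format, and treating empty strings as unpopulated are assumptions.

```python
PROVENANCE_COLUMNS = ("source_system", "source_record_id", "ingested_at")


def provenance_errors(model: str, rows: list[dict]) -> list[str]:
    """Return one error per provenance column that is absent or not fully populated."""
    errors = []
    for column in PROVENANCE_COLUMNS:
        if any(column not in row for row in rows):
            errors.append(f"{model}.{column}: column missing")
            continue
        unpopulated = sum(1 for row in rows if row[column] in (None, ""))
        if unpopulated:
            errors.append(f"{model}.{column}: {unpopulated} unpopulated row(s)")
    return errors
```

A build step could run this for competitions, clubs, and games and fail when any returned list is non-empty.
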
Rewrite README to reflect canonical multi-source identity. Add architecture
diagram showing source -> staging -> canonical -> curated pipeline. Document
source lineage columns, OpenFootball as second source, quality/profiling
framework, and updated acquirer/workflow tables.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
Add docs/plans/phase0-exit-report.md evaluating all five Phase 0 success
criteria (functional completeness, contract correctness, quality gates,
backward compatibility, operability) as ACHIEVED. Document deliverables,
known gaps, and prioritized Phase 1 follow-ups. Recommendation: GO.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb
Adds scripts/provenance/trace_field_lineage.py which parses compiled dbt
SQL from the manifest and uses sqlglot's column-level lineage API to
trace every curated model field back to its base/staging model sources.

The script writes a _field_provenance table to DuckDB and creates
source() and sources() table macros so users can query provenance
directly in SQL:

    SELECT * FROM source('games', 'home_club_name');
    SELECT * FROM sources('games');

Removes the manually maintained docs/field-level-provenance.md and
dbt/field_provenance.yml in favor of this automated approach.

https://claude.ai/code/session_014BHFU7rGYdfiZXMB4zN3Mb

2 participants