This repo contains a demo project for showcasing Datafold:
- dbt project that includes
  - raw data (implemented via seed CSV files) from a fictional app
  - a few downstream models, as shown in the project DAG below
- several 'master' branches, corresponding to the various supported cloud data platforms
  - `master` - 'primary' master branch, runs in Snowflake
  - `master-databricks` - 'secondary' master branch, runs in Databricks, is reset to the `master` branch daily or manually when needed via the `branch_replication.yml` workflow
  - `master-bigquery` - 'secondary' master branch, runs in BigQuery, is reset to the `master` branch daily or manually when needed via the `branch_replication.yml` workflow
  - `master-dremio` - 'secondary' master branch, runs in Dremio, is reset to the `master` branch daily or manually when needed via the `branch_replication.yml` workflow
- several GitHub Actions workflows illustrating CI/CD best practices for dbt Core (a sketch of a PR job workflow is shown after this list)
  - dbt PR job - is triggered on PRs targeting the `master` branch, runs the dbt project in Snowflake
  - dbt prod - is triggered on pushes into the `master` branch, runs the dbt project in Snowflake
  - dbt PR job (Databricks) - is triggered on PRs targeting the `master-databricks` branch, runs the dbt project in Databricks
  - dbt prod (Databricks) - is triggered on pushes into the `master-databricks` branch, runs the dbt project in Databricks
  - dbt PR job (BigQuery) - is triggered on PRs targeting the `master-bigquery` branch, runs the dbt project in BigQuery
  - dbt prod (BigQuery) - is triggered on pushes into the `master-bigquery` branch, runs the dbt project in BigQuery
  - dbt PR job (Dremio) - is triggered on PRs targeting the `master-dremio` branch, runs the dbt project in Dremio
  - dbt prod (Dremio) - is triggered on pushes into the `master-dremio` branch, runs the dbt project in Dremio
  - Apply monitors.yaml configuration to Datafold app - applies the monitors-as-code configuration to the Datafold application
- a raw data generation tool that simulates a data flow typical of real-world projects
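For orientation, below is a minimal sketch of what a PR job of this kind can look like, assuming a `pr` target in the `demo` profile and illustrative secret names. It is not the repo's actual workflow definition, and any Datafold-specific steps (e.g. uploading dbt artifacts) are omitted.

```yaml
# Hypothetical sketch of a dbt PR job (Snowflake variant).
# Secret names, the Python version, and the `pr` target are assumptions.
name: dbt PR job (sketch)

on:
  pull_request:
    branches: [master]

jobs:
  dbt-pr:
    runs-on: ubuntu-latest
    env:
      SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
      SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
      SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
      # Used by the assumed `pr` target to build into demo.pr_num_<pr_number>
      PR_NUM: ${{ github.event.number }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-snowflake
      - name: Build the project into the PR schema
        run: dbt build --profile demo --target pr
```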
All actual changes should be committed to the `master` branch; the other `master-*` branches are reset to the `master` branch daily (a sketch of such a reset workflow is shown after the note below).
! To ensure the integrity and isolation of GitHub Actions workflows, it is advisable to create pull requests (PRs) for different 'master' branches from distinct commits. This practice helps prevent cross-PR leakage and ensures that workflows run independently.
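The repo's actual `branch_replication.yml` is not reproduced here; the following is a minimal sketch, assuming a scheduled workflow that force-pushes `master` onto each secondary branch (the cron schedule and permissions are assumptions).

```yaml
# Hypothetical sketch of a branch-reset workflow; the real branch_replication.yml may differ.
name: branch replication (sketch)

on:
  schedule:
    - cron: "0 4 * * *"   # assumed daily schedule
  workflow_dispatch:       # manual trigger when needed

permissions:
  contents: write

jobs:
  reset-secondary-masters:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        branch: [master-databricks, master-bigquery, master-dremio]
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # fetch all branches so origin/master is available
      - name: Reset ${{ matrix.branch }} to master
        run: git push --force origin origin/master:refs/heads/${{ matrix.branch }}
```

Note that force-pushing with the default `GITHUB_TOKEN` works only if the secondary branches are not protected against force pushes.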
To demonstrate the Datafold experience in CI on Snowflake, create PRs targeting the `master` branch.

- production schema in Snowflake: `demo.core`
- PR schemas: `demo.pr_num_<pr_number>`

To demonstrate the Datafold experience in CI on Databricks, create PRs targeting the `master-databricks` branch.

- production schema in Databricks: `demo.default`
- PR schemas: `demo.pr_num_<pr_number>`

To demonstrate the Datafold experience in CI on BigQuery, create PRs targeting the `master-bigquery` branch.

- production schema in BigQuery: `datafold-demo-429713.prod`
- PR schemas: `datafold-demo-429713.pr_num_<pr_number>`

To demonstrate the Datafold experience in CI on Dremio, create PRs targeting the `master-dremio` branch.

- production schema in Dremio: `"Alexey S3".alexeydremiobucket.prod`
- PR schemas: `"Alexey S3".alexeydremiobucket.pr_num_<pr_number>`
To demonstrate Datafold's data replication monitoring functionality, a pre-configured Postgres instance (simulating a transactional database) is populated with 'correct raw data' (the `analytics.data_source.subscription_created` table), while the `subscription__created` seed CSV file contains 'corrupted raw data'.
- Looker view, explore, and dashboard are connected to the `fct__monthly__financials` model in Snowflake, Databricks, and BigQuery.
  - Snowflake
    - `fct__monthly__financials` view
    - `fct__monthly__financials` explore
    - `Monthly Financials (Demo, Snowflake)` dashboard
  - Databricks
    - `fct__monthly__financials_databricks` view
    - `fct__monthly__financials_databricks` explore
    - `Monthly Financials (Demo, Databricks)` dashboard
  - BigQuery
    - `fct__monthly__financials_bigquery` view
    - `fct__monthly__financials_bigquery` explore
    - `Monthly Financials (Demo, BigQuery)` dashboard
- Tableau data source, workbook, and dashboard are connected to the `fct__yearly__financials` model in Snowflake, Databricks, and BigQuery.
  - Snowflake
    - `FCT__YEARLY__FINANCIALS (DEMO.FCT__YEARLY__FINANCIALS) (CORE)` data source
    - `Yearly Financials (Snowflake)` workbook
    - `Yearly Financials Dashboard (Snowflake)` dashboard
  - Databricks
    - `fct__yearly__financials (demo.default.fct__yearly__financials) (default)` data source
    - `Yearly Financials (Databricks)` workbook
    - `Yearly Financials Dashboard (Databricks)` dashboard
  - BigQuery
    - `fct__yearly__financials (prod)` data source
    - `Yearly Financials (BigQuery)` workbook
    - `Yearly Financials Dashboard (BigQuery)` dashboard
- Power BI table, report, and dashboard are connected to the `fct__monthly__financials` model in Snowflake, Databricks, and BigQuery.
  - Snowflake
    - `FCT__MONTHLY__FINANCIALS` table
    - `Monthly Financials Snowflake` report
    - `Monthly Financials Snowflake` dashboard
  - Databricks
    - `fct__monthly__financials` table
    - `fact-monthly-financials-databricks` report
    - `Fact Monthly Financials Databricks` dashboard
  - BigQuery
    - `fct__monthly__financials` table
    - `Monthly Financials BigQuery` report
    - `Monthly Financials BigQuery` dashboard
The corresponding Datafold Demo Org contains the following integrations:
- Common
  - `datafold/demo` repository integration
  - `Postgres` data connection for Cross-DB data diff monitors
  - `Looker Public Demo` BI app integration
  - `Power BI` BI app integration
  - `Tableau Public Demo` BI app integration
- Snowflake specific
  - `Snowflake` data connection
  - `Coalesce-Demo` CI integration for the `Snowflake` data connection and the `master` branch
- Databricks specific
  - `Databricks-Demo` data connection
  - `Coalesce-Demo-Databricks` CI integration for the `Databricks-Demo` data connection and the `master-databricks` branch
- BigQuery specific
  - `BigQuery - Demo` data connection
  - `Coalesce-Demo-BigQuery` CI integration for the `BigQuery - Demo` data connection and the `master-bigquery` branch
- Dremio specific
  - `Dremio-Demo` data connection
  - `Coalesce-Demo-Dremio` CI integration for the `Dremio-Demo` data connection and the `master-dremio` branch
To get up and running with this project:
- Install dbt using these instructions.
- Fork this repository.
- Set up a profile called `demo` to connect to a data warehouse by following these instructions. You'll need `dev` and `prod` targets in your profile (a sketch is shown after this list).
- Ensure your profile is set up correctly from the command line: `$ dbt debug`
- Create your `prod` models: `$ dbt build --profile demo --target prod`

With prod models created, you're clear to develop and diff changes between your dev and prod targets.
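A minimal sketch of such a profile, assuming the Snowflake flavor; the account, warehouse, and schema values are placeholders rather than the project's real settings:

```yaml
# Hypothetical ~/.dbt/profiles.yml; adjust connection details for your warehouse.
demo:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account
      user: your_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: demo
      schema: dev_yourname     # personal development schema
      warehouse: transforming
      threads: 4
    prod:
      type: snowflake
      account: your_account
      user: your_user
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: demo
      schema: core             # production schema (demo.core above)
      warehouse: transforming
      threads: 4
```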
Follow the quickstart guide to integrate this project with Datafold.
- `datagen/feature_used_broken.csv` - copied to `seeds/feature__used.csv`
- `datagen/feature_used.csv`
- `datagen/org_created_broken.csv` - copied to `seeds/org__created.csv`
- `datagen/org_created.csv`
- `datagen/signed_in_broken.csv` - copied to `seeds/signed__in.csv`
- `datagen/signed_in.csv`
- `datagen/subscription_created_broken.csv` - copied to `seeds/subscription__created.csv`
- `datagen/subscription_created.csv` - pushed to Postgres (`analytics.data_source.subscription_created` table)
- `datagen/user_created_broken.csv` - copied to `seeds/user__created.csv`
- `datagen/user_created.csv`
- `datagen/persons_pool.csv` - pool of persons used for user/org generation
- `datagen/data_generate.py` - main data generation script
- `datagen/data_to_postgres.sh` - pushes generated data to Postgres
- `datagen/persons_pool_replenish.py` - replenishes the pool of persons using ChatGPT
- `datagen/data_delete.sh` - deletes data for further re-generation
- `datagen/dremio__upload_seeds.py` - uploads seed files to Dremio (due to limitations in the standard dbt-dremio connector)
- zero or negative prices in the `subscription__created` seed
- corrupted emails in the `user__created` seed (user$somecompany.com)
- irregular spikes in the workday-seasonal daily number of sign-ins in the `signed__in` seed
- `null` spikes in the `feature__used` seed
- schema change: a 'wandering' column appears ~weekly in the `signed__in` seed
- The PR job fails when a 2nd commit is pushed to a PR branch targeting Databricks; most likely related to databricks/dbt-databricks#691.
