-
Notifications
You must be signed in to change notification settings - Fork 587
Description
See PR #1019
Describe the feature
Add a functional_dependency test:
models:
- name: orders
columns:
- name: customer_name
tests:
- dbt_utils.functional_dependency:
depends_on:
- customer_id
One column is functionally dependent on some other column(s) in a table if each distinct combination of those other column(s) always leads to the same (possibly null) value in our one column.
Example. In this table, customer_name
is functionally dependent on customer_id
. The important part is that both rows for customer 2 have the same name "Brock."
order_id | customer_id | customer_name |
---|---|---|
1001 | 1 | Ash |
1002 | 2 | Brock |
1003 | 2 | Brock |
1004 | 3 | Ash |
1005 | 4 |
Describe alternatives you've considered
(1) Practitioners make their own custom generic test. However, functional dependency is pretty root-level.
(2) Practitioners add a normalizing table and test that downstream table for uniqueness. For example, materialize a table using select distinct customer_id, customer_name from orders
and check customer_id
for uniqueness.
(3) Practitioners adapt a dbt_expectations test as a workaround. Functional dependency can be considered "constant within groups" aka "less than 2 distinct within groups." Metabase's dbt_expectations has an appropriate test. However, using a workaround probably obscures what we're testing.
models:
- name: orders
columns:
- name: customer_name
tests:
- dbt_expectations.expect_column_distinct_count_to_be_less_than:
value: 2
group_by:
- customer_id
(4) dbt Utils adds an is_constant test so it can be adapted as a workaround. We can't do the dbt_expectations workaround in dbt_utils alone at present, because we don't have a is_constant test. We could add an is_constant test if that seems useful (or extend another test to cover that use case), and then build into it group_by functionality. However, an is_constant test isn't too useful otherwise.
Additional context
Not database-specific.
Who will this benefit?
Any dbt practitioner with messy source data!
This test is often useful for denormalized source data, where logical relationships between fields are implicitly expected but don't always hold, due to manual entry errors, or merges from different systems. Broken functional dependencies often surface as dupes and other anomalies downstream.
Even when downstream dupes are due to an "error in a model," the error might be that the model's logic assumes a functional dependency upstream that used to hold. For example, the doctor misspells a patient's name in 1 visit log, and now various models are doubling all that patient's rows.
Are you interested in contributing this feature?
Yes! Here's the PR: #1019