Skip to content

Add functional_dependency test (PR included) #1020

@gatewaycat

Description

@gatewaycat

See PR #1019

Describe the feature

Add a functional_dependency test:

models:
  - name: orders
    columns:
      - name: customer_name
        tests:
        - dbt_utils.functional_dependency:
            depends_on:
              - customer_id

One column is functionally dependent on some other column(s) in a table if each distinct combination of those other column(s) always leads to the same (possibly null) value in our one column.

Example. In this table, customer_name is functionally dependent on customer_id. The important part is that both rows for customer 2 have the same name "Brock."

order_id customer_id customer_name
1001 1 Ash
1002 2 Brock
1003 2 Brock
1004 3 Ash
1005 4


Describe alternatives you've considered

(1) Practitioners make their own custom generic test. However, functional dependency is pretty root-level.

(2) Practitioners add a normalizing table and test that downstream table for uniqueness. For example, materialize a table using select distinct customer_id, customer_name from orders and check customer_id for uniqueness.

(3) Practitioners adapt a dbt_expectations test as a workaround. Functional dependency can be considered "constant within groups" aka "less than 2 distinct within groups." Metabase's dbt_expectations has an appropriate test. However, using a workaround probably obscures what we're testing.

models:
  - name: orders
    columns:
      - name: customer_name
        tests:
          - dbt_expectations.expect_column_distinct_count_to_be_less_than:
              value: 2
              group_by:
                - customer_id

(4) dbt Utils adds an is_constant test so it can be adapted as a workaround. We can't do the dbt_expectations workaround in dbt_utils alone at present, because we don't have a is_constant test. We could add an is_constant test if that seems useful (or extend another test to cover that use case), and then build into it group_by functionality. However, an is_constant test isn't too useful otherwise.

Additional context

Not database-specific.

Who will this benefit?

Any dbt practitioner with messy source data!

This test is often useful for denormalized source data, where logical relationships between fields are implicitly expected but don't always hold, due to manual entry errors, or merges from different systems. Broken functional dependencies often surface as dupes and other anomalies downstream.

Even when downstream dupes are due to an "error in a model," the error might be that the model's logic assumes a functional dependency upstream that used to hold. For example, the doctor misspells a patient's name in 1 visit log, and now various models are doubling all that patient's rows.

Are you interested in contributing this feature?

Yes! Here's the PR: #1019

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions