Define set of tests that are executed on models/columns that meet condition #5644
Replies: 2 comments 4 replies
-
@filipjelenkovicFO Thanks for opening the issue! I know this is one that's come up for many, many folks searching for a more ergonomic way to apply tests to many, many models at once. Some recent Slack threads: here + here.

I think the use cases here are totally legitimate. The question is all around how we'd want to model this in dbt's internals, to expose and enable the right user experience. Here are some possibilities I can imagine, starting linear and getting increasingly far out:

Explicit configuration

At simplest, make it possible to add tests in dbt_project.yml:

models:
  subdirectory:
    # every model in this subdirectory has a column called 'id', with unique + not_null tests
    # if any model *doesn't* have a column called 'id', the tests will still run (and fail)
    +columns:
      - name: id
        tests:
          - unique
          - not_null

A better approach, and a heavier lift, would be to start treating all tests—or generic tests, specifically—as actual properties of models, rather than as separate nodes (which is what they are today). Then, we'd make it possible to configure/add tests from dbt_project.yml.

Implicit derivation

Make it possible to define rule sets: "if this {property}, then that {generic test}." Those rules would be defined ... somewhere ... such that dbt would infer or derive the creation of tests based on another model property. Imagine: whenever dbt finds a model with a column tagged primary_key, it derives unique and not_null tests on that column.

We'd need a place to put these rules, front-and-center, so that it isn't hidden magic which just takes effect at runtime. I hesitate to say where, exactly.

Another form of implicit derivation, which you mention: if model A has a given test (say, a recency check), then models downstream of model A should pick it up too.

Other ways to define project resources

In dbt today, if you want to code up one function that produces multiple project resources, that requires code generation. Lots of folks have done this, with scripts and helpers written in Python, targeting dbt code (yaml + Jinja-SQL) as their "compilation target." It's totally possible to do this by reading in dbt project data (yaml files) or metadata (manifest.json).
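As a concrete illustration of that generation approach, here is a minimal sketch. It is not anything dbt ships: the directory layout (models/staging/), the assumption that every model there has an id column, and the output filename are all made up for the example.

from pathlib import Path

import yaml  # PyYAML

MODELS_DIR = Path("models/staging")          # assumed location of the covered models
OUTPUT_FILE = MODELS_DIR / "_generated.yml"  # assumed name for the generated properties file


def build_properties() -> dict:
    """Build a dbt properties (schema.yml) structure with id tests for every model found."""
    models = [
        {
            "name": sql_file.stem,
            "columns": [{"name": "id", "tests": ["unique", "not_null"]}],
        }
        for sql_file in sorted(MODELS_DIR.glob("*.sql"))
    ]
    return {"version": 2, "models": models}


if __name__ == "__main__":
    properties = build_properties()
    OUTPUT_FILE.write_text(yaml.safe_dump(properties, sort_keys=False))
    print(f"Wrote tests for {len(properties['models'])} models to {OUTPUT_FILE}")

dbt test then picks up the generated file like any other properties file; the trade-off is that the generator has to be re-run and its output re-committed whenever the models change.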
What if (huge what if) we could support alternative modes of defining dbt project resources? So, in addition to the standard yaml + Jinja-SQL files, imagine defining project resources programmatically.

Inspired by some pseudo-code from @lostmygithubaccount, when we were talking about this issue the other day:

models = dbt.Manifest.models()
for model in models:
    for column in model.columns:
        if "primary_key" in column.tags:
            new_test = dbt.Test(
                test_name = "unique",
                model = model.name,
                column_name = column.name
            )
            dbt.Manifest.tests().add(new_test)

Love it? Hate it? We're not sure if this sort of "dbt Python SDK" would be all that interesting to the average end user — but it could make a lot of sense for developers of packages and integrations, who will want to interact programmatically with dbt project contents. We already know many folks today who want to interact programmatically with dbt's runtime (tasks, hooks, logging, etc).

I don't see us making immediate progress on this one. We will continue making stepwise progress toward better defining dbt's internal constructs, so that we could imagine exposing them via a Python API. In the meantime, this need is one that should stay on our minds.

I'm going to convert this into a discussion; let's keep the conversation flowing over there.
-
I'm doing something to this effect with a post-hook macro today. That means I can define the post-hook once in dbt_project.yml for a particular directory, and it will always run for any model in that folder. I would prefer to define this as test(s) rather than a macro, since that makes the coding simpler: I don't have to write the error/warning handling myself.

In places where I generate code, I have also generated yml files with tests in them. That works well for generated code; where the code is hand-written, there is more work involved in managing those yml files (as you point out: reading, modifying, and then writing them). It also creates a "cannot see the trees for the forest" problem when committing changes to git: I have about 400 generated sql files today, so 400 yml files to go with them, and hence 400 files changing in a commit.

I know all of this can be handled, but I do like the concept of defining a test (either table- or column-level) in dbt_project.yml as a single place to define a test that is applied to hundreds of models.
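For the "reading, modifying and then writing them" part, a small script can at least keep those yml files consistent. Here is a rough sketch, assuming a single properties file whose key columns carry a "primary" tag; the file path, tag name, and chosen tests are illustrative, not a dbt convention.

from pathlib import Path

import yaml  # PyYAML

SCHEMA_FILE = Path("models/schema.yml")  # assumed path to the properties file
REQUIRED_TESTS = ["unique", "not_null"]  # tests to enforce on tagged columns


def add_tests(properties: dict) -> dict:
    """Add the required tests to every column tagged 'primary', skipping any already present."""
    for model in properties.get("models", []):
        for column in model.get("columns", []):
            if "primary" in column.get("tags", []):
                tests = column.setdefault("tests", [])
                for test in REQUIRED_TESTS:
                    if test not in tests:
                        tests.append(test)
    return properties


if __name__ == "__main__":
    properties = yaml.safe_load(SCHEMA_FILE.read_text())
    SCHEMA_FILE.write_text(yaml.safe_dump(add_tests(properties), sort_keys=False))

One caveat: a plain PyYAML round-trip drops comments and hand-formatting; a library like ruamel.yaml preserves them, which matters if the yml files are partly hand-written.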
-
Is this your first time submitting a feature request?
Describe the feature
I would like to have an option to define a set of tests that are executed on models that meet a condition. That way, if I want to add another test, I can do it in one place rather than going over hundreds of columns to add it.
Also, when you run dbt test, those tests should be applied to all matching models/columns.
Describe alternatives you've considered
Currently I can manage the set of tests manually, but that requires a great deal of effort to keep them in sync.
Who will this benefit?
For example: apply/run unique and not_null tests on all columns that are tagged as "primary". I might also want to add a "greater than zero" test on those columns.
Likewise, all models downstream of ModelA should have tests for recent data (my custom test).
There are certain tests that get added to models/columns of similar type/meaning/semantics; this feature would help in those situations.
Are you interested in contributing this feature?
No response
Anything else?
No response