# Introduction and Overview

## High-Level: What does it do and why do I need it?

CloudDQ is a solution component for Data Quality management that supports ongoing Data Quality evaluation of evolving data assets within live data environments, driven by a declarative configuration:

- “ongoing evaluation” – repeated, frequent evaluation of the same data assets, as opposed to one-off checks
- “evolving data assets” – data assets that mutate and grow over time, as opposed to constant/static data assets
- “live data environments” – environments supporting production data operations, as opposed to migration or dev/test environments
- “declarative configuration” – specifies the intended goals (what) of DQ evaluation from an end-user perspective, as opposed to an imperative (how) approach

The declarative configuration specification is described in the Reference Guide.
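
As an orientation, a declarative configuration might look like the minimal sketch below. This is a hypothetical illustration only: entity, column, and binding names are placeholders, and the `row_filters` node with a `filter_sql_expr` key is assumed to follow the project's sample configurations; the authoritative schema is in the Reference Guide.

```yaml
# Minimal sketch of a declarative CloudDQ configuration (all names are placeholders).
entities:
  TEST_DATA:                 # logical entity name referenced by rule bindings
    source_database: BIGQUERY
    table_name: <table_id>
    dataset_name: <dataset_id>
    project_name: <project_id>
    columns:
      UNIQUE_KEY:
        name: unique_key
        data_type: STRING

row_filters:
  NONE:                      # no-op filter referenced via row_filter_id
    filter_sql_expr: |-
      True

rules:
  NOT_NULL_SIMPLE:
    rule_type: NOT_NULL

rule_bindings:
  KEY_IS_NOT_NULL:           # what to check: rule + entity + column + row filter
    entity_id: TEST_DATA
    column_id: UNIQUE_KEY
    row_filter_id: NONE
    rule_ids:
      - NOT_NULL_SIMPLE
```

A rule binding ties a rule to an entity column and a row filter; the sections below describe each built-in rule type within this structure.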

## Concepts

### Built-in Rule Types

The CloudDQ framework defines 5 basic types of rules:

- NOT_NULL
- NOT_BLANK
- REGEX
- CUSTOM_SQL_EXPRESSION
- CUSTOM_SQL_STATEMENT

Of these rules, the first four support row-level validation and the last one supports set-level validation. We first explain this difference, then describe each rule type in detail.

#### Row-level Versus Set-level Validation

Row-level validation checks attributes of individual rows or elements of the data: for example, whether a value is in a given range, or whether all fields of a row are internally consistent. There are also validation checks that cannot be performed by inspecting just one row or element; we call this "set-level validation". Possibly the simplest example of set-level validation is a check on the total row count.

An important difference between row-level and set-level validation is that for set-level validation one generally cannot say which rows cause a check to fail; in some cases it is not even possible to say how many rows fail the check. With row-level validation, on the other hand, each validation error can be connected to a specific row. This difference is reflected in the CloudDQ output. For row-level validation, CloudDQ reports the number of rows validated, how many were successful, how many produced errors, and how many were null. For set-level checks, CloudDQ reports only the number of rows the check was run on, the number of errors generated, and an overall success or failure flag.

The number of errors from set-level validation may or may not be related to a number of rows, depending on how the check is coded. For example, for a uniqueness check, the number of errors may equal the number of duplicated rows, the number of elements that are duplicated, or something else entirely.

Of the built-in rule types, CUSTOM_SQL_STATEMENT is intended for set-level validation, so any check implemented using this rule type produces the set-level output described above.
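
To make this concrete, the sketch below shows what a set-level uniqueness check could look like as a CUSTOM_SQL_STATEMENT (the rule type is described in detail later in this section). It is illustrative only: rule, binding, and column names are placeholders, and the statement is written so that one row is returned per duplicated value, which is why the reported error count need not match the number of failing rows.

```yaml
rules:
  NO_DUPLICATE_VALUES:
    rule_type: CUSTOM_SQL_STATEMENT
    params:
      custom_sql_statement: |-
        -- returns one row per duplicated value, so the error count reflects
        -- duplicated values rather than individual failing rows
        select $column as duplicated_value, count(*) as n
        from data
        group by $column
        having n > 1

rule_bindings:
  UNIQUE_KEY_IS_UNIQUE:
    entity_id: TEST_DATA
    column_id: UNIQUE_KEY
    row_filter_id: NONE
    rule_ids:
      - NO_DUPLICATE_VALUES
```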

#### Rule Type NOT_NULL

Violated if the given column contains NULL in any row.

```yaml
rules:
  NOT_NULL_SIMPLE:
    rule_type: NOT_NULL

rule_bindings:
  NOT_NULL:
    entity_id: TEST_DATA
    column_id: UNIQUE_KEY
    row_filter_id: NONE
    rule_ids:
      - NOT_NULL_SIMPLE
```

Complete example: Rule and rule binding

#### Rule Type NOT_BLANK

Violated if the given column contains an empty string in any row.

```yaml
rules:
  NOT_BLANK_SIMPLE:
    rule_type: NOT_BLANK

rule_bindings:
  NOT_BLANK:
    entity_id: TEST_DATA
    column_id: UNIQUE_KEY
    row_filter_id: NONE
    rule_ids:
      - NOT_BLANK_SIMPLE
```

Complete example: Rule and rule binding

#### Rule Type REGEX

For STRING columns, this rule type allows the specification of a Python regular expression to define allowed patterns.

The example shows a check for a valid email address.

```yaml
rules:
  VALID_EMAIL:
    rule_type: REGEX
    params:
      pattern: |-
        ^[^@]+[@]{1}[^@]+$

rule_bindings:
  VALID_EMAIL:
    entity_id: TEST_DATA
    column_id: EMAIL
    row_filter_id: NONE
    rule_ids:
      - VALID_EMAIL
```

Complete example: Rule and rule binding

#### Rule Type CUSTOM_SQL_EXPRESSION

This rule type (written as CUSTOM_SQL_EXPR in the rules configuration, as in the example below) allows the specification of an SQL expression returning a boolean. The validation fails when the expression returns FALSE.

This rule type, contrary to CUSTOM_SQL_STATEMENT, supports row-level validation (see the preceding section on the differences between row-level and set-level validation). This implies it has access to the row-level context of the table defined as the "entity", meaning that all columns of the same row can be accessed.
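
For instance, a cross-column consistency check could look like the hedged sketch below. The column names (DISCOUNT, total_amount) and binding names are purely illustrative and would need to exist in the entity definition; other columns of the same row are assumed to be referenced by their source column names.

```yaml
rules:
  DISCOUNT_NOT_ABOVE_TOTAL:
    rule_type: CUSTOM_SQL_EXPR
    params:
      custom_sql_expr: |-
        -- other columns of the same row are in scope for the expression
        $column <= total_amount

rule_bindings:
  DISCOUNT_IS_CONSISTENT:
    entity_id: TEST_DATA
    column_id: DISCOUNT
    row_filter_id: NONE
    rule_ids:
      - DISCOUNT_NOT_ABOVE_TOTAL
```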

The example below performs a SELECT on a reference table to compare values against a list of known currencies.

The SQL expression can be parametrized using custom_sql_parameters (see the CUSTOM_SQL_LENGTH example rule and how it is used). The SQL expression can also use an implicit parameter, $column, to identify the column that the condition should be applied to. The value of the $column parameter refers to a column in the referenced entity. In the example below, the rule is applied to column CURRENCY in the TEST_DATA entity, which in turn refers to a column called curr in a table in BigQuery.

```yaml
entities:
  TEST_DATA:
    source_database: BIGQUERY
    table_name: <table_id>
    dataset_name: <dataset_id>
    project_name: <project_id>
    columns:
      CURRENCY:
        name: curr
        data_type: STRING

rules:
  CORRECT_CURRENCY_CODE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: conformance
    params:
      custom_sql_expr: |-
        $column in (select distinct currency_code from `<reference_dataset_id>.currency_codes`)

rule_bindings:
  VALID_CURRENCY_CODE:
    entity_id: TEST_DATA
    column_id: CURRENCY
    row_filter_id: NONE
    rule_ids:
      - CORRECT_CURRENCY_CODE
```

Example rules can be found in base-rules.yml.
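
To illustrate the parametrization mentioned above, here is a hedged sketch of a length check with one custom parameter. The placement and exact key name of the parameter list (custom_sql_parameters, following the wording above) and the $-prefixed substitution are assumptions; confirm the exact syntax against the CUSTOM_SQL_LENGTH rule in base-rules.yml. Rule and binding names are illustrative.

```yaml
rules:
  MAX_LENGTH:
    rule_type: CUSTOM_SQL_EXPR
    custom_sql_parameters:
      - upper_bound
    params:
      custom_sql_expr: |-
        LENGTH($column) <= $upper_bound

rule_bindings:
  CURRENCY_CODE_LENGTH:
    entity_id: TEST_DATA
    column_id: CURRENCY
    row_filter_id: NONE
    rule_ids:
      - MAX_LENGTH:
          upper_bound: 3
```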

#### Rule Type CUSTOM_SQL_STATEMENT

The “custom SQL statement” rule type allows you to specify set-level validation through a custom SQL statement returning a result set (i.e. SELECT ... FROM ...). The statement indicates a violation of the rule by returning a non-empty set of records.

The SQL statement references the source data as “data”. The dataset behind “data” is defined through the entity_id field in the rule binding.

The SQL statement can also be parametrized using the implicit parameter $column and custom SQL arguments, as shown in the example below. The rule binding sets the value of the $column parameter through the column_id field, and the custom arguments are set explicitly by listing them under each rule reference.

This rule type is intended for set-level validation. See the discussion of row-level versus set-level validation at the start of this section for more information.

```yaml
rules:
  ROW_COUNT:
    rule_type: CUSTOM_SQL_STATEMENT
    custom_sql_arguments:
      n_max
    params:
      custom_sql_statement: |-
        -- returns a record (i.e. a violation) only when the row count exceeds n_max
        select
          count(*) as n
        from data
        having n > $n_max

rule_bindings:
  ROW_COUNT:
    entity_id: TEST_DATA
    column_id: TX_ID
    row_filter_id: NONE
    rule_ids:
      - ROW_COUNT:
          n_max: 1000
```

Complete example: Rule and rule binding

## CloudDQ Execution

On each run, CloudDQ converts each rule_binding into a SQL script and creates a corresponding view in BigQuery. It then aggregates the validation results per rule_binding and rule for the run into a DQ Summary Statistics table, written to the BigQuery table specified in the CLI argument --target_bigquery_summary_table.

## Consuming CloudDQ Outputs

CloudDQ generates a summary table called dq_summary, as well as a view for each rule binding. You can visualize the DQ Summary Statistics table using any dashboarding solution, such as Data Studio or Looker, or use it for monitoring and alerting purposes.

CloudDQ reports validation summary statistics for each rule_binding and rule by appending to the target BigQuery table specified in the CLI argument --target_bigquery_summary_table. Record-level validation statistics are captured in the columns success_count, success_percentage, failed_count, failed_percentage, null_count, and null_percentage. The rule type NOT_NULL reports the count of NULLs present in the input column_id in the columns failed_count and failed_percentage, and always sets the columns null_count and null_percentage to NULL. CUSTOM_SQL_STATEMENT rules do not report record-level validation statistics and therefore set the columns success_count, success_percentage, failed_count, failed_percentage, null_count, and null_percentage to NULL.

Set-level validation results for CUSTOM_SQL_STATEMENT rules are captured in the columns complex_rule_validation_errors_count and complex_rule_validation_success_flag. complex_rule_validation_errors_count contains the count of rows returned by the custom_sql_statement block. complex_rule_validation_success_flag is set to TRUE if complex_rule_validation_errors_count equals 0, FALSE if it is larger than 0, and NULL for all rule types other than CUSTOM_SQL_STATEMENT.

The summary table stores the results of the data quality validations, with the output summary for each combination of rule binding and rule per validation run. This output is structured in the summary table as follows:

### Summary Table Description

The table below lists the columns of the dq_summary table, i.e. the output of a CloudDQ validation run.

| Column name | Description |
| --- | --- |
| execution_ts (timestamp) | Timestamp of when the validation query was executed. |
| rule_binding_id (string) | ID of the rule binding for which validation results are reported. |
| rule_id (string) | ID of the rule under the rule binding for which validation results are reported. |
| dimension (string) | Data Quality dimension of the rule_id. This value can only be one of the values specified in the rule_dimensions YAML node. |
| table_id (string) | ID of the table for which validation results are reported. |
| column_id (string) | ID of the column for which validation results are reported. |
| last_modified (timestamp) | The last modified timestamp of the table_id being validated. |
| metadata_json_string (string) | Key-value pairs of the metadata parameter content specified under the rule binding or during the data quality run. |
| configs_hashsum (string) | The hash sum of the JSON document containing the rule binding and all its associated rules, rule bindings, row filters, and entities configurations. This allows tracking when the content of a rule_binding ID or one of its referenced configurations has changed. |
| dataplex_lake (string) | ID of the Dataplex lake containing the table being validated. |
| dataplex_zone (string) | ID of the Dataplex zone containing the table being validated. |
| dataplex_asset_id (string) | ID of the Dataplex asset containing the table being validated. |
| dq_run_id (string) | Unique ID of the record. |
| invocation_id (string) | ID of the data quality run. All data quality summary records generated within the same data quality execution instance share the same invocation_id. |
| progress_watermark (boolean) | Determines whether this particular record will be considered by the data quality check when determining the high watermark for incremental validation. If FALSE, the respective record is ignored when establishing the high watermark value. This is useful when executing test data quality validations that should not advance the high watermark. CloudDQ populates this field with TRUE by default, but this can be overridden if the argument --progress_watermark has a value of FALSE. |
| rows_validated (integer) | Total number of records validated after applying row_filters and any high-watermark filters on the incremental_time_filter_column_id column, if specified. |
| complex_rule_validation_errors_count (float) | Number of rows returned by a CUSTOM_SQL_STATEMENT rule. |
| complex_rule_validation_success_flag (boolean) | Success status of CUSTOM_SQL_STATEMENT rules. |
| success_count (integer) | Total number of records that passed validation. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| success_percentage (float) | Percentage of records that passed validation within the total number of records validated. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| failed_count (integer) | Total number of records that failed validation. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| failed_percentage (float) | Percentage of records that failed validation within the total number of records validated. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| null_count (integer) | Total number of records that returned null during validation. This field is set to NULL for NOT_NULL and CUSTOM_SQL_STATEMENT rules. |
| null_percentage (float) | Percentage of records that returned null during validation within the total number of records validated. This field is set to NULL for NOT_NULL and CUSTOM_SQL_STATEMENT rules. |
| failed_records_query (string) | For every rule that fails, this column stores a query that can be used to retrieve the failed records. To aid further debugging, you can also specify "reference columns" in YAML that will be made part of this query; joining the output of failed_records_query with the original data on the reference columns returns the entire record. For example, you can specify a primary key column or compound primary key columns as reference columns. |
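
As a sketch of the monitoring and alerting use case mentioned above, a query along the following lines could surface failing checks from the most recent run. It only uses columns listed in the table above; the fully qualified table location is a placeholder for wherever your --target_bigquery_summary_table points.

```sql
-- Hypothetical monitoring query over the dq_summary output table;
-- <project_id>.<dataset_id> is a placeholder for your summary dataset.
SELECT
  execution_ts,
  rule_binding_id,
  rule_id,
  table_id,
  column_id,
  failed_count,
  failed_percentage,
  complex_rule_validation_success_flag,
  failed_records_query
FROM `<project_id>.<dataset_id>.dq_summary`
WHERE invocation_id = (
  -- restrict to the most recent CloudDQ run
  SELECT invocation_id
  FROM `<project_id>.<dataset_id>.dq_summary`
  ORDER BY execution_ts DESC
  LIMIT 1
)
  AND (failed_count > 0 OR complex_rule_validation_success_flag = FALSE)
ORDER BY failed_percentage DESC
```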