CloudDQ is a Data Quality management component that supports ongoing Data Quality evaluation of evolving data assets within live data environments, driven by a declarative configuration:
- “ongoing evaluation” – repeated, frequent evaluation of the same data assets – as opposed to a one-off check
- “evolving data assets” – data assets that mutate and grow over time – as opposed to constant/static data assets
- “live data environments” – environments supporting production data operations – as opposed to migration or dev/test environments
- “declarative configuration” – specifies the intended goals (the what) of DQ evaluation from an end-user perspective – as opposed to an imperative (the how) approach
The declarative configuration specification is described in the Reference Guide.
The CloudDQ framework defines five basic rule types:
- NOT_NULL
- NOT_BLANK
- REGEX
- CUSTOM_SQL_EXPRESSION
- CUSTOM_SQL_STATEMENT
Of these rules, the first four support row-level validation and the last one supports set-level validation. We will explain this difference before describing each rule type in detail.
Row-level validation checks attributes of individual rows or elements of the data: for example, whether a value is in a given range, or whether all fields within a row are internally consistent. Some validation checks cannot be performed by inspecting a single row or element; we call these "set-level validation". Possibly the simplest example of set-level validation is a check on the total row count.
An important difference between row-level and set-level validation is that in the case of set-level validation, one generally cannot say which rows cause a check to fail. In some cases it is not even possible to say how many rows fail the check. With row-level validation, on the other hand, we can connect each validation error to a specific row. This difference is reflected in the CloudDQ output. For row-level validation, CloudDQ reports the number of rows validated, how many were successful, how many produced errors, and how many were null. For set-level checks, CloudDQ only reports how many rows the check was run on, how many errors were generated, and an overall success or failure flag.
The number of errors from set-level validation may or may not correspond to a number of rows, depending on how the check is written. For a uniqueness check, for example, the number of errors may equal the number of duplicated rows, the number of distinct values that are duplicated, or something else entirely.
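To make this concrete, the following illustration (plain Python, not CloudDQ code) shows two different error-counting conventions a set-level uniqueness check could use on the same data:

```python
from collections import Counter

# Hypothetical column values; "A" appears three times.
values = ["A", "A", "A", "B", "C"]
counts = Counter(values)

# Convention 1: count every row that belongs to a duplicated value.
duplicated_rows = sum(n for n in counts.values() if n > 1)    # 3

# Convention 2: count each distinct value that is duplicated.
duplicated_values = sum(1 for n in counts.values() if n > 1)  # 1

print(duplicated_rows, duplicated_values)
```

Both are defensible "error counts" for the same check, which is why set-level error counts only have the meaning the check's author gives them.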
Of the built-in rule types, CUSTOM_SQL_STATEMENT is intended for set-level validation. Any check implemented using this rule type therefore produces the kind of output that can be expected from set-level validation.
NOT_NULL – violated if any row in the given column contains a null value.
```yaml
rules:
  NOT_NULL_SIMPLE:
    rule_type: NOT_NULL

rule_bindings:
  NOT_NULL:
    entity_id: TEST_DATA
    column_id: UNIQUE_KEY
    row_filter_id: NONE
    rule_ids:
      - NOT_NULL_SIMPLE
```
Complete example: Rule and rule binding
NOT_BLANK – violated if any row in the given column contains an empty string.
```yaml
rules:
  NOT_BLANK_SIMPLE:
    rule_type: NOT_BLANK

rule_bindings:
  NOT_BLANK:
    entity_id: TEST_DATA
    column_id: UNIQUE_KEY
    row_filter_id: NONE
    rule_ids:
      - NOT_BLANK_SIMPLE
```
Complete example: Rule and rule binding
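A rough way to picture the per-row semantics of these two rules (an illustrative Python sketch, not CloudDQ internals — CloudDQ evaluates these checks in BigQuery SQL):

```python
# Illustrative sketch of the row-level pass/fail semantics of
# NOT_NULL and NOT_BLANK; None stands in for a SQL NULL.
def not_null(value):
    """Fails (returns False) when the value is null."""
    return value is not None

def not_blank(value):
    """Fails when the value is an empty string."""
    return value != ""

rows = ["abc", "", None]
print([not_null(v) for v in rows])   # [True, True, False]
print([not_blank(v) for v in rows])  # [True, False, True]
```

Because each check is evaluated independently per row, every failure can be traced back to a specific row — the defining property of row-level validation.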
REGEX – for STRING columns. This rule type allows the specification of a Python regular expression to define allowed patterns. The example below checks for a valid email address.
```yaml
rules:
  VALID_EMAIL:
    rule_type: REGEX
    params:
      pattern: |-
        ^[^@]+[@]{1}[^@]+$

rule_bindings:
  VALID_EMAIL:
    entity_id: TEST_DATA
    column_id: EMAIL
    row_filter_id: NONE
    rule_ids:
      - VALID_EMAIL
```
Complete example: Rule and rule binding
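To see what this pattern accepts and rejects, you can try it directly in Python. It requires exactly one "@" with at least one non-"@" character on each side — a loose sanity check rather than full address validation:

```python
import re

# The email pattern from the example rule above.
pattern = re.compile(r"^[^@]+[@]{1}[^@]+$")

print(bool(pattern.match("alice@example.com")))  # True
print(bool(pattern.match("no-at-sign")))         # False
print(bool(pattern.match("two@@signs")))         # False
```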
CUSTOM_SQL_EXPRESSION – this rule type allows the specification of an SQL expression returning a boolean. The validation fails when the expression returns FALSE.
Contrary to CUSTOM_SQL_STATEMENT, this rule supports row-level validation (see the preceding section on the differences between row-level and set-level validation). This implies it has access to the row-level context of the table defined as the "entity", meaning that all columns of the same row can be accessed.
The example below performs a SELECT on a reference table to compare the column against a list of known currencies.
The SQL expression can be parametrized using custom_sql_parameters (see the CUSTOM_SQL_LENGTH example rule and how it is used). The SQL expression can also use an implicit parameter $column to identify the column that the condition should be applied to. The value of $column refers to a column in the referenced entity. In the example below, the rule is applied to column CURRENCY in the TEST_DATA entity, which in turn refers to a column called curr in a BigQuery table.
```yaml
entities:
  TEST_DATA:
    source_database: BIGQUERY
    table_name: <table_id>
    dataset_name: <dataset_id>
    project_name: <project_id>
    columns:
      CURRENCY:
        name: curr
        data_type: STRING

rules:
  CORRECT_CURRENCY_CODE:
    rule_type: CUSTOM_SQL_EXPR
    dimension: conformance
    params:
      custom_sql_expr: |-
        $column in (select distinct currency_code from `<reference_dataset-id>.currency_codes`)

rule_bindings:
  CORRECT_CURRENCY_CODE:
    entity_id: TEST_DATA
    column_id: CURRENCY
    row_filter_id: NONE
    rule_ids:
      - CORRECT_CURRENCY_CODE
```
Example rules can be found in base-rules.yml.
The CUSTOM_SQL_STATEMENT rule type allows you to specify set-level validation through a custom SQL statement returning a dataset (i.e. SELECT ... FROM ...). The statement indicates a violation of the rule by returning a non-empty set of records.
The SQL statement references the source data as "data". The dataset is defined through the entity_id field in the rule binding.
The SQL statement can also be parametrized using the implicit parameter $column and custom SQL arguments, as shown in the example below. The rule binding sets the value of the $column parameter through the column_id field, and the custom arguments are set explicitly by listing them on each rule reference.
This rule type is intended for set-level validation. See the discussion of row-level and set-level validation at the start of this section for more information.
```yaml
rules:
  ROW_COUNT:
    rule_type: CUSTOM_SQL_STATEMENT
    custom_sql_arguments:
      - n_max
    params:
      custom_sql_statement: |-
        -- returns a (non-empty) result exactly when the count exceeds n_max
        select
          count(case when $column is null then 0 else $column end) as n
        from data
        having n > $n_max

rule_bindings:
  ROW_COUNT:
    entity_id: TEST_DATA
    column_id: TX_ID
    row_filter_id: NONE
    rule_ids:
      - ROW_COUNT:
          n_max: 1000
```
Complete example: Rule and rule binding
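Conceptually, CloudDQ renders the statement by substituting the implicit $column parameter and each custom SQL argument before executing it against BigQuery. A minimal illustration of that substitution, using Python's string.Template on a simplified row-count statement (this is not CloudDQ's actual rendering code):

```python
from string import Template

# Simplified statement using the implicit $column parameter and the
# custom argument $n_max, as in the rule example above.
template = Template("select count($column) as n from data having n > $n_max")

# The rule binding supplies the column; the rule reference supplies n_max.
rendered = template.substitute(column="tx_id", n_max="1000")
print(rendered)
# select count(tx_id) as n from data having n > 1000
```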
On each run, CloudDQ converts each rule_binding into a SQL script, creates a corresponding view in BigQuery, and aggregates the validation results per rule_binding and rule into a DQ Summary Statistics table: the BigQuery table specified in the CLI argument --target_bigquery_summary_table.
CloudDQ generates a summary table called dq_summary, plus a view for each rule binding. You can visualize the results using any dashboarding solution, such as Data Studio or Looker, or use the DQ Summary Statistics table for monitoring and alerting purposes.
CloudDQ reports validation summary statistics for each rule_binding and rule by appending to the target BigQuery table specified in the CLI argument --target_bigquery_summary_table. Record-level validation statistics are captured in the columns success_count, success_percentage, failed_count, failed_percentage, null_count, and null_percentage. The rule type NOT_NULL reports the count of NULLs present in the input column_id in the columns failed_count and failed_percentage, and always sets the columns null_count and null_percentage to NULL. CUSTOM_SQL_STATEMENT rules do not report record-level validation statistics and therefore set all six of these columns (success_count, success_percentage, failed_count, failed_percentage, null_count, null_percentage) to NULL.
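To make the record-level columns concrete, here is an illustrative derivation (plain Python, not CloudDQ code) of the six statistics from per-row outcomes, where each validated row either passes, fails, or evaluates to null:

```python
# Illustrative sketch: deriving the record-level summary columns
# from per-row validation outcomes.
def summarize(outcomes):
    """outcomes: list of 'pass', 'fail', or 'null', one per validated row."""
    total = len(outcomes)
    success = outcomes.count("pass")
    failed = outcomes.count("fail")
    nulls = outcomes.count("null")
    return {
        "rows_validated": total,
        "success_count": success,
        "success_percentage": 100.0 * success / total,
        "failed_count": failed,
        "failed_percentage": 100.0 * failed / total,
        "null_count": nulls,
        "null_percentage": 100.0 * nulls / total,
    }

print(summarize(["pass", "pass", "fail", "null"]))
```

Note that all three percentages are taken over the total number of records validated, so they sum to 100.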
Set-level validation results for CUSTOM_SQL_STATEMENT rules are captured in the columns complex_rule_validation_errors_count and complex_rule_validation_success_flag. complex_rule_validation_errors_count contains the count of rows returned by the custom_sql_statement block. complex_rule_validation_success_flag is set to TRUE if complex_rule_validation_errors_count equals 0, to FALSE if it is larger than 0, and to NULL for all rule types other than CUSTOM_SQL_STATEMENT.
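The flag logic described above can be sketched as follows (illustrative Python, with None standing in for a SQL NULL):

```python
# Sketch of the complex_rule_validation_success_flag logic:
# True when the custom SQL statement returned no rows, False otherwise,
# and None (NULL) for rule types other than CUSTOM_SQL_STATEMENT.
def success_flag(rule_type, errors_count):
    if rule_type != "CUSTOM_SQL_STATEMENT":
        return None
    return errors_count == 0

print(success_flag("CUSTOM_SQL_STATEMENT", 0))   # True
print(success_flag("CUSTOM_SQL_STATEMENT", 12))  # False
print(success_flag("NOT_NULL", 0))               # None
```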
The summary table stores the results of the data quality validations, with one output summary for each combination of rule binding and rule per validation run. The table below lists the columns of the dq_summary table, i.e. the output of a CloudDQ validation run.
| Column name | Description |
|---|---|
| execution_ts | (timestamp) Timestamp of when the validation query was executed. |
| rule_binding_id | (string) ID of the rule binding for which validation results are reported. |
| rule_id | (string) ID of the rule under the rule binding for which validation results are reported. |
| dimension | (string) Data Quality dimension of the rule_id. This value can only be one of the values specified in the rule_dimensions YAML node. |
| table_id | (string) ID of the table for which validation results are reported. |
| column_id | (string) ID of the column for which validation results are reported. |
| last_modified | (timestamp) The last modified timestamp of the table_id being validated. |
| metadata_json_string | (string) Key-value pairs of the metadata parameter content specified under the rule binding or during the data quality run. |
| configs_hashsum | (string) The hash sum of the JSON document containing the rule binding and all its associated rules, rule bindings, row filters, and entities configurations. This allows tracking when the content of a rule_binding ID or one of its referenced configurations has changed. |
| dataplex_lake | (string) ID of the Dataplex lake containing the table being validated. |
| dataplex_zone | (string) ID of the Dataplex zone containing the table being validated. |
| dataplex_asset_id | (string) ID of the Dataplex asset containing the table being validated. |
| dq_run_id | (string) Unique ID of the record. |
| invocation_id | (string) ID of the data quality run. All data quality summary records generated within the same data quality execution instance share the same invocation_id. |
| progress_watermark | (boolean) Determines whether this particular record will be considered by the data quality check in determining the high watermark for incremental validation. If FALSE, the respective record is ignored when establishing the high watermark value. This is useful when executing test data quality validations that should not advance the high watermark. CloudDQ populates this field with TRUE by default, but this can be overridden if the argument --progress_watermark has a value of FALSE. |
| rows_validated | (integer) Total number of records validated after applying row_filters and any high-watermark filters on the incremental_time_filter_column_id column, if specified. |
| complex_rule_validation_errors_count | (float) Number of rows returned by a CUSTOM_SQL_STATEMENT rule. |
| complex_rule_validation_success_flag | (boolean) Success status of CUSTOM_SQL_STATEMENT rules. |
| success_count | (integer) Total number of records that passed validation. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| success_percentage | (float) Percentage of records that passed validation within the total number of records validated. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| failed_count | (integer) Total number of records that failed validation. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| failed_percentage | (float) Percentage of records that failed validation within the total number of records validated. This field is set to NULL for CUSTOM_SQL_STATEMENT rules. |
| null_count | (integer) Total number of records that returned null during validation. This field is set to NULL for NOT_NULL and CUSTOM_SQL_STATEMENT rules. |
| null_percentage | (float) Percentage of records that returned null during validation within the total number of records validated. This field is set to NULL for NOT_NULL and CUSTOM_SQL_STATEMENT rules. |
| failed_records_query | (string) For every rule that fails, this column stores a query that can be used to retrieve the failed records. To aid debugging further, you can also specify "reference columns" in YAML that will be made part of this query. Using reference columns enables you to join the output of failed_records_query with the original data to obtain the entire record; e.g., you can specify a primary key column (or compound primary key columns) as reference columns. |
(string) For every rule that fails - this column stores a query that can be utilized to get failed records. To aid debugging further - you can also specify “reference columns” in YAML that will be made part of this query. Use of reference columns will enable you to join the output of “failed_records_query” with the original data to get the entire record. E.g. you can specify a primary_key column or a compound primary_key column(s) as a reference column(s). |