Overview of Data Quality Rules
In the Transformer page, you can design data quality rules to apply to the displayed sample of your data. These rules can be used to identify anomalies and to assess the completeness, uniqueness, and validity of your data.
Rules can also be defined to assess whether the data is fit for its intended purpose in your data pipeline. In addition, you can use a calculated metric (derived metric) as a data quality input type to create a metric-based data quality rule.
Rule Types
Metric-based rules
You can use custom metrics to assess data quality by specifying a calculated metric (derived metric) as the data quality input type for a rule. For example, you can create a constraint that the inventory quantity should be within a specific range.
Metric input types are supported for the following rules:
In Range
Greater Than
Less Than
Equals
Not Equals
In Set
Not In Set
Examples of data quality rules:
Check that all product identifiers fit a specified pattern
Verify that there are no negative values for any count columns
Validate that primary key columns contain unique values
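The three example checks above can be sketched outside the product in plain Python. This is only an illustration of the logic, not how the Transformer page evaluates rules: the sample rows, the column names, the identifier pattern, and the pass_fraction helper are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"product_id": "AB-1001", "item_count": 4, "order_id": 1},
    {"product_id": "AB-1002", "item_count": 0, "order_id": 2},
    {"product_id": "bad id", "item_count": -3, "order_id": 2},
]

def pass_fraction(values, predicate):
    """Return the fraction of values that satisfy the predicate."""
    values = list(values)
    if not values:
        return 1.0  # an empty column has nothing that can fail
    return sum(1 for v in values if predicate(v)) / len(values)

# Assumed identifier format: two capital letters, a hyphen, four digits.
PRODUCT_ID = re.compile(r"[A-Z]{2}-\d{4}")

product_ids = [r["product_id"] for r in rows]
item_counts = [r["item_count"] for r in rows]
order_ids = [r["order_id"] for r in rows]

# Pattern check: every product identifier fits the assumed pattern.
pattern_score = pass_fraction(
    product_ids, lambda v: PRODUCT_ID.fullmatch(v) is not None
)

# Value check: no negative values in a count column.
non_negative_score = pass_fraction(item_counts, lambda v: v >= 0)

# Uniqueness check: a key value passes only if it occurs exactly once.
occurrences = Counter(order_ids)
unique_score = pass_fraction(order_ids, lambda v: occurrences[v] == 1)
```

Each score is the share of rows that pass the check, which is roughly the kind of information a per-rule quality bar conveys.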
For some rule types, you can specify a metric in the Column value.
For example, instead of specifying a column name such as OrderTotal as the input for the data quality rule, you could specify AVERAGE(OrderTotal).
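To make the difference concrete, here is a minimal Python sketch of an In Range check applied to a metric rather than to individual column values. It assumes that a metric-based rule tests a single aggregate value. The sample values, the bounds, and the in_range helper are hypothetical, and statistics.mean stands in for AVERAGE(OrderTotal).

```python
from statistics import mean

# Hypothetical sample of the OrderTotal column.
order_totals = [120.0, 80.0, 100.0]

def in_range(value, low, high):
    """Return True when value lies within the inclusive range [low, high]."""
    return low <= value <= high

# Metric-based input: one aggregate value for the whole column.
average_order_total = mean(order_totals)

# Assumed bounds for the rule; pick limits that fit your data.
rule_passes = in_range(average_order_total, 50.0, 150.0)
```

A column-based version of the same rule would instead test each OrderTotal value against the range.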
Note
Metric-based rules are supported only for some rule types. For more information on the rules that support metrics, see Data Quality Rules Reference.
Note
Data quality rules are not transformation steps. They assess the current state of the sampled data in the Transformer page and can be used to assist in constructing transformation steps to improve data quality.
Note
As you apply transformation steps to the data, the state of your data quality rules is automatically updated to reflect the changes. If you delete columns or other elements referenced in the data quality rules, errors are generated in the Transformer page.
Custom rules
You can create custom rules using formulas containing Wrangle functions.
Limitations
Rules cannot be included in macros.
Rules cannot be parameterized.
Sets of rules are created for each recipe. Rules cannot be shared between recipes.
Data Quality Rule Categories
Rules break down into the following categories:
Category | Description |
---|---|
Integrity Constraints | Rule types in this category assess the validity of a column's data and any implied relationships between the data (e.g., City + State implies Zip Code). |
Pattern Matching | These rule types test whether the data in your column matches patterns that you define. |
Column Values | These rule types compare column values to limits or sets of acceptable values. In addition to column references, you can specify metric-based values. For example, you can create a constraint that the sales quantity should be within a specific range. |
Other Rules | You can also create data quality rules based on custom Wrangle formulas. |
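As an illustration of the implied-relationship idea in the Integrity Constraints category, the following Python sketch takes the City + State implies Zip Code example literally and reports any City and State pair that maps to more than one Zip Code. The rows, the column names, and the dependency_violations helper are hypothetical, and this is not the product's implementation.

```python
from collections import defaultdict

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"city": "Springfield", "state": "IL", "zip": "62701"},
    {"city": "Springfield", "state": "IL", "zip": "62701"},
    {"city": "Portland", "state": "OR", "zip": "97201"},
    {"city": "Portland", "state": "OR", "zip": "97035"},
]

def dependency_violations(rows, determinant, dependent):
    """Return determinant keys that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[column] for column in determinant)
        seen[key].add(row[dependent])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}

violations = dependency_violations(rows, ("city", "state"), "zip")
```

In this sample, only the Portland, OR pair violates the assumed relationship, because it appears with two different Zip Codes.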
Data quality types
Within each of the above categories, you can explore and define a variety of types of data quality rules. These rule types provide a template for creating the rule, which accepts one or more input parameters that you specify.
Creating Rules
For each recipe, you can create individualized sets of rules from within the Transformer page. In the Data Quality Rules panel, you build your data-specific rules and can review the quality bars of each rule as you continue to build your recipe.
For more information on creating rules, see Add Data Quality Rule.
Reviewing Suggestions
Through the Data Quality Rules panel, you can review a set of suggested data quality rules that are applicable to your dataset. These rules are generated based on heuristics applied to your sampled data. For more information, see Data Quality Rules Panel.
Rules Evaluation
In the Transformer page:
Rules are evaluated and displayed for the current location in the recipe. For example, if you change the location of the recipe cursor to a point earlier in the recipe, all of the defined rules are evaluated for the state of the dataset sample at that point in the recipe.
The data quality rules defined in the Transformer page are applied to the displayed sample. If your sample is not the full dataset, you should consider taking additional samples to validate the rules across other parts of your dataset.
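The following Python sketch shows why a rule that passes on the displayed sample may still fail on the rest of the dataset. The data, the rule, and the pass_fraction helper are hypothetical. The sample is assumed to be the first rows of the dataset, which is only one possible sampling method.

```python
def pass_fraction(values, predicate):
    """Return the fraction of values that satisfy the predicate."""
    values = list(values)
    if not values:
        return 1.0  # an empty column has nothing that can fail
    return sum(1 for v in values if predicate(v)) / len(values)

def is_non_negative(value):
    """Hypothetical rule: a count must not be negative."""
    return value >= 0

# Hypothetical full dataset; the negative values appear only in later rows.
full_dataset = [5, 7, 3, 9, -2, 4, -1, 8]

# Assumed sample: the first four rows of the dataset.
sample = full_dataset[:4]

sample_score = pass_fraction(sample, is_non_negative)
full_score = pass_fraction(full_dataset, is_non_negative)
```

Here the rule passes for every row in the sample, but two of the eight rows in the full dataset fail. Taking additional samples, or reviewing the results after a job run, can surface this kind of gap.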
Data Quality in Job Details
After you have successfully run your job, you can review the results of your data quality rules applied across the entire dataset in the Rules tab on the Job Details page.
Note
To display data quality results in your job details, visual profiling must be enabled for job execution.
When visual profiling is enabled for your job, the rules are applied across the entire dataset during job execution, and the Rules tab in the Job Details page contains the results for the job's recipes.
Tip
Data quality rules are available for download in JSON and PDF format.
For more information, see Job Details Page.