Skip to main content

Add Data Quality Rule

Note

This feature may not be available in all product editions. For more information on available features, see Compare Editions.

You can create rules to validate the quality of the data in your sample. When created, these rules allow you to highlight exceptions to the rule to assist in building your data cleansing recipe steps.

  • A data quality rule evaluates the values in one or more columns against a test criteria that you define.

    • Designer Cloud includes a library of pre-defined data quality rule types.

    • You can also create a custom rule using functions in the language.

  • Data quality rules are one of several features available for monitoring data quality during import, transformation, and export of your datasets. For more information, see Overview of Data Quality.

Note

Data quality rules are not transformation steps. They assess the current state of the sampled data in the Transformer page.

Note

As you apply transformation steps to the data, the state of your data quality rules is automatically updated to reflect the changes. If you delete columns or other elements referenced in the data quality rules, errors are generated in the Transformer page.

Add Rule

You can add a rule from inside the Transformer page.

Steps:

  1. You create rules inside the Transformer page. In the toolbar at the top of the screen, click the Data Quality Rules icon on the right side of the toolbar.

  2. The Data Quality Rules panel opens in the context panel.

  3. If you have not created any rules, the panel is empty. To create a new rule, click Add rule.

    Tip

    You can review a set of suggested data quality rules that are based on your dataset and add them as needed. Click View suggestions.

  4. The available types of data quality rules are displayed. Select your rule type.

    1. A simple one is Not Null. See Examples below.

    2. You can also add custom rules based on formulas that you specify. See "Add Custom Rule" below.

  5. May be missing: Some rule types support the May be missing checkbox. When it is enabled, the Data Quality rule allows missing values to be acceptable for a specified column.

    Note

    The May be missing rule parameter is not applicable to Not Null, Not Missing, Not Equal, Not In Set, and Formula rule types.

  6. Select the column or columns to which the rule applies.

    Tip

    Some rules can be applied to multiple columns.

  7. Click Add.

  8. Specify the other parameters as needed.

  9. Review the previewed results.

    Tip

    To simplify the preview, click the Show Only Affected Columns checkbox in the status bar.

  10. When finished, click Add to add the rule.

The new rule is displayed in the Data Quality Rules panel. In the data quality bar for the rule, green indicates the row values that have passed the rule, and red indicates the row values that failed.

  • Hover over either color to see the row counts and percentage.

  • Select either color to highlight the indicated rows in the data grid.

Tip

After creating a rule, you can jump back and forth between the Recipe panel and this panel to review how your changes to your recipe steps affect the data quality bars for your rules.

Additional options are available in the context menu for the rule.

Examples

Example - storeAddress column is Not Missing

The following data quality rule tests the values in the storeAddress column to see if they are missing (empty) values.

Data Quality Rule

Not Missing

Parameter: Column

storeAddress

Example - primaryKey column is Unique

The following rule evaluates the primaryKey column to determine if all values in it are unique.

Data Quality Rule

Unique

Parameter: Column

primaryKey

Example - SKU column matches pattern of SKU + 6 digits

Suppose the values of your SKUs must be in the form of "SKU + 6 digits".

Following uses Wrangle to perform the match. For more information on Wrangle , see Text Matching.

Data Quality Rule

Match

Parameter: Column

SKU

Parameter: Matches pattern

`SKU{digit}{6}`

Parameter: Ignore case

false

Example - orderColor must be "Blue", "Yellow" or "Green"

This rule tests the values in the orderColor column to verify that all values are Blue, Yellow, or Green.

In the following, the Acceptable values must be formatted as an array. See below.

Data Quality Rule

In Set

Parameter: Column

orderColor

Parameter: Acceptable values

['Blue','Yellow','Green']

Add Metric-Based Rule

For some rule types, you can create a data quality rule using custom metrics to assess the data quality. You can use the calculated metric type (derived metrics) as a data quality input type and create a metric-based data quality rule.

For example, you can create a metric-based rule to find out if the minimum value metric is within the acceptable range. To do this, you can select a In Range data quality rule, select Minimum input metric, and other required parameters to create a metric-based rule.

Note

Metric-based rules are supported only for some metric types.

Steps:

  1. You create rules inside the Transformer page. In the toolbar at the top of the page, click the Data Quality Rules icon on the right side of the toolbar.

  2. The Data Quality rules panel opens in the context panel.

  3. If you have not created any rules, the panel is empty. To create a new rule, click Add rule.

  4. The available types of data quality rules are displayed.

  5. Select one the rule types types that support metrics. For example, select In Range.

  6. Select the required input column value from the Input type drop-down. For example, select Average.

  7. Select the required column for which you want to add a metric-based rule. Additional examples are below.

  8. For Group by rows, you can select the column whose individual values are used to group the calculated metric rule. For example, if you group a maximum order by product identifier column, then the rule is calculated for individual product identifiers.

    Note

    If you do not group by row values, then the metric is calculated over all rows in the dataset. For large datasets, calculating flat aggregates can impact performance.

  9. Enter minimum and maximum values.

    1. Minimum value: Indicates the minimum (lower bound) of the range.

    2. Maximum value: Indicates the maximum (upper bound) of the range.

      Note

      For Greater than and Less than rule types, Minimum value and Maximum value options are displayed.

  10. Exclude minimum and maximum from range: When this option is enabled, the Data Quality rule excludes the minimum and maximum values from the acceptable range.

    Note

    The Exclude minimum and maximum from range rule parameter is applicable only to In Range, Greater than, and Less than rule types.

  11. May be missing: When this checkbox is selected, the Data Quality rule allows missing values to be acceptable for a specified column.

  12. Click Add. The metric-based data quality rule is added.

The new rule is displayed in the Data Quality Rules panel. In the data quality bar for the rule, the green bar indicates the row values that have passed the rule, and the red bar indicates the row values that failed.

  • Hover over either color to see the row counts and percentage.

  • Select either color to highlight the indicated rows in the data grid.

Additional options are available in the context menu for the rule.

Examples

Example - Metrics for In Range rule type

This rule checks if the values in the POS_SALES column fall between the minimum and maximum range.

Data Quality Rule

In Range

Parameter: Input type

Minimum

Parameter: Column

POS_SALES

Parameter: Group rows by

POS_QTY

Parameter: Minimum Value

3

Parameter: Maximum Value

7

When this rule is added, the rows that do not fall within the range are flagged in red.

Example - Metrics for Greater than rule type

This rule checks if the values in the POS_QTY column are greater than the minimum value.

Data Quality Rule

Greater than

Parameter: Input type

Average

Parameter: Column

POS_QTY

Parameter: Group rows by

POS_SALES

Parameter: Minimum Value

10.4

When this rule is added, the rows that do not fall within the range are flagged in red.

Example - Metrics for Less than rule type

This rule checks if the values in the NET_SHIP_QTY column are less than the maximum value.

Data Quality Rule

Less than

Parameter: Input type

Maximum

Parameter: Column

NET_SHIP_QTY

Parameter: Group rows by

POS_QTY

Parameter: Maximum Value

24

When this rule is added, the rows that do not fall within the range are flagged in red.

Tip

You can use the available checkboxes to include the missing values and to exclude the minimum and maximum values from range.

Add Custom Rule

You can add custom rules using formulas containing Wrangle functions.

Wrangle is the proprietary language used to transform your data. You can also apply the functions of the language to your data quality rules. For more information, see Wrangle Language.

Steps:

  1. In the Data Quality Rules panel, click Add Rule.

  2. Under Other Rules, select Formula.

  3. In the Formula textbox, enter the Wrangle formula to test your data.

    Note

    The formula that you provide must evaluate to true or false. true values are highlighted in green in the data quality bar for the rule.

  4. For aggregation functions, you can group the evaluation of your rule based on the values in your grouping column.

    Tip

    You can group by multiple columns. The first column is the outermost grouping.

  5. To add the rule, click Add.

Examples

Example - sum of daily sales >= 100

You can use data quality rules to perform some data analysis functions. For example, suppose you want to flag the dates where the total sales of all of your orders was less than 100.

Data Quality Rule

Formula

Parameter: Formula

SUM(mySales) >= 100

Parameter: Group

myDate

When this rule is added, the rows whose date total is less than 100 are flagged in red.

Edit Rule

To edit a rule, select Edit rule from the context menu for the rule in the panel.

Delete Rule

To delete a rule, select Delete rule from the context menu for the rule in the panel.

Export Rules

Job results

When you generate a profile as part of your job results, you can download the profile in JSON or PDF format from the Job Details page.

When you download the profile in JSON format, the set of rules for the job are also included. Search for profilerRules in the JSON file.

Flows

When flows are exported and imported, the rule definitions for the recipes in the flow are also exported.

Using Rules

In the data quality rule bar, a green bar indicates that the row values passed the rule check, and a red bar indicates that the row values are failed. You can hover over the displayed color to see the row counts and percentage.

Context menu:

The following context menu options are available when you create a rule:

  • Edit rule: Edit the data quality rule.

  • Delete rule: Delete the data quality rule.

    Note

    Deleting a data quality rule does not affect your data.

  • Show Failing Values Only: Highlight the values that have failed the rule.

  • Show Passing Values Only: Highlights the values that have passed the rule

  • Clear Preview: Removes the value highlighting from the data grid.