Skip to main content

Validate Column Values against a Dataset

When needed, you can validate a column of values against a pre-defined set of values maintained in a separate dataset. This method of data validation is most useful for String-based data that does not easily map to a specific pattern of values.

Overview

This method of validation is completed through the following general steps:

  1. Prepare your validation dataset. Create a dataset containing the unique values against which you wish to validate.

  2. Import your validation dataset. After you have prepared the dataset externally, you should import it into the Cloud Portal.

  3. Join your data to your validation dataset. You perform a join from the dataset you're wrangling to the validation dataset that you imported. Validation errors should be identifiable as missing values in the validation column.

  4. Triage defects as necessary.For rows that cannot be resolved, additional wrangling may be necessary.

  5. Standardize data. You can use the Standardize tool to review the differences between invalid data and valid data.

Example

This approach is best demonstrated by example. Below, you can see a set of orders for product.

productName

customerName

Qty

totalSales

Product ADA

Customer ABC

2

26

Product AEV

Customer DEF

4

100

Product DXL

Customer EFG

6

42

Product EDM

Customer ABC

1

26

Product JTO

Customer DEF

3

75

Product JUB

Customer EFG

5

35

Product NRS

Customer ABC

6

26

product NSE

Customer DEF

8

200

Product ZZZ

Customer EFG

10

80

Notes:

  • You can see that this set of orders is spread across 10 different products for three different customers.

  • In the productName column, there is a mismatch in capitalization.

  • The final productName value (Product ZZZ) does not exist.

The product names in this list must be validated against a dataset containing the list of all available products. This list is 100 product names long.

You can use the links below to download these datasets as CSV files for exploration in your project or workspace.

Prepare Validation Dataset

If you haven't done so already, you should prepare your validation dataset for use in the Cloud Portal. Below, you can see the first 10 rows of the ProductNames dataset:

productName

Product ADA

Product AEV

Product ANH

Product ARA

Product ARM

Product AUJ

Product BAD

Product BAP

Product BEI

Product BEZ

Notes:

  • A column header is provided in the dataset. This is helpful for identifying the column to use later as the join key.

  • You may wish to enter a validation column, simply contains the value TRUE.

    Tip

    If your dataset does not contain this column, you can create a new formula within the Transform Builder to insert this value. This step is covered later.

Import Validation Dataset

If you have prepared your dataset, you must import into in the application.

Steps:

  1. In the Cloud Portal, click Library for Data.

  2. Click Import Data.

  3. Navigate to the file or files to import.

    Tip

    If you are using the example files, you can right-click them above, download them to your desktop, and then drag and drop them into the Import Data page.

  4. Import the file as a new dataset.

    Tip

    If you are using the example datasets, you can call it Reference-ProductNames.

  5. It may be helpful to import the file into a new file and create a recipe from it.

For the example dataset, there is a single column of values. To make this dataset useful as a validation dataset, add the following transformation, which adds a second column called validation containing the value true for each row.

Transformation Name

New formula

Parameter: Formula type

Single row formula

Parameter: Formula

'true'

Parameter: New column name

validation

The example reference dataset now looks like the following:

productName

validation

Product ADA

TRUE

Product AEV

TRUE

Product ANH

TRUE

Product ARA

TRUE

Product ARM

TRUE

Product AUJ

TRUE

Product BAD

TRUE

Product BAP

TRUE

Product BEI

TRUE

Product BEZ

TRUE

Join with Validation Dataset

Now you can join your existing dataset with the new validation dataset. A join performs a comparison of the column values in one dataset compared to the column values in another dataset. Where matches are detected, columns and values of the joined-in dataset are inserted into the source dataset. For more information on joins, see Join Types.

Steps:

  1. Edit the recipe of the source dataset.

  2. As a new step for the recipe, enter join datasets in the Search panel.

  3. Select the source of the joined-in data:

    1. If you created a recipe and added steps (as in the example), then click the Recipes in current flow tab.

    2. If you imported a clean dataset, then click one of the Datasets tabs.

  4. For Join type, select Left.

    Tip

    A left join includes all rows from the left (source) dataset and only the matching rows from the right dataset for a specified set of column values (join keys) in the left dataset. If a column value in the left dataset does not exist in the right dataset, then null values are listed for that row's entry for all columns imported from the right dataset.

  5. For the Join keys, select the column containing values to check from the left (source) dataset and then column containing the reference values in the right dataset.

    Tip

    In the example datasets, both of these columns are called ProductNames, which assists the join tool in identifying the join key columns.

  6. Under Join Keys, hover over one of the column names. Then, click the Pencil icon.

    ValidateColumnValuesagainstaDataset-EditJoinKeys.png

    Figure: Edit join keys

  7. When you edit the join keys, you can specify the Condition, which defines the type of comparison that is performed to determine a match.

  8. The other options allow you to fine-tune how matching is performed. In particular, the Ignore case option is off, which means that by default, joins are case-sensitive. So, product 01 does not match to Product 01.

    Tip

    In the example data, you can see that the product NSE entry does not have a match, which is due to differences in case. If Ignore case is enabled, then this entry may find a match. However, you may wish to maintain case-sensitive searches to ensure that you can clean up the data correctly.

  9. Click Save and continue.

  10. Click Next.

  11. Select all columns, and click Review.

    ValidateColumnValuesagainstaDataset-ReviewJoin.png

    Figure: Review join

    Tip

    In the example, you can see that two rows failed to match.

  12. Click Add to recipe.

Triage Invalid Data

You should now have a dataset containing all columns from both datasets.

Tip

In the example dataset, a second column called validation was added. This column contains null values for the mismatched rows. So, you can delete the duplicate ProductName column, which contains null values in two rows.

You can make some decisions on how to triage the invalid values in the ProductNames column.

Insert error messages

You can use a transformation to replace the null values in the validation column with a meaningful message:

Transformation Name

Edit column with formula

Parameter: Columns

validation

Parameter: Formula

IF ($col == NULL(),'Error - invalid Product Name','ok')

The transformation replaces the null values with Error - invalid Product Name and writes ok for the other rows.

Delete invalid rows

If the entire row of data is invalid because of the invalid value, then you simply delete the row. This transformation deletes rows, where the validation column contains a null value:

Transformation Name

Filter rows

Parameter: Condition

Custom formula

Parameter: Type of formula

Custom single

Parameter: Condition

(validation == NULL())

Parameter: Action

Delete matching rows

Update validation dataset with new values

If you discover that some of the mismatched rows should be part of your validation dataset, you can follow these general steps to add them.

Steps:

  1. In Flow View, add a new recipe off of the recipe where the join occurs.

  2. Edit this new recipe.

  3. Insert this transformation to filter the dataset down to the rows containing the new values:

    Transformation Name

    Filter rows

    Parameter: Condition

    Custom formula

    Parameter: Type of formula

    Custom single

    Parameter: Condition

    (validation == NULL())

    Parameter: Action

    Keep matching rows
  4. You can then delete all columns except (in the example) the ProductName column. From the column menu of the ProductName column, select Delete others.

  5. You should now have a single column containing the missing values. You can create a union between this dataset and the validation dataset to add the values. See Append Datasets.

  6. Run a job on this recipe to generate the output file.

  7. This output file can then be used to replace the source data. In Flow View, select the imported dataset. From the context panel, open the More menu and select Replace.

Standardize invalid data

You can standardize values in join key column for your source data, which may address some of the invalid data. The Standardization tool attempts to cluster similar values together within the column, so that you can try to identify if your values in your source dataset can be matched to values in the target dataset.

Tip

After you have identified issues, this step may be best to apply before the join. By applying it before the join, some or all of the mismatched value issues may be addressed.

For your source column, you can select Standardize from the column menu.

In the example, the ProductNames column values are being standardized. In this case, the lower case product NSE value has been corrected to be Product NSE.

ValidateColumnValuesagainstaDataset-Standardize.png

Figure: Standardize join key values

For more information, see Overview of Cluster Clean.