Skip to main content

Find Bad Data

You might encounter problems with how data has been structured or formatted that you must fix prior to providing the content to your target system. You can use the methods in this section to locate problems with the content or data typing of your data.

In the Trifacta Application, it is very easy to identify where there are errors in your data. What is truly innovative is how you correct them:

  1. Identify missing or mismatched data by color-coded bars in column data.

  2. Select a bar.

  3. Suggestions are offered in a set of cards on the right panel.

  4. Click a suggestion, and immediately see the effects of the suggested transformation previewed in the data grid.

    1. If the transformation needs tweaking, you can edit the transformation as needed.

    2. If the transformation is not the correct one, click another suggestion.

  5. When satisfied, you add the transformation, and your sample of data is transformed.

CS-CleanAndEnhanceYourData-SelectErrors.png

Figure: Select errors in your data, and review AI-driven suggestions for how to correct. Make the change on the spot.

Through this series of seeing, selecting, and refining issues in your sampled data, you can address basic errors in data mismatches, missing data, non-standard values, outlier values, and much more to improve the overall consistency and quality of your data.

Find bad data

In the Transformer page, above each column of data is a data quality bar and histogram.

The top bar is the data quality bar. The data quality bar segments the values found in the column into three color-coded bands:

Color bar

Description

green

Valid values for the current data type of the column

red

Invalid values for the current data type of the column

black

Missing values could be empty or null.

CS-BadData.png

Figure: Mismatched values in a column are indicated in red

Change column data type

In the image above, you can identify the data type of the column based on the icon to the left of the column name (POS_Sales). In this case, the data type is Decimal.

In some cases, invalid data can be fixed by simply changing the column data type. You can click the current data type indicator to review and select a more appropriate data type.

Tip

You can change the data type of the column by click the data type icon for the column.

Tip

No value is invalid for the String data type.

CS-FindAndFixBadData-ChangeDataType.png

Figure: Change column data type

Find outlier data

You can explore the details of a column of data to review statistical metrics on the data and to locate outlier values. In the column menu, select Column Details.

CS-FindAndFixBadData-ColumnDetails.png

Figure: Column Details

Tip

When these bars are clicked or SHIFT-clicked, the selected values are used to prompt suggestions for how to transform them.

Tip

You can explore the patterns in the data in the Patterns tab, where you can also use these patterns to standardize the formatting of your data.

Fix Mismatched Values

When Dataprep by Trifacta evaluates a dataset sample, it interprets the values in a column against its expectations for the values. Based on the column's specified data type and internal pattern matching, values are categorized as valid, mismatched, or missing. These value categories are represented in a slender bar at the top of each column.

  • A mismatched value is any value that seems to be of a different data type than the type specified for the column. For example, if the value San Francisco appears in a column of Zip Code type, it would be marked as a mismatched value.

In the data quality bar, mismatched values are identified in red:

Tip

Before you start performing transformations on your data based on mismatched values, you should verify the data type for these columns to ensure that they are correct. The type against which values are checked is displayed to the upper left of the data quality bar. Below, the data type is ZIP for U.S. Zip code data.

BadData-DataQualityBar.png

Figure: Mismatched values in red

Mismatched values can be sourced from a variety of issues:

  • Values may be miskeyed into the source system.

  • The source system may introduce errors in output, particularly if the data is generated for export using a customized structure.

  • Incorrect use of column delimiters may create offsets among fields in individual rows.

  • Data may be badly structured across a set of rows.

  • The column may be assigned the wrong data type.

Tip

When cleaning up bad data, you should look to work from bigger problems to smaller problems. If a higher percentage of a column's values have been categorized as mismatched data, it may indicate a wider problem with the data. In affected rows, verify if other columns' values are also mismatched. These rows should be reviewed and fixed first. When fixed, other mismatches may be fixed in other rows, too.

To locate data:

Note

Remember that you are working on a sample of your data. For small datasets, the Initial Data sample includes all rows of the dataset and is unsampled.

  • From the Transformer page, click the mismatched values in a column's data quality bar to see their count, highlight them in the rows of the data grid, and trigger a set of suggestions for your review.

  • To refine the data grid view, click the Show Only Affected Rows checkbox in the status bar at the bottom of the screen. Only the rows that are affected by the previewed transform are displayed.

    Tip

    This step highlights specific values that are mismatched. You can take note of individual values.

  • To locate a specific value, click the Filters icon on the right side of the screen. In the Rows tab, enter the specific value to locate. Rows containing this value are highlighted. Back in the data grid, you can select one of these highlighted values to be prompted for suggestions.

Methods for fixing mismatched data

When you discover mismatched data in your dataset, you have the following basic methods of fixing it:

  1. Change the data type. If the percentage of mismatched rows is significant, you may need to change the data type for a better match.

  2. Replace the values with constant values. This method works if it is clear to you that the values should be a single, consistent value. Select the mismatched values in the column, and then select one of the highlighted mismatched values. Use the replace transform to change the mismatched values to corrected values.

    Tip

    One easy way to fix isolated problems with mismatched values is to highlight a mismatched value in the data grid. A new set of suggestions is displayed. You can select the replace suggestion and then modify it to include the replacement value.

  3. Set values with other columns' values. You can use the set transform to fix mismatched values by replacing them with the corresponding values from other columns.

  4. Use functions. Data can be fixed by using a function in conjunction with the set transform to replace mismatched values.

  5. Delete rows. Select the mismatched values and use the delete transform to remove the problematic rows.

  6. Hide the column for now. You can remove the column from display if you want to focus on other things. Select Hide from the column drop-down. Note that hidden columns appear in any generated output.

  7. Delete the column. If the column data is unnecessary or otherwise unusable, you can delete the entire column from your dataset. Select Delete from the column drop-down.

Tip

Delete unnecessary columns as early as possible. Less data is easier to work with in the application and improves job execution performance.

Note

You might need to review and fixed mismatched data problems multiple times in your dataset. For example, if you unnest the data, additional mismatches might be discovered. Similarly, joins and lookups can reveal mismatches in data typing.

Mismatched values in transform code

In your transforms, mismatched data can be identified references as in the following:

Transformation Name

Edit column with formula

Parameter: Columns

postal_code

Parameter: Formula

IF(ISMISMATCHED(postal_code, ['Zipcode']),'00000',postal_code)

Note that the single quotes are important around the value, which identifies the value as a constant.

Tip

In the above, note that the value Zipcode identifies the data type that is used for matching the column values. In this case, for greater specificity, you might want to identify the mismatched values in the column against the data type Integer, since all U.S. postal codes are positive integers. For more information on how to explicitly reference data types in your steps, see Valid Data Type Strings.

Trim data

To trim whitespace out of a column, use the following transformation:

Transformation Name

Edit column with formula

Parameter: Columns

column1

Parameter: Formula

TRIM($col)

The $col token is a reference to the column name to which the formula is being applied. For more information, see Source Metadata References.

This step may increase the number of missing values (for values that contain only whitespace characters) and the number of instances of matching values (for values that have spaces before and after an alphanumeric value).

You can modify the above transformation to trim leading and trailing spaces across all columns in your dataset. The wildcard (*) applies the formula to all columns in the dataset.

Transformation Name

Edit column with formula

Parameter: Columns

*

Parameter: Formula

TRIM($col)

You can extend the above transformation further by removing any leading or trailing single- and double-quote marks using the TRIMQUOTES function wrapped around the TRIM reference:

Tip

Keep in mind that nested functions are evaluated from the inside out. In this case, the TRIM function is evaluated first, which removes any surrounding whitespace. Then, the TRIMQUOTES function is applied.

Transformation Name

Edit column with formula

Parameter: Columns

*

Parameter: Formula

TRIMQUOTES(TRIM($col))

Set values using other columns

You can use values from other columns to replace mismatched values in your current column. Using the previous example, mismatched postal codes are replaced by the corresponding value in the parent entity's postal code column (parent_postal_code):

Transformation Name

Edit column with formula

Parameter: Columns

parent_postal_code

Parameter: Formula

IF(ISMISMATCHED(postal_code, ['Zipcode']),'00000',postal_code)

Use functions to fix mismatched values

In your transforms, you can insert a predefined function to replace mismatched data values. In the following example, the value for mismatched values in the score column are computed as the average of all values in the column:

Transformation Name

Edit column with formula

Parameter: Columns

score

Parameter: Formula

IF(ISMISMATCHED(score, ['Decimal']),AVERAGE(score),score)

Tip

You can also use the IFMISMATCHED function to test for mismatched values. Unlike the above construction, however, IFMISMATCHED does not support an else clause when the value does match the listed data type.

Bad data typing

Tip

Particularly for dates, data is often easiest to manage as String data type. Dataprep by Trifacta has a number of functions that you can deploy to manage strings. After the data has been properly formatted, you can change it to the proper data type. If you change data type immediately, you may have some challenges in reformatting and augmenting it. Do this step last.

For columns that have a high percentage of mismatched values, the column's data type may have been mis-assigned. In the following example, a column containing data on precipitation in inches has been mis-typed as Boolean data:

BadData-ChangeDataType.png

Figure: Mis-typed column data type

To change a column's data type, click the type identifier at the top of the column and select a new type. In this case, you would select Decimal.

Note

After you change the type, review the data quality bar again. If there are still mismatched values, review them to see if you can categorize the source of the mismatch.

As you can see in the previous example, the precipitation column contains values set to T, which may be short for true. When the data type is set to Decimal, these values now register as mismatched data. To fix, you can replace all T values with 1.0 using the set transform.

Select an instance of T in the column and click the Set suggestion card. Click Modify. For the value in the transform, enter 1.0. Your transform should look like the following:

Transformation Name

Edit column with formula

Parameter: Columns

PrecipitationIn

Parameter: Formula

IFMATCHES([PrecipitationIn], `{start}{bool}{end}`),'1.0',PrecipitationIn)

Tip

If possible, you should review and refer to an available schema of your dataset, as generated from the source system. If the data has also been mis-typed in the source system, you should fix it there as well, so any future exports from that system show the correct type.