Overview of Data Quality
Dataprep by Trifacta provides multiple mechanisms to transform and standardize data to meet usage needs, including profile visualizations and type-based quality bars to identify potential anomalies and quality problems. Data quality checks can be applied during data import, transformation, or export in the form of visual profiling.
Broadly speaking, data quality measures the degree to which data is usable and responsive to your use case. When you assess data quality, you are designing tests of the data's suitability both for general usage and for your specific uses.
Data Quality Characteristics
Data quality covers the following characteristics:
Completeness: values are present where they are needed and expected
Accuracy: data is substantively free of errors
Consistency: data can be matched across the different data sources of the enterprise
Timeliness: data values are up to date
Uniqueness: aggregated data is free of duplication, for example through filters or other transformations of the source data
Validity: data is structured according to an adequate and rigorous classification system
Availability / Accessibility: data is made available to the relevant stakeholders
Traceability: the history, processing, and location of the data under consideration can be easily traced
Schema Validation
Type inference
When data is imported, Dataprep by Trifacta attempts to infer the data type of each column in the source and to type the columns of the dataset accordingly. Type inference uses the first 20-25 rows of the initial sample to assess the appropriate data type to apply to the column. For more information, see Type Conversions.
Some imported data, such as relational tables, may include schema information that identifies the data type of each column. In some cases, you can disable type inference on imported data:
Connections: As part of the definition of a connection, you can optionally choose to disable type inference. For more information, see Create Connection Window.
Per-dataset: When you import a dataset, you can modify the import settings for the selected source to disable type inference. See Import Data Page.
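Conceptually, type inference of this kind amounts to trial-parsing a small sample of leading rows against candidate types and keeping the best match. The following Python sketch only illustrates that idea and is not Dataprep's implementation; the candidate types and the 25-row sample size mirror the behavior described above, and the helper name is invented.

```python
import pandas as pd

def infer_column_type(values, sample_rows=25):
    """Guess a column's type by trial-parsing its first rows."""
    sample = pd.Series(values).head(sample_rows).dropna().astype(str)
    if sample.empty:
        return "String"

    # Try numeric first; distinguish whole numbers from decimals.
    numbers = pd.to_numeric(sample, errors="coerce")
    if numbers.notna().all():
        return "Integer" if (numbers % 1 == 0).all() else "Decimal"

    # Then try dates; anything left over stays a String column.
    dates = pd.to_datetime(sample, errors="coerce")
    if dates.notna().all():
        return "Datetime"

    return "String"

print(infer_column_type(["12", "7", "30", None]))       # Integer
print(infer_column_type(["2024-01-05", "2024-02-10"]))  # Datetime
print(infer_column_type(["12.5", "alpha"]))             # String
```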
Assign targets
To assist in your transformation efforts, you can assign a target schema for each recipe. This target schema is superimposed on the columns of your data. Using visual tools to review differences and select changes, you can rapidly restructure the dataset you are developing to match the expected target schema. For more information, see Overview of Target Schema Mapping.
Identify Anomalies
In the Transformer page, you can use the available visual tools to review the data quality characteristics of the columns in your data. These data visualizations and type-based quality bars can assist in identifying potential anomalies and quality problems.
Data quality bar
At the top of each column, you can see a data quality bar, which uses the following color coding to validate the column values against the selected column type.
Color bar | Description |
---|---|
green | Values that are valid for the current data type of the column |
red | Values that are mismatched for the column data type |
gray | Missing or null values |
Tip
Click any of the color bars to receive suggestions for transformations to add to your recipe.
Tip
You can change a column's data type in the column header. See Column Menus.
For more information, see Data Quality Bars.
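Conceptually, the three segments of a quality bar are just three counts per column: values that parse as the column's type (green), values that do not (red), and values that are missing (gray). The pandas sketch below is purely illustrative of that computation; the column and parser are assumptions, not the product's internals.

```python
import pandas as pd

def quality_bar(series, parser):
    """Return the green/red/gray fractions for a column."""
    missing = series.isna() | (series.astype(str).str.strip() == "")
    parsed = series.where(~missing).map(parser)
    mismatched = ~missing & parsed.isna()   # present but not parseable as the type
    valid = ~missing & ~mismatched
    n = len(series)
    return {"valid": valid.sum() / n,
            "mismatched": mismatched.sum() / n,
            "missing": missing.sum() / n}

ages = pd.Series(["34", "twenty", None, "58", ""])
print(quality_bar(ages, lambda v: pd.to_numeric(v, errors="coerce")))
# {'valid': 0.4, 'mismatched': 0.2, 'missing': 0.4}
```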
Column histogram
In the column header, you can review the count and distribution of values in the column. A column's histogram can be useful for identifying anomalies or for selecting specific sets of values in the column for further exploration.
Tip
Click and drag over any set of values to receive suggestions for transformations to add to your recipe.
See Column Histograms.
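A column histogram is, at bottom, a frequency count of distinct values, or of binned ranges for a numeric column. For illustration only (the column and bin edges are invented):

```python
import pandas as pd

orders = pd.Series([12, 15, 15, 18, 42, 15, 90, 18])

# Frequency of each distinct value (categorical-style histogram).
print(orders.value_counts().sort_index())

# Binned distribution for a numeric column; an isolated high bin holding a
# single value (here 90) is the kind of anomaly worth drilling into.
print(pd.cut(orders, bins=[0, 30, 60, 90]).value_counts().sort_index())
```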
Column details
Through the Column Details panel, you can explore the quality and distribution of the values in the column. The contents of the panel vary depending on the data type. For example, if the column is typed for Datetime values, then the Column Details panel includes information on the distribution of values across the days of the week and days of the month.
For all data types, you can review useful statistics, including quartiles, uniqueness of values, mismatched values, and outliers.
Tip
The Column Details panel is very useful for acquiring statistical information on column values in a visual format. Click any data quality bar to be prompted for suggestions of transformation steps. See Overview of Predictive Transformation.
For more information, see Column Details Panel.
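For a numeric column, those details reduce to familiar computations: quartiles, a distinct-value count, and an outlier test. The sketch below uses the common 1.5 × IQR convention for outliers, which is an assumption made for illustration rather than a statement of how the product flags them.

```python
import pandas as pd

values = pd.Series([3.2, 3.4, 3.1, 3.3, 9.8, 3.2, 3.5])

q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print({"quartiles": (q1, q2, q3),
       "unique_values": values.nunique(),
       "outliers": outliers.tolist()})   # 9.8 is flagged as an outlier
```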
Data Quality Rules
You can build rules that test aspects of data quality that are specific to your organization. These rules persist throughout your recipe development and are used to evaluate the results that you generate. For more information, see Overview of Data Quality Rules.
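In essence, a data quality rule is a named assertion evaluated row by row, with pass and fail counts reported as you work. A minimal illustrative sketch follows; the rules, columns, and data are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 2, 4],
                   "amount": [19.99, -5.00, 42.10, 7.50]})

# Each rule maps a name to a row-level predicate that should hold.
rules = {
    "amount is non-negative": lambda d: d["amount"] >= 0,
    "order_id is unique":     lambda d: ~d["order_id"].duplicated(keep=False),
}

for name, predicate in rules.items():
    passed = predicate(df)
    print(f"{name}: {passed.sum()}/{len(df)} rows pass")
```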
Standardization
You can use the Standardization tool to standardize clustered sets of column values to values that are common and consistent throughout your enterprise's data. For more information, see Overview of Standardization.
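Standardization maps clustered variants of the same value, differing in case, spelling, or punctuation, onto a single canonical form. The hand-built mapping below only illustrates the end result; the Standardization tool derives the clusters for you.

```python
import pandas as pd

states = pd.Series(["CA", "Calif.", "california", "NY", "N.Y.", "New York"])

# One canonical value per cluster of variants.
standard = {"CA": "California", "Calif.": "California", "california": "California",
            "NY": "New York", "N.Y.": "New York", "New York": "New York"}

print(states.map(standard))
```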
Data Quality Functions
The following functions are available for assessing data quality.
Type functions
Item | Description |
---|---|
NULL Function | The NULL function generates null values. |
IFNULL Function | The IFNULL function writes out a specified value if the input value is null. Otherwise, it writes the input value. |
IFMISSING Function | The IFMISSING function writes out a specified value if the input value is missing. Otherwise, it writes the input value. |
IFMISMATCHED Function | The IFMISMATCHED function writes out a specified value if the input value is mismatched against the specified data type. Otherwise, it writes the input value. |
IFVALID Function | The IFVALID function writes out a specified value if the input value is valid for the specified data type. Otherwise, it writes the input value. |
ISNULL Function | The ISNULL function tests whether a set of values contains null values. |
ISMISSING Function | The ISMISSING function tests whether a set of values is missing or null. |
ISMISMATCHED Function | Tests whether a set of values is not valid for a specified data type. |
VALID Function | Tests whether a set of values is valid for a specified data type and is not a null value. |
PARSEINT Function | Evaluates a String input against the Integer datatype. If the input matches, the function outputs an Integer value. Input can be a literal, a column of values, or a function returning String values. |
PARSEBOOL Function | Evaluates a String input against the Boolean datatype. If the input matches, the function outputs a Boolean value. Input can be a literal, a column of values, or a function returning String values. |
PARSEFLOAT Function | Evaluates a String input against the Decimal datatype. If the input matches, the function outputs a Decimal value. Input can be a literal, a column of values, or a function returning String values. |
PARSEARRAY Function | Evaluates a String input against the Array datatype. If the input matches, the function outputs an Array value. Input can be a literal, a column of values, or a function returning String values. |
PARSEOBJECT Function | Evaluates a String input against the Object datatype. If the input matches, the function outputs an Object value. Input can be a literal, a column of values, or a function returning String values. |
PARSESTRING Function | Evaluates an input against the String datatype. If the input matches, the function outputs a String value. Input can be a literal, a column of values, or a function returning values. Values can be of any data type. |
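Conceptually, the IS*, IF*, and PARSE* functions above attempt a conversion to the target type and then return a Boolean, the converted value, or a fallback. The pandas sketch below is a loose analogue for a Decimal column, not the product's implementation; the column and default value are invented.

```python
import pandas as pd

raw = pd.Series(["12", "7.5", "oops", None])

parsed = pd.to_numeric(raw, errors="coerce")   # convert-or-null, akin to PARSEFLOAT
is_mismatched = raw.notna() & parsed.isna()    # present but unparseable, akin to ISMISMATCHED
cleaned = parsed.where(~is_mismatched, 0.0)    # substitute a default, akin to IFMISMATCHED

print(pd.DataFrame({"raw": raw, "parsed": parsed,
                    "mismatched": is_mismatched, "cleaned": cleaned}))
```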
Count functions
The following functions measure counts of values within a column, optionally counted by group.
Item | Description |
---|---|
COUNT Function | Generates the count of rows in the dataset. Generated value is of Integer type. |
COUNTA Function | Generates the count of non-null rows in a specified column, optionally counted by group. Generated value is of Integer type. |
COUNTDISTINCT Function | Generates the count of distinct values in a specified column, optionally counted by group. Generated value is of Integer type. |
UNIQUE Function | Extracts the set of unique values from a column into an array stored in a new column. This function is typically part of an aggregation. |
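In pandas terms, these roughly correspond to counting rows, non-null values, and distinct values per group, and to collecting a group's unique values into an array. An illustrative sketch with invented column names:

```python
import pandas as pd

df = pd.DataFrame({"region": ["east", "east", "west", "west", "west"],
                   "product": ["a", None, "a", "b", "b"]})

by_region = df.groupby("region")["product"]
print(by_region.size())      # rows per group            (cf. COUNT)
print(by_region.count())     # non-null values per group (cf. COUNTA)
print(by_region.nunique())   # distinct values per group (cf. COUNTDISTINCT)
print(by_region.unique())    # unique values as arrays   (cf. UNIQUE)
```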
Aggregation functions
Item | Description |
---|---|
AVERAGE Function | Computes the average (mean) from all row values in a column or group. Input column can be of Integer or Decimal type. |
SUM Function | Computes the sum of all values found in all row values in a column. Input column can be of Integer or Decimal type. |
MIN Function | Computes the minimum value found in all row values in a column. Input column can be of Integer, Decimal, or Datetime type. |
MAX Function | Computes the maximum value found in all row values in a column. Input column can be of Integer, Decimal, or Datetime type. |
MODE Function | Computes the mode (most frequent value) from all row values in a column, according to their grouping. Input column can be of Integer, Decimal, or Datetime type. |
MINDATE Function | Computes the minimum value found in all row values in a Datetime column. |
MAXDATE Function | Computes the maximum value found in all row values in a Datetime column. |
MODEDATE Function | Computes the most frequent (mode) value found in all row values in a Datetime column. |
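Each of these aggregations collapses a column, or each group within it, to a single value. An illustrative pandas sketch with invented column names:

```python
import pandas as pd

df = pd.DataFrame({"store": ["n", "n", "s", "s"],
                   "sales": [120.0, 80.0, 200.0, 200.0]})

summary = df.groupby("store")["sales"].agg(["mean", "sum", "min", "max"])
modes = df.groupby("store")["sales"].agg(lambda s: s.mode().iloc[0])  # first mode per group

print(summary)
print(modes)
```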
Statistical functions - single column
Variations in these functions:
Some of these functions have variations that use the sample method of computation instead of the population method.
IF conditional functions can be used to perform these statistical computations based on a condition.
General statistics
Item | Description |
---|---|
VAR Function | Computes the variance among all values in a column. Input column can be of Integer or Decimal type. If no numeric values are detected in the input column, the function returns a null value. |
STDEV Function | Computes the standard deviation across all column values of Integer or Decimal type. |
MEDIAN Function | Computes the median from all row values in a column or group. Input column can be of Integer or Decimal type. |
QUARTILE Function | Computes a specified quartile across all row values in a column or group. Input column can be of Integer or Decimal type. |
PERCENTILE Function | Computes a specified percentile across all row values in a column or group. Input column can be of Integer or Decimal type. |
Approximate statistics
Item | Description |
---|---|
APPROXIMATEMEDIAN Function | Computes the approximate median from all row values in a column or group. Input column can be of Integer or Decimal type. |
APPROXIMATEQUARTILE Function | Computes an approximation for a specified quartile across all row values in a column or group. Input column can be of Integer or Decimal type. |
APPROXIMATEPERCENTILE Function | Computes an approximation for a specified percentile across all row values in a column or group. Input column can be of Integer or Decimal type. |
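The population-versus-sample variation mentioned above changes only the denominator of variance and standard deviation (N versus N − 1). The pandas sketch below illustrates that distinction along with median, quartile, and percentile computations; it is not the product's implementation.

```python
import pandas as pd

values = pd.Series([4.0, 8.0, 6.0, 5.0, 3.0])

print(values.var(ddof=0), values.std(ddof=0))   # population variance / standard deviation
print(values.var(ddof=1), values.std(ddof=1))   # sample variance / standard deviation
print(values.median())                          # median
print(values.quantile(0.75))                    # third quartile
print(values.quantile(0.90))                    # 90th percentile
```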
Statistical functions - multi-column
Item | Description |
---|---|
COVAR Function | Computes the covariance between two columns using the population method. Source values can be of Integer or Decimal type. |
CORREL Function | Computes the correlation coefficient between two columns. Source values can be of Integer or Decimal type. |
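Covariance under the population method divides by N rather than N − 1, and the correlation coefficient is the covariance normalized by both standard deviations. For illustration, with invented data:

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 5.0, 9.0])

print(x.cov(y, ddof=0))   # population covariance
print(x.corr(y))          # Pearson correlation coefficient
```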
Data Quality in Job Details
When you run a job and generate results, you can review the quality of the data in the generated output.
Visual profiling
In parallel with executing the job, you can generate a visual profile of the generated results. This visual profile provides graphical representations of the valid and mismatched values against each column's data type, as well as indications about missing values in the output.
Tip
Visual profiles can be downloaded in PDF or JSON format for offline analysis.
Visual profiling is selected as part of the job definition process. See Run Job Page.
For more information, see Overview of Visual Profiling.
For more information, see Job Details Page.