Overview of Schema Management

A schema refers to the sequence and data types of columns in a dataset. Schemas apply to relational tables and some file formats. This section provides an overview of how Dataprep by Trifacta captures and tracks changes to input schemas, as well as the methods available for transforming your data to match a target schema.

Overview of Schemas

A schema is a skeleton structure that represents the logical view of the dataset. The dataset can be a file, table, or a SQL query in a database. A schema defines how the data is structured and organized. Schema information includes:

Column names
Column ordering
Column data types

Schemas may apply to relational tables and schematized file formats such as Avro and Parquet.
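
As a rough illustration of the three pieces of schema information listed above, a schema can be modeled as an ordered sequence of named, typed columns. The `Column` and `Schema` names and the type labels below are hypothetical and are not part of the product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Column:
    """One column of a schema (hypothetical illustration type)."""
    name: str
    data_type: str  # placeholder type label, not an actual product type name


# A schema is an ordered sequence of columns: the position in the
# tuple carries the column ordering.
Schema = Tuple[Column, ...]


def describe(schema: Schema) -> List[str]:
    """Return one 'position: name (type)' line per column."""
    return [f"{i}: {c.name} ({c.data_type})" for i, c in enumerate(schema, start=1)]


orders: Schema = (
    Column("order_id", "Integer"),
    Column("customer", "String"),
    Column("total", "Decimal"),
)
```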

Input type conversions

Depending on the data source, Dataprep by Trifacta can read native data types into Alteryx data types. For more information, see Type Conversions.

Schema Validation

Note

This feature may not be available in all product editions. For more information on available features, see Compare Editions.

Over time, schema sources may change in major and minor ways, often without warning. From within the Trifacta Application, schema changes may appear as broken recipe steps and can cause data corruption downstream. Schema validation can be applied to:

relational datasets (tables and views)
schematized files (e.g. Parquet)
file-based datasets (e.g. CSV files)

To assist with these issues, the Trifacta Application can be configured to monitor schema changes on your dataset. Schema validation performs the following actions on your dataset:

On read, the schema information from the dataset is captured and stored separately in the Alteryx database. This information identifies column names, data types, and ordering of the dataset.
When the dataset is read during job execution, the new schema information is read and compared to the stored version, which enables identification of changes to the dataset.
Tip
This check occurs as the first step of the job execution process and is labeled as Schema validation.
You can configure the Trifacta Application to halt job execution when schema validation issues have been encountered.
Tip
Configuration settings can be overridden for individual jobs.
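
The capture-and-compare flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory `stored_schemas` dictionary stands in for the separately stored schema information, the function names are hypothetical, and the whole-schema equality check is a simplification of whatever comparison the application actually performs.

```python
from typing import Dict, List, Optional, Tuple

# A schema here is an ordered list of (column name, data type) pairs.
Schema = List[Tuple[str, str]]

# Stand-in for the separately stored schema information (hypothetical;
# the real storage format is not described in this section).
stored_schemas: Dict[str, Schema] = {}


def capture_schema(dataset_id: str, schema: Schema) -> None:
    """On read: store a copy of the dataset's schema for later comparison."""
    stored_schemas[dataset_id] = list(schema)


def validate_schema(
    dataset_id: str,
    current: Schema,
    fail_on_change_default: bool,
    job_override: Optional[bool] = None,
) -> str:
    """At job execution: compare the current schema to the stored one.

    Returns "ok", "warning", or "halt". A job-level override, when given,
    takes precedence over the default setting.
    """
    if current == stored_schemas[dataset_id]:
        return "ok"
    fail = fail_on_change_default if job_override is None else job_override
    return "halt" if fail else "warning"
```

In this sketch, passing `job_override=True` halts a job on a schema change even when the default setting would only warn, which mirrors the per-job override mentioned in the tip above.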

Limitations

Refreshing a file-based dataset with parameters:

In a set of parameterized files, the first detected file is checked for schema. This schema is stored for reference. The other files are assumed to contain the exact same schema.
If there are changes in the schema of the first file, the other files are assumed to have those changes, too. If they do not, then there can be problems during sampling or transformation.
If the first file is renamed, moved, or deleted, a status code 404 error may be detected during schema validation. However, the job may be able to complete as expected.
Tip
If schema validation is failing due to any of the above changes, you can address the issue by recreating the dataset with parameters.
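
To make the first-file limitation concrete, the sketch below treats the first file's columns as the reference schema and lists later files that differ from it. The file names and helper functions are made up for illustration; this is the kind of manual pre-check you might run outside the application, not a product feature.

```python
from typing import List, Tuple

# Each entry is (file path, column names in order). Paths are made up.
FileSchema = Tuple[str, List[str]]


def reference_schema(files: List[FileSchema]) -> List[str]:
    """Return the schema of the first detected file, as described above."""
    return files[0][1]


def undetected_mismatches(files: List[FileSchema]) -> List[str]:
    """List files whose columns differ from the first file's columns.

    Per the limitation above, these differences are not checked during
    schema validation, so a separate check is needed to surface them.
    """
    expected = reference_schema(files)
    return [path for path, cols in files[1:] if cols != expected]


files = [
    ("sales-2023-01.csv", ["id", "region", "amount"]),
    ("sales-2023-02.csv", ["id", "region", "amount"]),
    ("sales-2023-03.csv", ["id", "amount"]),  # missing "region"
]
```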

Enable

Settings

At the project or workspace level, an administrator can set whether schemas are validated by default for outputs.

Tip

Workspace-level defaults can be overridden at the job level, even if the workspace-level settings are disabled. For more information, see Run Job Page.

For more information, see Dataprep Project Settings Page.

File settings

During the creation of an imported dataset, you can configure the following settings for schema validation:

Steps:

After a file has been selected in the Import Data page, click Edit settings.

In the Edit settings dialog:

Setting	Effects on schema validation
Detect structure	When enabled, the structure of the first chunk from the imported dataset is used for determining the schema of the dataset. Note: If the imported dataset is composed of multiple files, only the first file is used for schema validation purposes. If there are changes to the schema of the second or later files, they are undetected. When disabled, the structure of the file is ignored, and all data is imported as a single column. Schema validation is effectively disabled for the dataset.
Infer header	The first row of data is used as the column headers.
No headers	Default column names are used in the stored schema: `column1`, `column2`, and so on.

For more information, see File Import Settings.
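
The effect of the Infer header and No headers settings on the stored column names can be illustrated with a short sketch. It assumes a comma-delimited file, and `stored_column_names` is a hypothetical helper, not an application function.

```python
import csv
import io
from typing import List


def stored_column_names(text: str, infer_header: bool) -> List[str]:
    """Return the column names that would go into the stored schema.

    infer_header=True  -> the first row of data is used as the column headers.
    infer_header=False -> default names column1, column2, ... are used.
    """
    first_row = next(csv.reader(io.StringIO(text)))
    if infer_header:
        return first_row
    return [f"column{i}" for i in range(1, len(first_row) + 1)]


sample = "id,name,total\n1,Ana,9.50\n"
```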

Use

When a job is launched, the schema validation check is performed in parallel with the data ingestion step. Schema validation checks for:

Changes to the order of columns
Columns that have been deleted
Columns that have been added

The results of the schema validation check are reported in the Job Details page in the Schema validation stage.
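
The three checks listed above can be sketched as a comparison of two ordered lists of column names. The function name and the shape of the result are assumptions made for illustration; the actual report format is shown in the Job Details page.

```python
from typing import Dict, List


def diff_columns(stored: List[str], current: List[str]) -> Dict[str, object]:
    """Classify differences between two ordered lists of column names."""
    added = [c for c in current if c not in stored]
    deleted = [c for c in stored if c not in current]
    # Compare the relative order of the columns present in both schemas,
    # so that an add or delete alone is not also reported as a reorder.
    common_stored = [c for c in stored if c in current]
    common_current = [c for c in current if c in stored]
    return {
        "added": added,
        "deleted": deleted,
        "reordered": common_stored != common_current,
    }
```

Comparing only the shared columns for ordering is a design choice in this sketch: it keeps a single added or deleted column from being counted twice.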

Note

Jobs may be configured to fail if schema validation checks fail. If jobs are not configured to fail, they may complete with warnings and publish output data to the specified targets even when schema validation fails.

For more information, see Job Details Page.

When schema validation detects differences, you can explore those findings in detail from the Job Details page. See Schema Changes Dialog.

Job-level overrides

You can override the project or workspace level settings for schema validation for individual jobs. For more information, see Run Job Page.

Job results

In the Job Details page, you can review schema validation checks for the datasets in the job. For more information, see Job Details Page.

Schema Refresh

Note

This feature may not be available in all product editions. For more information on available features, see Compare Editions.

Schema refresh enables on-demand updating of your imported dataset schemas to capture changes to columns. For example, when you are working with datasets in a flow view, you can refresh your imported datasets' schemas by checking the source schema for changes. Schema refresh automatically generates a new initial sample, which allows you to gather fresh data in the Transformer page.

Schema refresh applies to:

Relational schemas
Schematized files
Delimited files
Note
Delimited files include CSVs and TSVs and can include other files whose delimiters can be inferred by the Trifacta Application during import. Delimited files do not contain data type information; data types are inferred by the Trifacta Application for these file types.
Converted file formats:
- JSON files that are converted during ingest. For more information, see Working with JSON v2.
- PDF. See Import PDF Data.
- Excel. See Import Excel Data.
- Google Sheets. See Import Google Sheets Data.
- For multi-sheet sources:
  - If each sheet is converted into a separate dataset, the schema of a dataset is refreshed from the source sheet.
  - If multiple sheets are combined into a single dataset, the schema of a dataset is refreshed from the first sheet in the source.

Key Benefits:

Reduces the number of duplicate or invalid datasets created from the same source.
Reduces challenges of replacing datasets and retaking samples.

Limitations

If a column's data type is modified but no other changes, such as column name changes, are detected, the data type change is not considered a schema drift error.
You cannot refresh the schemas of reference datasets or uploaded sources.
Schema refresh does not apply to any file formats that require conversion to native formats.
Note
Schema management does not work for JSON-based imported datasets that were created under the v1 legacy method of JSON import. All JSON imported datasets created under the legacy method (v1) of JSON import must be recreated to behave like v2 datasets with respect to conversion and schema management. Features developed in the future may not retroactively be supported in the v1 legacy mode. For more information, see Working with JSON v2.

Note

If you have imported a flow from an earlier version of the application, you may receive warnings of schema drift during job execution when there have been no changes to the underlying schema. This is a known issue. The workaround is to create a new version of the underlying imported dataset and use it in the imported flow.

Limitations for parameterized datasets

Parameterized files:

Note

If you attempt to refresh the schema of a parameterized dataset based on a set of files, only the schema for the first file is checked for changes. If changes are detected, the other files are assumed to contain those changes as well. This can lead to changes being assumed or undetected in later files and potential data corruption in the flow.

Parameterized tables:

Note

Refreshing the schema of a parameterized dataset using custom SQL is not supported.

Effects of refreshing schemas

Warning

When you choose to refresh a schema, the schema is refreshed without checking for changes, which forces the deletion of all samples and recollection of a new initial sample. All pre-existing samples must be recreated. In some environments, this sample collection incurs costs.

When you refresh the schema in the Trifacta Application:

The source schema is applied to the imported dataset in all cases.
- All the existing samples are invalidated.
- A new initial sample is generated, which updates the previewed data. This may take some time.
Addition or removal of columns may cause recipe steps to break, which can cause any transformation jobs to fail. You must fix these broken steps in the Recipe panel.
Tip
For some data-dependent recipe steps, such as joins and pivots, that are listed as broken, you may be able to edit the step and immediately save it to repair the step.
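
As a simplified picture of why added or removed columns break recipe steps, the sketch below flags steps that reference a column no longer present after a refresh. The step representation and the `broken_steps` helper are hypothetical, and columns created by earlier recipe steps are deliberately not tracked.

```python
from typing import List, Set, Tuple

# Each recipe step is (description, source columns it references).
Step = Tuple[str, List[str]]


def broken_steps(recipe: List[Step], refreshed_columns: Set[str]) -> List[int]:
    """Return the 1-based positions of steps that reference a missing column.

    Simplification: columns created by earlier recipe steps are not tracked.
    """
    return [
        position
        for position, (_, columns) in enumerate(recipe, start=1)
        if not set(columns) <= refreshed_columns
    ]


recipe: List[Step] = [
    ("rename customer", ["customer"]),
    ("derive total_with_tax", ["total", "tax"]),
    ("delete notes", ["notes"]),
]
```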

Refresh your schemas

For more information on how to refresh the schemas of your datasets, see:

Via API:

For more information, see Dataprep by Trifacta: API Reference docs

Output Schemas

Output type conversions

Depending on the output system, Dataprep by Trifacta can deliver your results in columns and data types native to the target. For more information, see Type Conversions.

Target schemas

As needed, you can import a dataset whose columns serve as the target of your transformation efforts. When this target schema is imported, it is superimposed on the columns of your dataset in the Transformer page, allowing you to quickly change the naming, order, and data typing of your columns to match the target schema. For more information, see Overview of Target Schema Mapping.
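
Conceptually, matching a target schema comes down to renaming, reordering, and retyping columns. The sketch below shows that idea on plain Python rows. It is not how the Transformer page works internally: the `match_target` function, the `CASTS` table, and the type labels are all assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical converters keyed by a made-up type label.
CASTS: Dict[str, Callable[[str], object]] = {
    "Integer": int,
    "Decimal": float,
    "String": str,
}


def match_target(
    rows: List[Dict[str, str]],
    target: List[Tuple[str, str]],
    renames: Dict[str, str],
) -> List[Dict[str, object]]:
    """Rename, reorder, and retype each row to follow the target schema.

    renames maps a target column name to the source column that feeds it;
    target columns absent from renames are read from a same-named column.
    """
    result = []
    for row in rows:
        out: Dict[str, object] = {}
        for name, type_label in target:  # dict insertion order follows the target
            source = renames.get(name, name)
            out[name] = CASTS[type_label](row[source])
        result.append(out)
    return result
```

For example, a row `{"amt": "9.5", "cust_id": "7"}` mapped to a target of `customer_id` (Integer) followed by `amount` (Decimal) comes back with the columns renamed, in target order, and converted from strings.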
