Character Encoding

This section describes how Dataprep by Trifacta manages character encoding on import, within the application, and on export.

Overview of Character Encoding

Character encoding is the mechanism by which numeric data is used to represent characters, including alphanumeric characters and punctuation, in languages around the world. To ensure that different machines display the same text, each machine references one or more of the supported file encoding types, which are standards for representing characters. For example, a machine in the United Kingdom correctly displays the letter "A" sent from a machine in the United States if both machines are using the same file encoding type.

Many languages around the world require hundreds or even thousands of distinct characters. As a result, encodings for these languages may require a larger number of bits per character to represent the full character set.
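The effect of character set size on storage can be sketched with a short Python example. It is an illustration only and is not part of the product; the sample characters, the list of encodings, and the helper name `encoded_length` are assumptions chosen for the example.

```python
# Compare how many bytes several encodings need for the same character.
# The sample characters and encodings below are illustrative choices.
samples = [
    "A",        # Latin capital letter A
    "\u00e9",   # Latin small letter e with acute accent
    "\u65e5",   # CJK ideograph meaning "day" or "sun"
]
encodings = ["ascii", "latin-1", "utf-8", "utf-16-le", "shift_jis"]


def encoded_length(ch, encoding):
    """Return the number of bytes used for ch, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


# Map each character to its byte length in every encoding.
table = {ch: {enc: encoded_length(ch, enc) for enc in encodings} for ch in samples}

for ch, row in table.items():
    print(repr(ch), row)
```

In this sketch, the letter "A" fits in one byte in ASCII and UTF-8, while the CJK character cannot be represented in ASCII at all and needs two or three bytes in the encodings that support it.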

Character Encoding on Input

By default, Dataprep by Trifacta supports UTF-8 on input. As needed, individual users can change the file encoding of input files. For example, when a file uses a double-byte encoding, you can identify that encoding in the file settings during import, so that the data is parsed properly.
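The following Python sketch shows why the declared encoding matters when a file is read. It does not use the product's import settings; Shift_JIS stands in for a double-byte encoding, and the helper name `decode_with` is hypothetical.

```python
# Simulate the bytes of a file that was saved with a double-byte encoding.
text = "\u65e5\u672c\u8a9e"      # three CJK characters
raw = text.encode("shift_jis")   # two bytes per character, six bytes total


def decode_with(raw_bytes, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        return None


# Reading with the default assumption (UTF-8) fails for these bytes.
as_utf8 = decode_with(raw, "utf-8")

# Reading with the encoding the file was actually written in succeeds.
as_sjis = decode_with(raw, "shift_jis")

print(as_utf8)          # None: the bytes are not valid UTF-8
print(as_sjis == text)  # True: the original text is recovered
```

The same bytes are either unreadable or correct depending only on which encoding the reader assumes, which is why the encoding should be identified at import time.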

Character Encoding within the Application

Within the Trifacta Application, you can use the following functions to modify character encodings:

Item | Description

BASE64ENCODE Function | Converts an input value to base64 encoding with optional padding with an equals sign (=). Input can be of any type. Output type is String.

BASE64DECODE Function | Converts an input base64 value to text. Output type is String.

UNICODE Function | Generates the Unicode index value for the first character of the input string.
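For readers unfamiliar with these operations, the Python standard library offers rough analogues. The sketch below is not the product's implementation and does not show the syntax of the functions listed above; it only illustrates base64 encoding with and without padding, base64 decoding, and looking up the Unicode index of a string's first character. The sample value is an assumption.

```python
import base64

value = "Hello"  # sample input chosen for illustration

# Base64-encode the UTF-8 bytes of the value; the result is padded with "=".
encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")

# The same encoding with the optional padding removed.
unpadded = encoded.rstrip("=")

# Decode the padded base64 value back to text.
decoded = base64.b64decode(encoded).decode("utf-8")

# Unicode index (code point) of the first character of the input string.
code_point = ord(value[0])

print(encoded)     # SGVsbG8=
print(unpadded)    # SGVsbG8
print(decoded)     # Hello
print(code_point)  # 72
```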

Character Encoding on Output

All files are published with UTF-8 encoding.