Character Encoding
This section describes how Dataprep by Trifacta manages character encoding on import, within the application, and on export.
Overview of Character Encoding
Character encoding is the mechanism by which numeric data represents characters, including alphanumeric characters and punctuation, in languages around the world. To ensure that different machines display the same character on-screen, each machine references one or more supported encodings, which are standards for representing characters. For example, a machine in the United Kingdom correctly displays the letter "A" sent from a machine in the United States if both machines use the same encoding.
Many languages require hundreds or even thousands of distinct characters. As a result, encodings for these languages may require a larger number of bits to represent each character.
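As a minimal illustration of why both machines must agree on an encoding, the following Python sketch encodes text to bytes and decodes it again, once with the matching encoding and once with a different one. It uses only the Python standard library and is not part of the product; the variable names are illustrative.

```python
# "A" is stored as the single byte 0x41 in UTF-8 (and in ASCII).
sent = "A".encode("utf-8")
received = sent.decode("utf-8")

# An accented character (U+00E9, e with acute accent) takes two bytes in UTF-8.
accented = "\u00e9".encode("utf-8")

# Decoding with the same encoding recovers the original character.
same_encoding = accented.decode("utf-8")

# Decoding with a different encoding (Latin-1 here) yields two wrong characters.
wrong_encoding = accented.decode("latin-1")

print(received, same_encoding, wrong_encoding)
```

When the receiving side assumes a different encoding than the sender used, the bytes are still valid data, but they map to the wrong characters.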
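The relationship between the size of a character set and the number of bits needed can be seen by counting the bytes that different characters occupy. The sketch below is a standard-library Python illustration, not product behavior; the sample characters are chosen arbitrarily.

```python
# Sample characters from three scripts, written as escapes so the
# example does not depend on the encoding of this file.
samples = {
    "Latin A": "A",               # U+0041
    "Greek alpha": "\u03b1",      # U+03B1
    "CJK ideograph": "\u65e5",    # U+65E5
}

# Number of bytes each character occupies in UTF-8 (variable width).
utf8_widths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# Number of bytes each character occupies in UTF-16 (little-endian, no BOM).
utf16_widths = {name: len(ch.encode("utf-16-le")) for name, ch in samples.items()}

for name in samples:
    print(name, utf8_widths[name], utf16_widths[name])
```

In UTF-8, the Latin letter needs one byte while the CJK ideograph needs three; in UTF-16, each of these three characters needs two bytes.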
Character Encoding on Input
By default, Dataprep by Trifacta supports UTF-8 on input. As needed, individual users can change the file encoding of input files. For example, when you import a file that uses a double-byte encoding, you can specify that encoding in the file settings so that the data is parsed properly.
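To show why a double-byte file must be identified on import, the following Python sketch writes a small file in Shift_JIS and then reads it back twice: once with the declared encoding and once assuming UTF-8. This is an illustration of the general problem using the Python standard library, not a description of how the product parses files; the file name and the helper `read_with` are hypothetical.

```python
import os
import tempfile

# Three Japanese characters followed by a numeric field, as one CSV row.
text = "\u65e5\u672c\u8a9e,1\n"

# Write the row to a temporary file using the Shift_JIS double-byte encoding.
path = os.path.join(tempfile.mkdtemp(), "input.csv")
with open(path, "w", encoding="shift_jis", newline="") as f:
    f.write(text)


def read_with(file_path, encoding):
    """Read the whole file using the given encoding (hypothetical helper)."""
    with open(file_path, "r", encoding=encoding, newline="") as f:
        return f.read()


# Reading with the declared encoding recovers the original text.
declared = read_with(path, "shift_jis")

# Reading the same bytes as UTF-8 fails because they are not valid UTF-8.
try:
    read_with(path, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(declared == text, utf8_ok)
```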
Character Encoding within the Application
Within the Trifacta Application, you can use the following functions to modify character encodings:
Item | Description |
---|---|
 | Converts an input value to base64 encoding with optional padding with an equals sign (=). |
 | Converts an input base64 value to text. Output type is String. |
 | Generates the Unicode index value for the first character of the input string. |
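The three conversions described in the table can be illustrated with rough Python standard-library analogues. These are not the application's functions or syntax, only a sketch of what each conversion computes; the input string is arbitrary.

```python
import base64

value = "hello"

# Convert a text value to base64; the result ends with "=" padding.
encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")

# The same base64 value with the optional "=" padding stripped.
unpadded = encoded.rstrip("=")

# Convert the base64 value back to text.
decoded = base64.b64decode(encoded).decode("utf-8")

# Unicode index (code point) of the first character of the input string.
code_point = ord(value[0])

print(encoded, unpadded, decoded, code_point)
```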
Character Encoding on Output
All files are published with UTF-8 encoding.