The Designer Cloud Powered by Trifacta platform can be configured to integrate with fully compressed Hadoop clusters. The following cluster compression methods are supported:
Gzip
Bzip2
Snappy
Supported compressed running environments:
Spark
For more information, see Running Environment Options.
Hadoop clusters can be configured to enable compression of intermediate and/or final output data by default. The settings typically used to do so are in mapred-site.xml and core-site.xml.
Note
If you have not done so already, you must retrieve cluster configuration files and store them on the Trifacta node. For more information, see Configure for Hadoop.
Steps:
Edit the local version of mapred-site.xml. This file is typically located in /etc/conf/hadoop.
Add the following properties:
<configuration>
  ...
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  ...
</configuration>
Save the file and complete the following steps.
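Before moving on, the edited file can be sanity-checked. The following is a minimal sketch, not part of the platform: it parses the text of a Hadoop configuration file and reports any of the properties listed above that are missing or set to a different value. The function names parse_hadoop_properties and find_mismatches are hypothetical helpers introduced for illustration only, and the expected values simply mirror the Snappy example above.

```python
import xml.etree.ElementTree as ET

# Property names and values copied from the mapred-site.xml example above.
EXPECTED_MAPRED = {
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec": "org.apache.hadoop.io.compress.SnappyCodec",
    "mapreduce.output.fileoutputformat.compress": "true",
    "mapreduce.output.fileoutputformat.compress.type": "BLOCK",
    "mapreduce.output.fileoutputformat.compress.codec": "org.apache.hadoop.io.compress.SnappyCodec",
}


def parse_hadoop_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def find_mismatches(props, expected=EXPECTED_MAPRED):
    """Return {name: (expected, actual)} for missing or differing properties."""
    return {
        name: (want, props.get(name))
        for name, want in expected.items()
        if props.get(name) != want
    }
```

To use it, read the local mapred-site.xml into a string and pass it to parse_hadoop_properties, then pass the result to find_mismatches. An empty result means all five properties match the example.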
One or more compression/decompression methods (codecs) must be specified in core-site.xml.
Steps:
Edit the local version of core-site.xml. This file is typically located in /etc/conf/hadoop.
Specify the codecs to use in the io.compression.codecs property. Supported values:
Codec     Value
Gzip      org.apache.hadoop.io.compress.GzipCodec
Bzip2     org.apache.hadoop.io.compress.BZip2Codec
Snappy    org.apache.hadoop.io.compress.SnappyCodec
In the following example, all three codecs have been specified:
<configuration>
  ...
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  ...
</configuration>
Save the file.
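As a quick check of the codec list, the sketch below reads the io.compression.codecs property from the text of core-site.xml and flags any entry that is not one of the three supported codec classes listed above. It is an illustrative helper only; configured_codecs and unsupported_codecs are hypothetical names, not platform or Hadoop APIs.

```python
import xml.etree.ElementTree as ET

# Codec classes from the supported-values table above.
SUPPORTED_CODECS = {
    "org.apache.hadoop.io.compress.GzipCodec": "Gzip",
    "org.apache.hadoop.io.compress.BZip2Codec": "Bzip2",
    "org.apache.hadoop.io.compress.SnappyCodec": "Snappy",
}


def configured_codecs(core_site_xml):
    """Return the codec class names listed in io.compression.codecs."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "io.compression.codecs":
            value = prop.findtext("value") or ""
            # The property is a comma-separated list; tolerate stray spaces.
            return [c.strip() for c in value.split(",") if c.strip()]
    return []


def unsupported_codecs(codecs):
    """Return the entries that are not in the supported-values table."""
    return [c for c in codecs if c not in SUPPORTED_CODECS]
```

An empty list from configured_codecs means the property is absent or empty, in which case no codec has been specified yet.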
Apply the following changes from within the application to enable the Designer Cloud Powered by Trifacta platform to communicate with the compressed cluster.
Steps:
Log in to the application.
In the Admin Settings page, set the following settings:
Setting                                        Description
hadoopDefaultClusterCompression.enabled        To enable integration with a compressed cluster, set this value to true.
hadoopDefaultClusterCompression.compression    Set this value to the type of compression applied on the cluster: none (default, no cluster compression), gzip, bzip2, or snappy.
Save your changes and restart the platform.
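To keep the platform setting consistent with the cluster, the codec class used on the cluster can be translated to the matching hadoopDefaultClusterCompression.compression value. The sketch below is an assumption-laden illustration: the mapping is inferred by pairing the codec table with the setting values listed above, and compression_setting_for is a hypothetical helper, not a platform function.

```python
# Assumed pairing of Hadoop codec classes with the platform setting values.
CODEC_TO_SETTING = {
    "org.apache.hadoop.io.compress.GzipCodec": "gzip",
    "org.apache.hadoop.io.compress.BZip2Codec": "bzip2",
    "org.apache.hadoop.io.compress.SnappyCodec": "snappy",
}


def compression_setting_for(codec_class):
    """Map a codec class name to a compression setting value.

    Returns "none" (the default) when no codec is given, and raises
    ValueError for a codec that is not in the supported list.
    """
    if not codec_class:
        return "none"
    try:
        return CODEC_TO_SETTING[codec_class.strip()]
    except KeyError:
        raise ValueError(f"unsupported codec: {codec_class!r}") from None
```

For the Snappy configuration shown earlier, this yields snappy as the value to enter in the Admin Settings page.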