Enable Integration with Compressed Clusters
The Designer Cloud Powered by Trifacta platform can be configured to integrate with fully compressed Hadoop clusters. The following cluster compression methods are supported:
Gzip
Bzip2
Snappy
Supported compressed running environments:
Spark
For more information, see Running Environment Options.
Hadoop clusters can be configured to enable compression of intermediate and/or final output data by default. The settings that control this behavior are typically found in mapred-site.xml and core-site.xml.
Prerequisites
Note
If you have not done so already, you must retrieve cluster configuration files and store them on the Trifacta node. For more information, see Configure for Hadoop.
Enable integration with compression
Steps:
Edit the local version of mapred-site.xml. This file is typically located in /etc/conf/hadoop.
Add the following properties:
<configuration>
  ...
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  ...
</configuration>
Save the file and complete the following steps.
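Before moving on, it can help to confirm that the edited file contains the properties listed above. The following is a minimal Python sketch, not part of the platform: the names EXPECTED, read_hadoop_properties, and find_mismatches are hypothetical, and the expected values are copied from the example in this section, so adjust them if your cluster uses a different codec.

```python
import xml.etree.ElementTree as ET

# Property names and values copied from the mapred-site.xml example in this section.
# Change the codec values if your cluster does not use Snappy.
EXPECTED = {
    "mapreduce.map.output.compress": "true",
    "mapreduce.map.output.compress.codec": "org.apache.hadoop.io.compress.SnappyCodec",
    "mapreduce.output.fileoutputformat.compress": "true",
    "mapreduce.output.fileoutputformat.compress.type": "BLOCK",
    "mapreduce.output.fileoutputformat.compress.codec": "org.apache.hadoop.io.compress.SnappyCodec",
}


def read_hadoop_properties(xml_text):
    """Return a {name: value} dict for every <property> in a Hadoop XML config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def find_mismatches(xml_text, expected=EXPECTED):
    """Return {name: actual_value} for properties that are missing or differ.

    A missing property is reported with the value None.
    """
    actual = read_hadoop_properties(xml_text)
    return {
        name: actual.get(name)
        for name, wanted in expected.items()
        if actual.get(name) != wanted
    }
```

An empty result from find_mismatches means all five properties are present with the expected values. Pass it the contents of your local mapred-site.xml.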
Specify codecs
One or more compression/decompression methods (codecs) must be specified in core-site.xml.
Steps:
Edit the local version of core-site.xml. This file is typically located in /etc/conf/hadoop.
Specify the codecs to use in the io.compression.codecs property. Supported values:
Codec     Value
Gzip      org.apache.hadoop.io.compress.GzipCodec
Bzip2     org.apache.hadoop.io.compress.BZip2Codec
Snappy    org.apache.hadoop.io.compress.SnappyCodec
In the following example, all three codecs have been specified:
<configuration>
  ...
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  ...
</configuration>
Save the file.
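Because io.compression.codecs holds a comma-separated list of class names, a typo in any entry is easy to miss. The following Python sketch checks each entry against the three supported codec classes listed in this section. It is illustrative only: SUPPORTED_CODECS and unsupported_codecs are hypothetical names, not platform or Hadoop APIs.

```python
import xml.etree.ElementTree as ET

# Codec classes listed as supported in this section.
SUPPORTED_CODECS = {
    "org.apache.hadoop.io.compress.GzipCodec": "Gzip",
    "org.apache.hadoop.io.compress.BZip2Codec": "Bzip2",
    "org.apache.hadoop.io.compress.SnappyCodec": "Snappy",
}


def unsupported_codecs(xml_text):
    """Return the io.compression.codecs entries that are not in SUPPORTED_CODECS.

    Raises KeyError if the property is not present in the configuration.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "io.compression.codecs":
            value = prop.findtext("value") or ""
            # Split the comma-separated list and ignore empty entries.
            codecs = [c.strip() for c in value.split(",") if c.strip()]
            return [c for c in codecs if c not in SUPPORTED_CODECS]
    raise KeyError("io.compression.codecs is not set")
```

An empty list means every configured codec is one of the supported values. Any returned entry is either misspelled or a codec not covered by this section.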
Configure platform
Apply the following changes from within the application to enable the Designer Cloud Powered by Trifacta platform to communicate with the compressed cluster.
Steps:
Log in to the application.
In the Admin Settings page, set the following settings:
Setting: hadoopDefaultClusterCompression.enabled
Description: To enable integration with a compressed cluster, set this value to true.

Setting: hadoopDefaultClusterCompression.compression
Description: Set this value to the type of compression applied on the cluster:
  none - (default) no cluster compression
  gzip
  bzip2
  snappy
Save your changes and restart the platform.
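The platform settings should agree with the codec configured on the cluster. The following Python sketch derives the two setting values from a Hadoop codec class name. It only computes the values to enter in the Admin Settings page and does not call any platform API. The function name platform_settings_for is hypothetical, and treating a cluster with no codec as enabled=false with compression "none" is an assumption based on the defaults described above.

```python
# Mapping from the Hadoop codec classes in this document to the values
# accepted by hadoopDefaultClusterCompression.compression.
CODEC_TO_SETTING = {
    "org.apache.hadoop.io.compress.GzipCodec": "gzip",
    "org.apache.hadoop.io.compress.BZip2Codec": "bzip2",
    "org.apache.hadoop.io.compress.SnappyCodec": "snappy",
}


def platform_settings_for(codec_class):
    """Return the two Admin Settings values that match a cluster codec class.

    Pass None for a cluster without compression (assumed to map to the
    default value "none" with integration disabled).
    """
    if codec_class is None:
        return {
            "hadoopDefaultClusterCompression.enabled": False,
            "hadoopDefaultClusterCompression.compression": "none",
        }
    if codec_class not in CODEC_TO_SETTING:
        raise ValueError("Unsupported codec class: %s" % codec_class)
    return {
        "hadoopDefaultClusterCompression.enabled": True,
        "hadoopDefaultClusterCompression.compression": CODEC_TO_SETTING[codec_class],
    }
```

For example, calling platform_settings_for with the Snappy codec class used in the mapred-site.xml example yields enabled set to true and compression set to snappy.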