Configure for Spark
The Designer Cloud Powered by Trifacta platform can be configured to use Spark to execute transformation jobs, to generate a visual profile of job results, or both.
A visual profile is a visual summary of a dataset. It visually identifies areas of interest, including valid, missing, or mismatched values, as well as useful statistics on column data.
Visual profiles can be optionally generated using Spark.
In the Trifacta Application, visual profiles appear in the Job Details page when a job has successfully executed and a profile has been requested for it. See Job Details Page.
For more information, see Overview of Visual Profiling.
Apache Spark provides in-memory processing capabilities for a Hadoop cluster. In Spark, the processing of the large volume of computations to generate this information is performed in-memory. This method reduces disk access and significantly improves overall performance. For more information, see https://spark.apache.org/.
The Spark Job Service is a Scala-based capability for executing jobs and profiling your job results as an extension of job execution. This feature leverages the computing power of your existing Hadoop cluster to increase job execution and profiling performance. Features:
Requires no additional installation on the Trifacta node.
Support for yarn-cluster mode ensures that all Spark processing is handled on the Hadoop cluster.
Exact bin counts appear for profile results, except for Top-N counts.
Supported Versions
The following versions of Spark are supported:
Note
Depending on the version of Spark and your Hadoop distribution, additional configuration may be required. See Configure Spark Version below.
Spark 3.2.0, Spark 3.2.1
Note
Spark 3.2.x is supported only on specific deployments and versions of the following environments:
Azure Databricks 10.x
AWS Databricks 10.x
Spark 3.0.1
Note
Spark 3.0.1 is supported only on specific deployments and versions of the following environments:
AWS Databricks 7.3 LTS (Recommended)
EMR 6.2.1, EMR 6.3
Spark 2.4.6 (Recommended)
Spark 2.3.x
Prerequisites
Note
The Spark History Server is not supported for general use. It should be enabled only for short-term debugging tasks, as it requires considerable resources.
Before you begin, please verify the following:
For additional prerequisites for a kerberized environment, see Configure for Kerberos Integration.
Additional configuration is required for secure impersonation. See Configure for Secure Impersonation.
Configure the Designer Cloud Powered by Trifacta platform
Configure Spark Job Service
The Spark Job Service must be enabled for both execution and profiling jobs to work in Spark.
Below is a sample configuration and description of each property. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
"spark-job-service" : { "systemProperties" : { "java.net.preferIPv4Stack": "true", "SPARK_YARN_MODE": "true" }, "sparkImpersonationOn": false, "optimizeLocalization": true, "mainClass": "com.trifacta.jobserver.SparkJobServer", "jvmOptions": [ "-Xmx128m" ], "hiveDependenciesLocation": "%(topOfTree)s/hadoop-deps/cdh-6.2/build/libs", "env": { "SPARK_JOB_SERVICE_PORT": "4007", "SPARK_DIST_CLASSPATH": "", "MAPR_TICKETFILE_LOCATION": "<MAPR_TICKETFILE_LOCATION>", "MAPR_IMPERSONATION_ENABLED": "0", "HADOOP_USER_NAME": "trifacta", "HADOOP_CONF_DIR": "%(topOfTree)s/conf/hadoop-site/" }, "enabled": true, "enableHiveSupport": true, "enableHistoryServer": false, "classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/services/spark-job-server/build/bundle/*:%(topOfTree)s/%(hadoopBundleJar)s", "autoRestart": false, },
The following properties can be modified based on your needs:
Note
Unless explicitly told to do so, do not modify any of the above properties that are not listed below.
Property | Description |
---|---|
sparkImpersonationOn | Set this value to true to enable secure impersonation for the Spark Job Service. |
jvmOptions | This array of values can be used to pass parameters to the JVM that manages the Spark Job Service. |
hiveDependenciesLocation | If Spark is integrated with a Hive instance, set this value to the path where Hive dependencies are installed on the Trifacta node. For more information, see Configure for Hive. |
env.SPARK_JOB_SERVICE_PORT | Set this value to the listening port number on the cluster for Spark. Default value is 4007. |
env.HADOOP_USER_NAME | The username of the Hadoop principal used by the platform. By default, this value is trifacta. |
env.HADOOP_CONF_DIR | The directory on the Trifacta node where the Hadoop cluster configuration files are stored. Do not modify unless necessary. |
enabled | Set this value to true to enable the Spark Job Service. |
enableHiveSupport | See below. |
After making any changes, save the file and restart the platform. See Start and Stop the Platform.
Configure service for Hive
Depending on the environment, please apply the following configuration changes to manage Spark interactions with Hive:
Environment | spark.enableHiveSupport |
---|---|
Hive is not present | false |
Hive is present but not enabled. | false |
Hive is present and enabled | true |
If Hive is present on the cluster, whether enabled or disabled, the hive-site.xml file must be copied to the correct directory:

cp /etc/hive/conf/hive-site.xml /opt/trifacta/conf/hadoop-site/hive-site.xml

At this point, the platform only expects that a hive-site.xml file has been installed on the Trifacta node. A valid connection is not required. For more information, see Configure for Hive.
Configure Spark
After the Spark Job Service has been enabled, please complete the following sections to configure it for the Designer Cloud Powered by Trifacta platform.
Yarn cluster mode
All jobs submitted to the Spark Job Service are executed in YARN cluster mode. No other cluster mode is supported for the Spark Job Service.
Configure access for secure impersonation
The Spark Job Service can run under secure impersonation. For more information, see Configure for Secure Impersonation.
When running under secure impersonation, the Spark Job Service requires access to the following folders. Read, write, and execute access must be provided to the Alteryx user and the impersonated user.
Folder Name | Platform Configuration Property | Default Value | Description |
---|---|---|---|
Alteryx Libraries folder | "hdfs.pathsConfig.libraries" | /trifacta/libraries | Maintains JAR files and other libraries required by Spark. No sensitive information is written to this location. |
Alteryx Temp files folder | "hdfs.pathsConfig.tempFiles" | /trifacta/tempfiles | Holds temporary progress information files for YARN applications. Each file contains a number indicating the progress percentage. No sensitive information is written to this location. |
Alteryx Dictionaries folder | "hdfs.pathsConfig.dictionaries" | /trifacta/dictionaries | Contains definitions of dictionaries created for the platform. |
Identify Hadoop libraries on the cluster
The Spark Job Service does not require additional installation on the Trifacta node or on the Hadoop cluster. Instead, it references the spark-assembly JAR that is provided with the Alteryx distribution.
This JAR file does not include the Hadoop client libraries. You must point the Designer Cloud Powered by Trifacta platform to the appropriate libraries.
Steps:
In platform configuration, locate the spark-job-service configuration block. Set the following property:

"spark-job-service.env.HADOOP_CONF_DIR": "<path_to_Hadoop_conf_dir_on_Hadoop_cluster>",

Property | Description |
---|---|
spark-job-service.env.HADOOP_CONF_DIR | Path to the Hadoop configuration directory on the Hadoop cluster. |

In the same block, set the SPARK_DIST_CLASSPATH property according to your Hadoop distribution.

Save your changes.
Locate Hive dependencies location
If the Designer Cloud Powered by Trifacta platform is also connected to a Hive instance, please verify the location of the Hive dependencies on the Trifacta node. The following example is from Cloudera 6.2:
Note
This parameter value is distribution-specific. Please update based on your Hadoop distribution.
"spark-job-service.hiveDependenciesLocation":"%(topOfTree)s/hadoop-deps/cdh-6.2/build/libs",
Specify YARN queue for Spark jobs
Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the Designer Cloud Powered by Trifacta platform are submitted to this queue.
Steps:
In platform configuration, locate the following:

"spark.props.spark.yarn.queue": "default",

Replace default with the name of the queue.

Save your changes.
Spark tuning properties
In addition to the specific Spark properties that are exposed in platform configuration, you can pass properties and their values to the Spark running environment, which are interpreted and applied during the execution of your jobs. In platform configuration, these properties and values are passed in through the spark.props
area, where you specify the property name and value in JSON format.
For example, you can pass additional properties to Spark such as number of cores, executors, and memory allocation. Some examples are below.
Note
The following values are default values. If you are experiencing performance issues, you can modify the values. If you require further assistance, please contact Alteryx Support.
In Admin Settings:

"spark.props.spark.executor.memory": "6GB",
"spark.props.spark.executor.cores": "2",
"spark.props.spark.driver.memory": "2GB",

In trifacta-conf.json:

"spark": {
  ...
  "props": {
    "spark.executor.memory": "6GB",
    "spark.executor.cores": "2",
    "spark.driver.memory": "2GB"
    ...
  }
},
If you have sufficient cluster resources, you should pass the following values:

In Admin Settings:

"spark.props.spark.executor.memory": "16GB",
"spark.props.spark.executor.cores": "5",
"spark.props.spark.driver.memory": "16GB",

In trifacta-conf.json:

"spark": {
  ...
  "props": {
    "spark.executor.memory": "16GB",
    "spark.executor.cores": "5",
    "spark.driver.memory": "16GB"
    ...
  }
},
Notes:
The above values must be below the per-container thresholds set by YARN. Please verify your settings against the following parameters in yarn-site.xml:

yarn.scheduler.maximum-allocation-mb
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.memory-mb
yarn.nodemanager.resource.cpu-vcores
If you are using YARN queues, please verify that these values are set below max queue thresholds.
For more information on these properties, see https://spark.apache.org/docs/2.2.0/configuration.html.
Save your changes.
Configure Batch Job Runner for Spark service
You can modify the following Batch Job Runner configuration settings for the Spark service.
Note
Avoid modifying these settings unless you are experiencing issues with the user interface reporting jobs as having failed while the Spark job continues to execute on YARN.
Setting | Description | Default |
---|---|---|
batchserver.spark.requestTimeoutMillis | Specifies the number of milliseconds that the Batch Job Runner service should wait for a response from Spark. If this timeout is exceeded, the UI changes the job status to failed. The YARN job may continue. | 600000 (600 seconds) |
Review and set the following parameter.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Verify that the Spark master property is set accordingly:
"spark.master": "yarn",
Review and set the following parameter based on your Hadoop distribution:

Note
This setting is ignored for EMR, Azure Databricks, and AWS Databricks, which always use the vendor libraries.

Hadoop Distribution | Parameter Value | Value is required? |
---|---|---|
Cloudera Data Platform 7.1 | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is required. |
CDH 6.x | "spark.useVendorSparkLibraries": true, | Yes. Additional configuration is in the next section. |
Locate the following setting:

"spark.version"

Set the above value based on the Hadoop distribution in use:

Hadoop Distribution | spark.version | Notes |
---|---|---|
 | 3.0.1 | This version of Spark is available for selection through the Trifacta Application. It is supported for a limited number of running environments. Additional information is provided later. |
Cloudera Data Platform 7.1 | 2.4.cdh6.3.3.plus | Please set the Spark version to the value indicated. This special value accounts for unexpected changes to filenames in the CDH packages. |
CDH 6.3.3 | 2.4.cdh6.3.3.plus | Please set the Spark version to the value indicated. This special value accounts for unexpected changes to filenames in the CDH packages. |
Note
If the Trifacta node is installed on an edge node of the cluster, you may skip this section.
You must acquire native Hadoop libraries from the cluster if you are using any of the following versions:
Hadoop version | Library location on cluster | Trifacta node location |
---|---|---|
Cloudera Data Platform 7.1 | /opt/cloudera/parcels/CDH-7.1.1-1.cdh7.1.1.*/ The last directory name may vary between minor distributions. | See section below. |
Cloudera 6.0 or later | /opt/cloudera/parcels/CDH | See section below. |
Note
Whenever the Hadoop distribution is upgraded on the cluster, the new versions of these libraries must be recopied to the following locations on the Trifacta node. This maintenance task is not required if the Trifacta node is an edge node of the cluster.
For more information on acquiring these libraries, please see the documentation provided with your Hadoop distribution.
To integrate with CDH 6.x, the platform must use the native Spark libraries. Please add the following properties to your configuration.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Set sparkBundleJar to the following:

For Cloudera 6.x:
"sparkBundleJar":"/opt/cloudera/parcels/CDH/lib/spark/jars/*:/opt/cloudera/parcels/CDH/lib/spark/hive/*"
For Cloudera Data Platform, see "Configure Spark for Cloudera Data Platform."
For the Spark Job Service, the Spark bundle JAR must be added to the classpath:
Note
The key modification is to remove the topOfTree element from the sparkBundleJar entry.

For Cloudera 6.x:
"spark-job-service": { "classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:/usr/lib/hdinsight-datalake/*:%(sparkBundleJar)s:%(topOfTree)s/%(hadoopBundleJar)s" },
For Cloudera Data Platform, see "Configure Spark for Cloudera Data Platform."
In the spark.props section, add the following property:

For Cloudera 6.x:
"spark.yarn.jars":"local:/opt/cloudera/parcels/CDH/lib/spark/jars/*,local:/opt/cloudera/parcels/CDH/lib/spark/hive/*",
For Cloudera Data Platform, see "Configure Spark for Cloudera Data Platform."
Save your changes.
The Designer Cloud Powered by Trifacta platform can integrate with Spark on Cloudera Data Platform (CDP) to run jobs on ACID tables in the following deployment:
CDP Private Cloud 7.1.5
Spark 2.4.x
Hive Warehouse Connector
Note
Spark Direct Reader mode only is supported for Hive Warehouse Connector.
Please complete the following additional steps to enable this integration.

Steps:

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

CDP 7.1.x requires the use of Spark 2.4.x, which requires a special Scala version. Please review and, if needed, change the following settings:
"spark.version": "2.4.cdh6.3.3.plus", "spark.scalaVersion": "2.11", "spark.useVendorSparkLibraries": true, "sparkBundleJar": "/opt/cloudera/parcels/CDH/lib/spark/jars/*:/opt/cloudera/parcels/CDH/lib/spark/hive/*",
Please insert the following settings in the spark.props area of configuration:

"spark": {
  ...
  "props": {
    "spark.yarn.jars": "local:/opt/cloudera/parcels/CDH/lib/spark/jars/*,local:/opt/cloudera/parcels/CDH/lib/spark/hive/*,local:/opt/cloudera/parcels/CDH/jars/hive-warehouse-connector-assembly-1.0.0.7.1.4.0-203.jar",
    "spark.sql.extensions": "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension",
    "spark.datasource.hive.warehouse.read.via.llap": "false",
    "spark.sql.hive.hwc.execution.mode": "spark",
    "spark.hadoop.hive.metastore.uris": "thrift://example.com:9083",
    "spark.kryo.registrator": "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator",
    "spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
    ...
  }
},
Note
All paths listed in the above properties must be verified against the CDP environment.
Note
For CDP 7.1.7 SP2 onward, the following principal and keytab should be passed in the Spark properties in trifacta-conf.json:

"spark.yarn.keytab": "/opt/trifacta/<keytab-name>.keytab",
"spark.yarn.principal": "<principal-name>/hostname@example.com",
Property | Description |
---|---|
"spark.yarn.jars" | For the hive-warehouse-connector-assembly JAR file, the path and version information depend on your installation. Please use a value that matches your environment. |
"spark.hadoop.hive.metastore.uris" | This value can be obtained from the following file: /etc/hive/conf/hive-site.xml |
"spark.sql.hive.hiveserver2.jdbc.url" | This value can be obtained from the following file: /etc/hive/conf.cloudera.hive_on_tez/beeline-site.xml |

Update the classpath value for the Spark Job Service:
"spark-job-service": { "classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(sparkBundleJar)s:%(topOfTree)s/%(hadoopBundleJar)s", },
Save your changes and restart the platform.
Note
The Spark version settings in this section do not apply to Databricks, which has a dedicated Spark version property: databricks.sparkVersion
.
For more information, see Configure for AWS Databricks.
For more information, see Configure for Azure Databricks.
The Designer Cloud Powered by Trifacta platform defaults to using Spark 2.3.0. Depending on the version of your Hadoop distribution, you may need to modify the version of Spark that is used by the platform.
In the following table, you can review the Spark/Java version requirements for the Trifacta node hosting Designer Cloud Powered by Trifacta Enterprise Edition.
To change the version of Spark in use by the Designer Cloud Powered by Trifacta platform, you change the value of the spark.version property, as listed below. No additional installation is required.
Note
If you are integrating with an EMR cluster, the version of Spark to configure for use depends on the version of EMR. Additional configuration is required. See Configure for EMR.
Note
The value for spark.version
does not need to be set for Databricks. The version of Spark for Databricks is controlled by a different setting.
Additional requirements:
The supported cluster must use Java JDK 8 or 11.
If the platform is connected to an EMR cluster, you must set the local version of Spark (
spark.version
property) to match the version of Spark that is used on the EMR cluster.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json
. For more information, see Platform Configuration Methods.
Required Java JDK Versions | Java JDK 8 or 11 |
---|---|
Spark for Designer Cloud Powered by Trifacta Enterprise Edition | "spark.version": "2.3.0", |
Required Java JDK Version | Java JDK 8 or 11 |
---|---|
Spark for Designer Cloud Powered by Trifacta Enterprise Edition | "spark.version": "2.4.6", Note: For CDH 6.3.3, please set the Spark version to 2.4.cdh6.3.3.plus. Note: For Spark 2.4.0 and later, please verify that the following is set: "spark.useVendorSparkLibraries": true, This configuration is ignored for EMR and Azure Databricks. |
Required Java JDK Version | Java JDK 8 or 11 |
---|---|
Spark for Designer Cloud Powered by Trifacta Enterprise Edition | "spark.version": "3.0.1", Note Support for this version of Spark is limited. Additional configuration is required. See below. Additional configuration requirements for this version:
|
Required Java JDK Version | Java JDK 8 or 11 |
---|---|
Spark for Designer Cloud Powered by Trifacta Enterprise Edition | "spark.version": "3.2.0", or "spark.version": "3.2.1", Note Support for this version of Spark is limited. Additional configuration is required. See below. Additional configuration requirements for this version:
|
When profiling is enabled for a job, the Spark running environment executes the transformation and profiling tasks of these jobs together by default. This combination of jobs is faster and more efficient than separating the jobs. As needed, these tasks can be separated. The following behaviors vary depending on whether these tasks are separated or combined:
When transform and profiling are combined:
If the transform task succeeds and profiling fails, then the job is shown as failed in the Job Details page. The generated datasets may be available for download.
Since the job failed, no publishing actions are launched.
When transform and profiling are separated:
If the transform task succeeds and profiling fails, then the generated datasets may be available for download.
When the transform task succeeds, any defined publishing actions are launched.
For more information on configuring these options, see Workspace Settings Page.
For more information, see Job Details Page.
For more information on executing jobs on Spark, see Configure Spark Running Environment.
For more information on visual profiling, see Overview of Visual Profiling.
You can restart the platform now. See Start and Stop the Platform.
At this point, you should be able to run a job in the platform, which launches a Spark execution job and a profiling job. Results appear normally in the Trifacta Application.
Steps:
To verify that the Spark running environment is working:
After you have applied changes to your configuration, you must restart services. See Start and Stop the Platform.
Through the application, run a simple job, including visual profiling. Be sure to select Spark as the running environment.
The job should appear as normal in the Job Status page.
To verify that it ran on Spark, open the following file:

/opt/trifacta/logs/batch-job-runner.log

Search the log file for a SPARK JOB INFO block with a timestamp corresponding to your job execution.

See below for information on how to check the job-specific logs.
Review any errors.
For more information, see Verify Operations.
Service logs:
Logs for the Spark Job Service are located in the following location:
/opt/trifacta/logs/spark-job-service.log
Additional log information on the launching of profile jobs is located here:
/opt/trifacta/logs/batch-job-runner.log
Job logs:
When profiling jobs fail, additional log information is written to the following:
/opt/trifacta/logs/jobs/<job-id>/spark-job.log
Below is a list of common errors in the log files and their likely causes.
Whenever a Spark job is executed, it is reported back as having failed. On the cluster, the job appears to have succeeded. However, in the Spark Job Service logs, the Spark Job Service cannot find any of the applications that it has submitted to resource manager.
In this case, the root problem is that Spark is unable to delete temporary files after the job has completed execution. During job execution, a set of ephemeral files may be written to the designated temporary directory on the cluster, which is typically /trifacta/tempfiles
. In most cases, these files are removed transparently to the user.
This location is defined in the hdfs.pathsConfig.tempFiles parameter in
trifacta-conf.json
.
In some cases, those files may be left behind. To account for this accumulation in the directory, the Designer Cloud Powered by Trifacta platform performs a periodic cleanup operation to remove temp files that are over a specified age.
The age in days is defined in the job.tempfiles.cleanup.age parameter in trifacta-conf.json.
This cleanup operation can fail if HDFS is configured to send Trash to an encrypted zone. The HDFS API does not support the skipTrash
option, which is available through the HDFS CLI. In this scenario, the temp files are not successfully removed, and the files continue to accumulate without limit in the temporary directory. Eventually, this accumulation of files can cause the Spark Job Service to crash with Out of Memory errors.
The following are possible solutions:
Solution 1: Configure HDFS to use an unencrypted zone for Trash files.
Solution 2:

Disable temp file cleanup in trifacta-conf.json:

"job.tempfiles.cleanup.age": 0,

Then clean up the tempfiles directory using an external process.
Spark job service fails to start with an error similar to the following in the spark-job-service.log
file:
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException: Jackson version is too old 2.5.3
Some versions of the hadoopBundleJar contain older versions of the Jackson dependencies, which break the spark-job-service.
To ensure that the spark-job-service is provided the correct Jackson dependency versions, the sparkBundleJar
must be listed before the hadoopBundleJar
in the spark-job-service.classpath
, which is inserted as a parameter in trifacta-conf.json
. Example:
"spark-job-service.classpath": "%(topOfTree)s/services/spark-job-server/server/build/install/server/lib/*:%(topOfTree)s/conf/hadoop-site/:%(topOfTree)s/%(sparkBundleJar)s:%(topOfTree)s/%(hadoopBundleJar)s"
Spark jobs may fail with the following error in the YARN application logs:
ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalArgumentException: Unknown message type: -22 at org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:67)
This problem may occur if Spark authentication is disabled on the Hadoop cluster but enabled in the Designer Cloud Powered by Trifacta platform. Spark authentication must match on the cluster and the platform.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Locate the spark.props entry. Insert the following property and value:

"spark.authenticate": "false"
Save your changes and restart the platform.
When Spark authentication is enabled on the Hadoop cluster, Spark jobs can fail. The YARN log file message looks something like the following:
17/09/22 16:55:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, example.com, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?) at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)
When Spark authentication is enabled on the Hadoop cluster, the Designer Cloud Powered by Trifacta platform must also be configured with Spark authentication enabled.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Inside the spark.props entry, insert the following property and value:

"spark.authenticate": "true"
Save your changes and restart the platform.
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log
:
Required executor memory (6144+614 MB) is above the max threshold (1615 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
The per-container memory allocation in Spark (spark.executor.memory
and spark.driver.memory
) must not exceed the YARN thresholds. See Spark tuning properties above.
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log
file:
Job submission failed akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://SparkJobServer/user/ProfileLauncher#1213485950]] after [20000 ms]
There is a 20-second timeout on the attempt to submit a profiling job to YARN. If the initial upload of the Spark libraries to the cluster takes longer than 20 seconds, the Spark Job Service times out and returns an error to the UI. However, the libraries do finish uploading successfully to the cluster.
The library upload is a one-time operation for each install/upgrade. Despite the error, the libraries are uploaded successfully the first time. This error does not affect subsequent profiler job runs.
Solution:
Try running the job again.
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log
file:
java.lang.ClassNotFoundException: com.trifacta.jobserver.profiler.Profiler
By default, the Spark job service attempts to optimize the distribution of the Spark JAR files across the cluster. This optimization involves a one-time upload of the spark-assembly and profiler-bundle JAR files to HDFS. Then, YARN distributes these JARs to the worker nodes of the cluster, where they are cached for future use.
In some cases, the localized JAR files can get corrupted on the worker nodes, causing this ClassNotFound error to occur.
Solution:
The solution is to disable this optimization through platform configuration.
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Locate the spark-job-service configuration node. Set the following property to false:

"spark-job-service.optimizeLocalization": false
Save your changes and restart the platform.
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log
file:
Exception in thread "LeaseRenewer:trifacta@nameservice1" java.lang.OutOfMemoryError: PermGen space
Solution:
The solution is to configure the PermGen space for the Spark driver:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Locate the spark configuration node. Set the following property to the given value:
"spark.props.spark.driver.extraJavaOptions" : "-XX:MaxPermSize=1024m -XX:PermSize=256m",
Save your changes and restart the platform.
When executing a job through Spark, the job may fail with the following error in the spark-job-service.log
file:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token x for trifacta) can't be found in cache
Solution:
The solution is to set Spark impersonation to true
:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Locate the spark-job-service configuration node. Set the following property to the given value:
"spark-job-service.sparkImpersonationOn" : true,
Save your changes and restart the platform.
Issue:
Spark fails with an error similar to the following in the spark-job-service.log:
"Job aborted due to stage failure: Total size of serialized results of 208 tasks (1025.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"
Explanation:
The spark.driver.maxResultSize
parameter determines the limit of the total size of serialized results of all partitions for each Spark action (e.g. collect). If the total size of the serialized results exceeds this limit, the job is aborted.
To enable serialized results of unlimited size, set this parameter to zero (0
).
Solution:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

In the spark.props section of the file, remove the size limit by setting this value to zero:

"spark.driver.maxResultSize": "0"
Save your changes and restart the platform.
Issue:
Spark job fails with an error similar to the following in either the spark-job.log
or the yarn-app.log
file:
"java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState'"
Explanation:
By default, the Spark running environment attempts to connect to Hive when it creates the Spark Context. This connection attempt may fail if Hive connectivity (in conf/hadoop-site/hive-site.xml
) is not configured correctly on the Trifacta node.
Solution:
This issue can be fixed by configuring Hive connectivity on the edge node.
If Hive connectivity is not required, the Spark running environment's default behavior can be changed as follows:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

In the spark-job-service section of the file, disable Hive connectivity by setting this value to false:

"spark-job-service.enableHiveSupport": false
Save your changes and restart the platform.
Issue:
Spark job fails with an error similar to the following in either the spark-job.log
or the yarn-app.log
file:
java.io.FileNotFoundException: No Avro files found. Hadoop option "avro.mapred.ignore.inputs.without.extension" is set to true. Do all input files have ".avro" extension?
Explanation:
By default, Spark-Avro requires all Avro files to have the .avro extension, which includes all part files in a source directory. Spark-Avro ignores any files that do not have the .avro extension.
If a directory contains part files without an extension (e.g. part-00001, part-00002), Spark-Avro ignores these files and throws the "No Avro files found" error.
Solution:
This issue can be fixed by setting the spark.hadoop.avro.mapred.ignore.inputs.without.extension
property to false
:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

To the spark.props section of the file, add the following setting if it does not already exist. Set its value to false:

"spark.hadoop.avro.mapred.ignore.inputs.without.extension": "false"
Save your changes and restart the platform.
Issue:
After you have submitted a job to be executed on the Spark cluster, the job may fail in the Designer Cloud Powered by Trifacta platform after 30 minutes. However, on a busy cluster, the job remains enqueued and is eventually collected and executed. Since the job was canceled in the platform, results are not returned.
Explanation:
This issue is caused by a timeout setting for Batch Job Runner, which cancels management of jobs after a predefined number of seconds. Since these jobs are already queued on the cluster, they may be executed independent of the platform.
Solution:
This issue can be fixed by increasing the Batch Job Runner Spark timeout setting:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Locate the following property. By default, it is set to 172800, which is 48 hours:

"batchserver.spark.progressTimeoutSeconds": 172800,
If your value is lower than the default, increase it until it is high enough for your job to succeed.
Save your changes and restart the platform.
Re-run the job.