Apache Spark Direct

Version: 2019.3
Last modified: October 07, 2019
Connection Type

REST/HTML server

Distributions Validated On

Hortonworks 2.6+; Cloudera 5.7+

Server Details

Apache Livy download information can be found here.

Type of Support

In-Database

Validated On

Apache Livy 0.3; Apache Spark 1.6, 2.0, 2.1, and 2.2

Alteryx tools used to connect

In-database workflow processing

Connect In-DB Tool

Data Stream In Tool

Apache Spark Code Tool

Connect to Apache Spark by dragging a Connect In-DB tool or the Apache Spark Code tool onto the canvas. Create a new Livy connection using the Apache Spark Direct driver. Use the instructions below to configure the connection.

Configure the Livy Connection window

To connect to Livy Server and create an Alteryx connection string:

Add a new In-DB connection, setting Data Source to Apache Spark Direct. For more information on setting up an In-DB connection, see Connect In-DB Tool.

On the Read tab, Driver is locked to Apache Spark Direct. Click the Connection String drop-down arrow and select New database connection.

Configure the Livy Connection window.

Livy Server Configuration

Select your security preference:

None
  • Type or paste the Host IP Address or DNS name of the Livy node within your Apache Spark cluster.
  • Type the Port used by Livy. The default port is 8998.
  • Optionally provide a User Name for user impersonation. Apache Spark runs jobs under this name.
Knox
  • Type or paste the URL of your Knox gateway.
  • Type the User Name and Password associated with the specified gateway.

Optionally test the connection:

  • Select the Apache Spark Version used on your cluster.
  • Select the Kerberos connection type.
  • Click Test.

Set the Connection Mode to the coding language to use in the Apache Spark Code tool.
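The Livy settings above map naturally onto a request to Livy's REST interface. The sketch below is illustrative only: it assumes the unsecured (None) option, a hypothetical host name, and Livy's documented POST /sessions body fields (kind, proxyUser, conf). This page does not describe how Alteryx itself issues the request, so treat the function names as hypothetical helpers.

```python
import json


def livy_base_url(host, port=8998, scheme="http"):
    """Build the Livy endpoint from the Host and Port fields (8998 is the default port)."""
    return f"{scheme}://{host}:{port}"


def session_request(kind="pyspark", proxy_user=None, conf=None):
    """Build a JSON body for Livy's POST /sessions call.

    kind       -- session language, loosely corresponding to Connection Mode
    proxy_user -- optional impersonation name (the User Name field)
    conf       -- optional Spark configuration overrides
    """
    body = {"kind": kind}
    if proxy_user:
        body["proxyUser"] = proxy_user
    if conf:
        body["conf"] = dict(conf)
    return body


# Hypothetical host and user, for illustration only.
url = livy_base_url("livy.example.com") + "/sessions"
payload = json.dumps(session_request(kind="pyspark", proxy_user="analyst"))
print(url)
print(payload)
```

Nothing here contacts a server. Sending the payload would need an HTTP client and a reachable Livy node.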

HDFS Connection

Select the Server Configuration option that matches the HDFS protocol used to communicate with the cluster.

HTTPFS

  • Type the Host IP Address or DNS name for the HDFS name node within your Apache Spark cluster.
  • Type the Port number. The default port will be populated automatically.

WebHDFS

  • Type the Host IP Address or DNS name for the HDFS name node within your Apache Spark cluster.
  • Type the Port number. The default port will be populated automatically.

Knox Gateway

Type or paste the URL of your Knox gateway.

Optionally type the Username for the HDFS connection.

Optionally type the Password for the HDFS connection.

Select the Kerberos protocol to use.
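To show what the HDFS host and port settings amount to, here is a minimal sketch that builds a WebHDFS REST URL. HttpFS exposes the same /webhdfs/v1 path. The host name, user name, and port are assumptions: 50070 is a common Hadoop 2 name node HTTP port, not a value taken from this page, and `webhdfs_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op="LISTSTATUS", user=None, scheme="http"):
    """Build a WebHDFS REST URL for an HDFS path.

    op   -- WebHDFS operation name, e.g. LISTSTATUS or GETFILESTATUS
    user -- optional user name for simple (non-Kerberos) authentication
    """
    params = {"op": op}
    if user:
        params["user.name"] = user
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


# Hypothetical name node and user, for illustration only.
print(webhdfs_url("namenode.example.com", 50070, "/tmp", user="analyst"))
```

Opening a URL like this in a browser or HTTP client is a quick way to confirm that the name node and port are reachable before saving the connection.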

Advanced Options

Set the Poll Interval (ms), the time between Alteryx checks on the status of Apache Spark code execution requests. The default is 1,000 ms, or 1 second.

Set the Wait Time (ms), the time that Alteryx waits for execution requests to complete. Operations that take longer than the set wait time result in a timeout error. The default is 60,000 ms, or 1 minute.
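The interaction between the two settings can be pictured as a polling loop. This is a sketch of the general pattern, not Alteryx's actual implementation: `wait_for_result` and `check_state` are hypothetical names, and the state strings follow Livy's statement states (waiting, running, available, error, cancelled).

```python
import time


def wait_for_result(check_state, poll_interval_ms=1000, wait_time_ms=60000):
    """Poll check_state() until it reports a finished state or the wait time expires.

    check_state is any callable that returns the current state as a string.
    """
    deadline = time.monotonic() + wait_time_ms / 1000.0
    while True:
        state = check_state()
        if state in ("available", "error", "cancelled"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {wait_time_ms} ms")
        # Pause for one poll interval before checking again.
        time.sleep(poll_interval_ms / 1000.0)


# Simulated request that finishes on the third check.
states = iter(["waiting", "running", "available"])
print(wait_for_result(lambda: next(states), poll_interval_ms=10, wait_time_ms=1000))
```

A shorter poll interval notices completion sooner at the cost of more requests. A longer wait time tolerates slow jobs but delays the timeout error when something has stalled.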

The Apache Spark Configuration Options customize the created Apache Spark context, and allow advanced users to override the default Apache Spark settings.

By default, the Configuration Option is spark.jars.packages and the Value is com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1. Depending on your Apache Spark version, you may need to override the default value.

Apache Spark version    Value
2.0 - 2.1               com.databricks:spark-avro_2.11:3.2.0;com.databricks:spark-csv_2.11:1.5.0
2.2                     com.databricks:spark-avro_2.11:4.0.0;com.databricks:spark-csv_2.11:1.5.0
  • Click (+ icon) to add another row to the configuration options table.
  • Click (save icon) to save the current advanced settings as a JSON file. The file can then be loaded into the advanced settings of another connection.
  • Click (open icon) to load a JSON file into the configuration options table.
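The version-dependent override can be expressed as a small lookup. The sketch below copies the package strings verbatim from this page. Two things are assumptions: that versions not listed in the table keep the default value, and the JSON layout printed at the end, which is only illustrative because the format of the file Alteryx saves is not documented here.

```python
import json

# Default value shown in the connection window (copied verbatim from this page).
DEFAULT_PACKAGES = (
    "com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1"
)

# Override values copied verbatim from the table above, separators included.
OVERRIDES = {
    "2.0": "com.databricks:spark-avro_2.11:3.2.0;com.databricks:spark-csv_2.11:1.5.0",
    "2.1": "com.databricks:spark-avro_2.11:3.2.0;com.databricks:spark-csv_2.11:1.5.0",
    "2.2": "com.databricks:spark-avro_2.11:4.0.0;com.databricks:spark-csv_2.11:1.5.0",
}


def spark_options(spark_version):
    """Return the configuration option and value for a given Apache Spark version.

    Assumption: versions not listed in OVERRIDES keep the default value.
    """
    return {"spark.jars.packages": OVERRIDES.get(spark_version, DEFAULT_PACKAGES)}


# Hypothetical layout; not the file format Alteryx actually writes.
print(json.dumps(spark_options("2.2"), indent=2))
```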

Select OK to create your Apache Spark Direct connection.
