The Spark Code tool is a code editor that creates a Spark context and executes Spark commands directly from Designer.
For additional information, see Spark Direct.
Connect to your Spark cluster with the Connect In-DB tool, or connect directly from the Spark Code tool.
Both methods bring up the Manage In-DB Connections window.
Add a new In-DB connection, setting Data Source to Spark Direct.
For more information on setting up an In-DB connection, see Connect In-DB Tool.
On the Read tab, Driver is locked to Spark Direct. Click the Connection String drop-down arrow and select New database connection.
Configure the Livy Connection window.
Livy Server Configuration: Select your security preference:
Type or paste the Host IP Address or DNS name of the Livy node within your Spark cluster.
Type the Port used by Livy. The default port is 8998.
Optionally, type a User Name to enable user impersonation; this is the name Spark uses when running jobs.
Type or paste the URL of your Knox gateway.
Type the User Name and Password associated with the specified gateway.
Optionally, test the connection.
Set the Connection Mode to the coding language to use in the Spark Code tool.
Select the Server Configuration option that matches the HDFS protocol used to communicate with the cluster.
For either HDFS protocol option, type the Host IP Address or DNS name of the HDFS name node within your Spark cluster, and type the Port number; the default port for the selected protocol is populated automatically.
For a Knox gateway, type or paste the URL of your Knox gateway.
Optionally type the Username for the HDFS connection.
Optionally type the Password for the HDFS connection.
Select the Kerberos protocol to use.
Set the Poll Interval (ms), how often Alteryx checks the status of Spark code execution requests. The default is 1,000 ms (1 second).
Set the Wait Time (ms), how long Alteryx waits for an execution request to complete. Operations that take longer than the wait time result in a timeout error. The default is 60,000 ms (1 minute).
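The interaction between these two settings can be sketched as a simple polling loop. The Python below is a minimal illustration, not Alteryx's actual implementation; the helper and its parameter names are hypothetical:

```python
import time

# Minimal sketch of the Poll Interval / Wait Time behavior described above:
# check for completion every poll_interval_ms, and give up with a timeout
# once wait_time_ms has elapsed. (Hypothetical helper, not Alteryx code.)
def wait_for_completion(is_done, poll_interval_ms=1000, wait_time_ms=60000):
    deadline = time.monotonic() + wait_time_ms / 1000.0
    while time.monotonic() < deadline:
        if is_done():          # has the Spark execution request finished?
            return True
        time.sleep(poll_interval_ms / 1000.0)
    return False               # exceeded Wait Time: report a timeout error
```

With the defaults, the request status is checked once per second, and a request still unfinished after one minute is reported as timed out.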
The Spark Configuration Options customize the Spark context that is created and let advanced users override the default Spark settings.
By default, the Configuration Option is spark.jars.packages and the Value is com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1. Depending on your Spark version, you may need to override the default Value.
Spark version | Value
---|---
2.0 - 2.1 | com.databricks:spark-avro_2.11:3.2.0,com.databricks:spark-csv_2.11:1.5.0
2.2 | com.databricks:spark-avro_2.11:4.0.0,com.databricks:spark-csv_2.11:1.5.0
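For example, on a Spark 2.2 cluster the override row would contain the values below (coordinates taken from the table above; spark.jars.packages expects a comma-separated list of Maven coordinates):

```
Configuration Option: spark.jars.packages
Value: com.databricks:spark-avro_2.11:4.0.0,com.databricks:spark-csv_2.11:1.5.0
```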
Click the + icon to add another row to the configuration options table.
Click the save icon to save the current advanced settings as a JSON file. The file can then be loaded into the advanced settings of another connection.
Click the open icon to load a JSON file into the configuration options table.
Select OK to create your Spark Direct connection.
With a Spark Direct connection established, the Code Editor activates.
Use Insert Code to generate template functions in the code editor.
In Scala, Import Library creates an import statement.
import package
Read Data creates a readAlteryxData function to return the incoming data as a SparkSQL DataFrame.
val dataFrame = readAlteryxData(1)
Write Data creates a writeAlteryxData function to output a SparkSQL DataFrame.
writeAlteryxData(dataFrame, 1)
Log Message creates a logAlteryxMessage function to write a string to the log as a message.
logAlteryxMessage("Example message")
Log Warning creates a logAlteryxWarning function to write a string to the log as a warning.
logAlteryxWarning("Example warning")
Log Error creates a logAlteryxError function to write a string to the log as an error.
logAlteryxError("Example error")
In Python, Import Library creates an import statement.
from module import library
Read Data creates a readAlteryxData function to return the incoming data as a SparkSQL DataFrame.
dataFrame = readAlteryxData(1)
Write Data creates a writeAlteryxData function to output a SparkSQL DataFrame.
writeAlteryxData(dataFrame, 1)
Log Message creates a logAlteryxMessage function to write a string to the log as a message.
logAlteryxMessage("Example message")
Log Warning creates a logAlteryxWarning function to write a string to the log as a warning.
logAlteryxWarning("Example warning")
Log Error creates a logAlteryxError function to write a string to the log as an error.
logAlteryxError("Example error")
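Put together, a Python Spark Code tool body typically chains these templates: read, transform, write, log. The sketch below defines throwaway stand-ins for the Alteryx-provided helpers so it can run outside Designer; inside the tool these functions are supplied for you and operate on SparkSQL DataFrames, not lists:

```python
# Hypothetical stand-ins: inside the Spark Code tool these helpers already
# exist, and readAlteryxData returns a SparkSQL DataFrame rather than a list.
def readAlteryxData(connection_number):
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def writeAlteryxData(data, connection_number):
    return data

def logAlteryxMessage(message):
    print(message)

# Typical tool body: read input connection 1, transform, write to output 1.
dataFrame = readAlteryxData(1)
result = [row for row in dataFrame if row["value"] > 10]
writeAlteryxData(result, 1)
logAlteryxMessage("Wrote %d rows" % len(result))
```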
In R, Import Library creates an import statement.
library(jsonlite)
Read Data creates a readAlteryxData function to return the incoming data as a SparkSQL DataFrame.
dataFrame <- readAlteryxData(1)
Write Data creates a writeAlteryxData function to output a SparkSQL DataFrame.
writeAlteryxData(dataFrame, 1)
Log Message creates a logAlteryxMessage function to write a string to the log as a message.
logAlteryxMessage("Example message")
Log Warning creates a logAlteryxWarning function to write a string to the log as a warning.
logAlteryxWarning("Example warning")
Log Error creates a logAlteryxError function to write a string to the log as an error.
logAlteryxError("Example error")
Use Import Code to pull in code created externally.
Click the gear icon to change cosmetic aspects of the code editor.
Select the output channel metainfo you want to manage.
Manually change the Spark Data Type of existing data.
Click the plus icon to add a data row.
©2018 Alteryx, Inc., all rights reserved. Allocate®, Alteryx®, Guzzler®, and Solocast® are registered trademarks of Alteryx, Inc.