Apache Spark on Microsoft Azure HDInsight
Use these instructions to learn how to connect to Microsoft Azure HDInsight and create an Alteryx connection string.
Type of Support: | In-Database |
Validated On: | Apache Spark 2.0+ |
Distributions Validated On: | Microsoft Azure HDInsight |
Connection Type: | REST/HTML server |
Server Details: | Microsoft Azure information can be found here. |
Alteryx tools used to connect
- Connect In-DB Tool, Data Stream In Tool, and Apache Spark Code Tool (in-database workflow processing)
Additional Details
In the Microsoft Azure HDInsight Connection window, create a new connection using the Microsoft Azure HDInsight option. Use the instructions below to configure the connection.
Configure the Microsoft Azure HDInsight Connection window
To connect to Microsoft Azure HDInsight and create an Alteryx connection string:
- Add a new In-DB connection, setting Data Source to Apache Spark on Microsoft Azure HDInsight. For more information on setting up an In-DB connection, see Connect In-DB Tool.
- On the Read tab, the Driver is set to Apache Spark on Microsoft Azure HDInsight. Click the Connection String drop-down arrow and select New database connection.
- Configure the Microsoft Azure HDInsight Connection window.
Microsoft Azure HDInsight Configuration:
- Configure the Azure URL.
- Type or paste the Azure URL for your Microsoft Azure HDInsight connection. Example: https://<clustername>.azurehdinsight.net/.
- Type the User Name and Password associated with the connection.
- Contact your administrator for the user name and password of the cluster administrator user that was configured during setup of your Microsoft Azure HDInsight cluster.
- Select the Apache Spark Version used on your cluster.
- Click Test to test the connection.
- Set the Connection Mode to the coding language to use in the Apache Spark Code tool.
- Connect to your Microsoft Azure storage account.
- Enter the Storage URL for the storage (for example, Microsoft Azure Blob Storage, Microsoft Azure Data Lake Storage, or other primary storage) you want to use with your connection. The HTTPS protocol is required for this URL.
- Enter the Tenant ID GUID. In Microsoft Azure, find it under Microsoft Azure Active Directory > Properties > Directory ID.
- Enter the Client ID. In Microsoft Azure, this information is also known as an Application ID. Find it under Microsoft Azure Active Directory > App registrations. For more information, see the Get application ID and authentication key page in the Microsoft documentation.
- Enter the Client Secret. In Microsoft Azure, this is an authentication key string generated for the Application ID. For more information, see the Microsoft Azure page Integrating applications with Azure Active Directory.
- Set the Poll Interval (ms), the time Alteryx waits between checks on Apache Spark code execution requests. The default is 1,000 ms, or 1 second.
- Set the Wait Time (ms), the maximum time that Alteryx waits for an execution request to complete. Operations that take longer than the set wait time result in a timeout error. The default is 60,000 ms, or 1 minute.
- The Apache Spark Configuration Options customize the created Apache Spark context, and allow advanced users to override the default Apache Spark settings.
Configuration default
By default, the Configuration Option is spark.jars.packages and the Value is com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1. Depending on your Apache Spark version, you may need to override the default value.
Apache Spark version | Value |
---|---|
2.0 - 2.1 | com.databricks:spark-avro_2.11:3.2.0,com.databricks:spark-csv_2.11:1.5.0 |
2.2 | com.databricks:spark-avro_2.11:4.0.0,com.databricks:spark-csv_2.11:1.5.0 |
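The version-to-value mapping in the table can be sketched as a small lookup. This is an illustration only, not part of Alteryx: the function name `packages_value` and the fallback to the default for versions not listed in the table are assumptions. The package coordinates are copied from this page and joined with commas, as in the default value shown above (Apache Spark documents spark.jars.packages as a comma-separated list).

```python
# Hypothetical helper: choose a spark.jars.packages value for a Spark version.
# Coordinates are copied from this page; the names and the fallback behavior
# are assumptions made for illustration.

DEFAULT_PACKAGES = [
    "com.databricks:spark-csv_2.10:1.5.0",
    "com.databricks:spark-avro_2.10:2.0.1",
]

# Overrides keyed by (major, minor) Spark version, per the table.
OVERRIDES = {
    (2, 0): ["com.databricks:spark-avro_2.11:3.2.0",
             "com.databricks:spark-csv_2.11:1.5.0"],
    (2, 1): ["com.databricks:spark-avro_2.11:3.2.0",
             "com.databricks:spark-csv_2.11:1.5.0"],
    (2, 2): ["com.databricks:spark-avro_2.11:4.0.0",
             "com.databricks:spark-csv_2.11:1.5.0"],
}


def packages_value(spark_version: str) -> str:
    """Return the Value to enter for the spark.jars.packages option."""
    parts = spark_version.split(".")
    if len(parts) < 2:
        raise ValueError("expected a version such as '2.1' or '2.2.0'")
    key = (int(parts[0]), int(parts[1]))
    # Assumption: versions not listed in the table keep the default value.
    packages = OVERRIDES.get(key, DEFAULT_PACKAGES)
    return ",".join(packages)


print(packages_value("2.2"))
```

Paste the printed string into the Value column next to spark.jars.packages. Check the result against your cluster before relying on it, because the correct packages depend on the Scala build of your Apache Spark installation.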
- Click (+ icon) to add another row to the configuration options table.
- Click (save icon) to save the current advanced settings as a JSON file. The file can then be loaded into the advanced settings of another connection.
- Click (open icon) to load a JSON file into the configuration options table.
- Click OK to create your Apache Spark on Microsoft Azure HDInsight connection.
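To see how the Poll Interval (ms) and Wait Time (ms) settings interact, consider the following minimal sketch of a generic poll-until-timeout loop. It is not Alteryx's implementation: the function `wait_for_completion` and the `is_done` callback are hypothetical names, and only the two default values (1,000 ms and 60,000 ms) come from this page.

```python
import time
from typing import Callable


def wait_for_completion(
    is_done: Callable[[], bool],
    poll_interval_ms: int = 1000,   # default Poll Interval from this page
    wait_time_ms: int = 60000,      # default Wait Time from this page
) -> int:
    """Check is_done repeatedly; return the number of checks performed.

    Raises TimeoutError if the request has not completed once
    wait_time_ms has elapsed.
    """
    deadline = time.monotonic() + wait_time_ms / 1000.0
    checks = 0
    while True:
        checks += 1
        if is_done():
            return checks
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"request did not complete within {wait_time_ms} ms"
            )
        # Pause for one poll interval before the next check.
        time.sleep(poll_interval_ms / 1000.0)


# Simulated request that reports completion on the third check.
states = iter([False, False, True])
print(wait_for_completion(lambda: next(states),
                          poll_interval_ms=10, wait_time_ms=1000))
```

In a loop like this, a shorter poll interval notices completion sooner at the cost of more frequent checks, while the wait time caps how long a slow operation can run before it is reported as a timeout. If long-running Apache Spark jobs time out, raising the Wait Time is the setting to try first.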