Databricks Tables Connections
You can create a connection to Databricks Tables from the Designer Cloud Powered by Trifacta platform.
This section describes how to create connections of this type.
Databricks Tables provides a JDBC-based interface for reading and writing datasets in ADLS or WASB. Using the underlying JDBC connection, you can access your ADLS or WASB data like a relational datastore, run jobs against it, and write results back to the datastore as JDBC tables.
Your connection to Databricks Tables leverages the SSO authentication that is native to Databricks.
For more information on Azure Databricks Tables, see https://docs.microsoft.com/en-us/azure/databricks/data/tables.
Supported Environments:
Operation | Designer Cloud Powered by Trifacta Enterprise Edition | Amazon | Microsoft Azure |
---|---|---|---|
Read | Not supported | Supported | Supported |
Write | Not supported | Supported | Supported |
Limitations
Ad-hoc publishing of generated results to Databricks Tables is not supported.
Integration with Kerberos or secure impersonation is not supported.
Some table types and publishing actions are not supported.
Access to external Hive metastores is not supported.
Prerequisites
Azure: The Designer Cloud Powered by Trifacta platform must be installed on Azure and integrated with an Azure Databricks cluster.
See Install for Azure.
See Configure for Azure Databricks.
Note
For job execution on Spark, the connection must use the Spark instance on the Azure Databricks cluster. No other Spark instance is supported. You can also run jobs from this connection through the Photon running environment. For more information, see Running Environment Options.
AWS: The Designer Cloud Powered by Trifacta platform must be installed on AWS and integrated with an AWS Databricks cluster.
See Install for AWS.
See Configure for AWS Databricks.
Note
For job execution on Spark, the connection must use the Spark instance on the AWS Databricks cluster. No other Spark instance is supported. You can also run jobs from this connection through the Photon running environment. For more information, see Running Environment Options.
This connection interacts with Databricks Tables through the Hive metastore that has been installed in the Databricks cluster.
Note
External Hive metastores are not supported.
Insert Databricks Access Token
Each user must insert a Databricks Personal Access Token into the user profile. For more information, see Databricks Settings Page.
Enable
To enable Databricks Tables connections, please complete the following:
Note
Typically, you need only one connection to Databricks Tables, although you can create multiple connections.
Note
This connection is created with SSL automatically enabled.
Tip
Your SSL certificate should be created using 2048-bit or larger keys. In some environments, such as RHEL 8.x, 1024-bit keys are no longer accepted under the default cryptographic policies.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the following parameter and set it to true:
"feature.databricks.connection.enabled": true,
3. To allow for direct publishing of job results to Databricks tables from the Run Job page, you must enable the following parameters. For more information on these settings, see Databricks Tables Table Settings.
Parameter | Description |
---|---|
feature.databricks.enableDeltaTableWrites | Set this value to true to enable users to choose to write generated results to Databricks delta tables from the Run Job page. |
feature.databricks.enableExternalTableWrites | Set this value to true to enable users to choose to write generated results to Databricks external tables from the Run Job page. |
4. Save your changes and restart the platform.
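Taken together, a minimal sketch of the relevant entries as they might appear in trifacta-conf.json is shown below. The surrounding structure of your configuration file may differ, so treat this as illustrative rather than exact:
"feature.databricks.connection.enabled": true,
"feature.databricks.enableDeltaTableWrites": true,
"feature.databricks.enableExternalTableWrites": true,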
Create Connection
This connection can also be created via API. For details on values to use when creating via API, see Connection Types.
Please create a Databricks connection and then specify the following properties with the listed values:
Note
Host and port number connection information is taken from Databricks and does not need to be re-entered here.
Property | Description |
---|---|
Connect String options | Please insert any connection string options that you need. Connect String options are not required for this connection. |
Test Connection | Click this button to test the specified connection. |
Default Column Data Type Inference | Set to |
Connection URL
The properties that you provide are inserted into the following URL, which connects Designer Cloud Powered by Trifacta Enterprise Edition to the connection:
jdbc:spark://<host>:<port>/<database><connect-string-options>
The Connection URL is mostly built up automatically using cluster configuration for the platform.
The connect string options are optional. If you are passing additional properties and values to complete the connection, the connect string options must be structured in the following manner:
;<prop1>=<val1>;<prop2>=<val2>...
where:
<prop>: the name of the property
<val>: the value for the property
delimiters:
; : any set of connect string options must begin and end with a semi-colon, although the semi-colon at the end of the options can be omitted.
= : property names and values must be separated with an equal sign (=).
To enable the use of the HTTP protocol, specify the following in the connect string options:
;transportMode=http;
To enable the use of SSL for the connection, specify the following in the connect string options:
;ssl=1;
When HTTP is enabled, you can specify the path as a connect string option:
;httpPath=sql/protocolv1/o/0/xxxx-xxxxxx-xxxxxxxx;
You can specify a Databricks personal access token to use when authenticating to the database using the following connect string options.
;AuthMech=3;UID=token;PWD=<Databricks-personal-access-token>
where:
<Databricks-personal-access-token>: the personal access token of the user who is connecting to the database.
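For example, a connect string that combines these options to enable HTTP transport, SSL, an HTTP path, and token-based authentication might look like the following. The httpPath value and the personal access token are placeholders that you must replace with your own values:
;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/0/xxxx-xxxxxx-xxxxxxxx;AuthMech=3;UID=token;PWD=<Databricks-personal-access-token>;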
Driver Information
This connection uses the following driver:
Driver name: com.simba.spark.jdbc41.Driver
Driver version: com.simba.jdbc:SparkJDBC41:2.6.11.1014
Driver documentation: https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html
Data Conversion
For more information on how values are converted during input and output with this database, see Databricks Tables Data Type Conversions.
Create via API
API: API Reference
Type: jdbc
Vendor: databricks
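As an illustrative sketch only, a request body for creating this connection through the API would include the type and vendor values listed above. The field names shown here are assumptions; consult the API Reference and Connection Types for the exact schema and any additional required fields:
{
  "name": "databricks_tables",
  "type": "jdbc",
  "vendor": "databricks"
}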
Troubleshooting
For more information on error messages for this connection type, see https://kb.databricks.com/bi/jdbc-odbc-troubleshooting.html.
Failure when importing wide Databricks Tables table
If you are attempting to import a table containing a large number of columns (>200), you may encounter an error message similar to the following:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 408.0 failed 4 times, most recent failure: Lost task 0.3 in stage 408.0 (TID 1342, 10.139.64.11, executor 11): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1426050. To avoid this, increase spark.kryoserializer.buffer.max value.
The problem is that the serializer ran out of memory.
Solution:
To address this issue, you can increase the Kryo serializer buffer size.
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.
2. Locate the spark.props section and add the following setting. Modify 2000 (2 GB) depending on whether your import is successful:
"spark.kryoserializer.buffer.max.mb": "2000"
3. Save your changes and restart the platform.
4. Attempt to import the dataset again. If it fails, you can try incrementally raising the above value.
For more information on passing property values into Spark, see Configure for Spark.
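For reference, a minimal sketch of how this setting could sit inside the spark.props section is shown below. This assumes that spark.props is represented as a JSON object in your configuration; leave any existing properties in that section in place:
"spark.props": {
    "spark.kryoserializer.buffer.max.mb": "2000"
}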
Using Databricks Table Connections
Uses of Databricks tables
The Designer Cloud Powered by Trifacta platform can use Databricks Tables for the following tasks:
Create datasets by reading from Databricks Tables tables.
Write data to Databricks Tables.
Table Type | Support | Notes |
---|---|---|
Databricks managed tables | Read/Write | |
Delta tables | Read/Write | Note: Versioning and rollback of Delta tables are not supported within the Designer Cloud Powered by Trifacta platform. The latest version is always used. You must use external tools to manage versioning and rollback. |
External tables | Read/Write | Note: When writing to an external table, the TRUNCATE and DROP publishing actions are not supported. |
Databricks unmanaged tables | Read/Write | |
Delta Tables (managed and unmanaged tables) | Read/Write | |
Partitioned tables | Read | |
The underlying format for Databricks Tables is Parquet.
Before you begin using Databricks tables
Databricks Tables deployment: Your Alteryx administrator must enable use of Databricks Tables.
Databricks Personal Access Token: You must acquire and save a Databricks Personal Access Token into your Alteryx account. For more information, see Databricks Settings Page.
Storing data in Databricks tables
Note
The Designer Cloud Powered by Trifacta platform does not modify source data in Databricks Tables. Datasets sourced from Databricks Tables are read without modification from their source locations.
Reading from Databricks Tables
You can create an Alteryx dataset from a table or view stored in Databricks Tables.
Read support is also available for Databricks Delta Lake.
Note
Custom SQL queries are supported. Multi-statement custom SQL is not supported for Databricks Tables. Custom SQL queries must be a single SELECT
statement. For more information, see Create Dataset with SQL.
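For example, a valid custom SQL query for this connection is a single SELECT statement such as the following, where the database and table names are placeholders for your own objects:
SELECT * FROM mydatabase.mytable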
For more information on how data types are imported from Databricks Tables, see Databricks Tables Data Type Conversions.
Writing to Databricks Tables
You can write data back to Databricks Tables using one of the following methods:
Job results can be written directly to Databricks Tables as part of the normal job execution.
Data is written as a managed table to DBFS in Parquet format.
Create a new publishing action to write to Databricks Tables. See Run Job Page.
For more information on how data is converted to Databricks Tables, see Databricks Tables Data Type Conversions.
Ad-hoc Publishing to Databricks Tables
Not supported.
Reference
Supported Versions: n/a
Supported Environments:
Operation | Designer Cloud Powered by Trifacta Enterprise Edition | Amazon | Microsoft Azure |
---|---|---|---|
Read | Not supported | Supported | Supported |
Write | Not supported | Supported | Supported |
Tip
It's easier to create a connection of this type through the UI. Typically, only one connection is needed.