Configure JDBC Ingestion

This section describes configuration options for JDBC (relational) ingestion, which supports faster execution of JDBC-based jobs.

Data ingestion works by streaming a JDBC source into a temporary storage space in the base storage layer to stage the data for job execution. The job can then be run on Photon or Spark. When the job completes, Trifacta removes the temporary data from base storage or, if caching is enabled, retains it in the cache.

Data ingestion applies to Spark and Trifacta Photon jobs.
Data ingestion applies only to JDBC sources that are not native to the running environment. For example, Trifacta does not support JDBC ingestion for Hive.
Trifacta retains schema information from the schematized source and applies it during publication of the generated results.
Data ingestion is supported for HDFS and other large-scale backend datastores.

Data caching refers to the process of ingesting and storing data sources on the Trifacta node for a period of time, which provides faster access if they are needed for additional platform operations.

Tip

Data ingestion and data caching can work together. For more information on data caching, see Configure Data Source Caching.

Job Type	JDBC Ingestion Enabled only	JDBC Ingestion and Caching Enabled
transformation job	Data is retrieved from the source and stored in a temporary backend location for use in sampling.	Data is retrieved from the source for the job and refreshes the cache where applicable.
sampling job	See previous.	Cache is first checked for valid data objects. Outdated objects are retrieved from the data source. Retrieved data refreshes the cache. Note: Caching applies only to full scan sampling jobs. Quick scan sampling is performed in the Trifacta Photon running environment. As needed, you can force an override of the cache when executing the sample, in which case data is collected from the source. See Samples Panel.
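
The caching column for sampling jobs can be read as a check-then-refresh flow. The following Python sketch illustrates that flow only. The `CacheEntry` type, the age-based validity rule (`max_age_s`), and the `fetch` callback are hypothetical names introduced for illustration; the platform's actual validity check is not described here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CacheEntry:
    data: bytes
    stored_at: float  # seconds since the epoch


def load_for_sampling(
    source_id: str,
    cache: Dict[str, CacheEntry],
    fetch: Callable[[str], bytes],
    max_age_s: float,
    force_refresh: bool = False,
    now: Optional[float] = None,
) -> bytes:
    """Return data for a full scan sample, using the cache when it is valid."""
    now = time.time() if now is None else now
    entry = cache.get(source_id)
    # Assumption: an object is "valid" while it is younger than max_age_s.
    is_valid = entry is not None and (now - entry.stored_at) <= max_age_s
    if is_valid and not force_refresh:
        return entry.data  # valid cached object: no trip to the source
    data = fetch(source_id)  # outdated, missing, or overridden: read the source
    cache[source_id] = CacheEntry(data=data, stored_at=now)  # refresh the cache
    return data
```

Passing `force_refresh=True` mirrors the cache override that is available when executing a sample.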

Recommended Table Size

Although there is no absolute limit, you should avoid executing jobs on tables larger than several hundred GB. Larger data sources can significantly impact end-to-end performance.

Note

This recommendation applies to all JDBC-based jobs.

Performance

Rule of thumb:

For a single job with 16 ingest jobs occurring in parallel, maximum expected transfer rate is 1 GB/minute.
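
As a rough worked example of this rule of thumb, the sketch below converts a table size into an expected transfer time. The function name and the 300 GB example size are illustrative assumptions; actual rates depend on your network and database.

```python
def estimated_ingest_minutes(size_gb: float, rate_gb_per_min: float = 1.0) -> float:
    """Estimate transfer time in minutes from table size and transfer rate."""
    if rate_gb_per_min <= 0:
        raise ValueError("rate_gb_per_min must be positive")
    return size_gb / rate_gb_per_min


# A hypothetical 300 GB table at the 1 GB/minute rule of thumb.
minutes = estimated_ingest_minutes(300)
print(f"{minutes:.0f} minutes, or {minutes / 60:.1f} hours")
```

At the rule-of-thumb rate, a 300 GB table needs about 300 minutes (5 hours) for the transfer alone, which is one reason to keep table sizes below several hundred GB.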

Scalability:

One ingest job runs per source, so a dataset with 3 sources requires 3 ingest jobs.
Rule of thumb for max concurrent jobs for a similar edge node:
```
max concurrent sources = max cores - cores used for services
```
- This rule holds until the network becomes a bottleneck. In internal testing, throughput peaked at about 15 concurrent sources.
- Defaults: 16 concurrent jobs, a connection pool size of 10, and a 2-minute timeout on the pool. These limits prevent overloading your database.
- Adding more concurrent jobs after the network becomes a bottleneck slows down all of the transfer jobs simultaneously.
If processing is fully saturated (the number of workers is maxed out):
- Maximum transfer rate can drop to 1/3 GB/minute.
- Ingest waits for two minutes to acquire a connection. If after two minutes a connection cannot be acquired, the job fails.
When a job is queued for processing:
- Job is silently queued and appears to be in progress.
- Service waits until other jobs complete.
- Currently, there is no timeout for queueing based on the maximum number of concurrent ingest jobs.
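
The scalability rules above can be expressed as simple arithmetic. This is a minimal sketch: the function names and the example core counts are assumptions for illustration, and the default limit of 16 is the default number of concurrent jobs stated above.

```python
def max_concurrent_sources(max_cores: int, service_cores: int) -> int:
    """max concurrent sources = max cores - cores used for services."""
    return max(max_cores - service_cores, 0)


def queued_ingest_jobs(source_count: int, concurrent_limit: int = 16) -> int:
    """One ingest job runs per source; jobs beyond the limit wait in the queue."""
    return max(source_count - concurrent_limit, 0)


# Hypothetical edge node: 16 cores, 4 of them used by platform services.
print(max_concurrent_sources(16, 4))
# A dataset with 20 sources against the default limit of 16 concurrent jobs.
print(queued_ingest_jobs(20))
```

Because queued jobs appear to be in progress and have no queue timeout, estimating the queue depth ahead of time can help explain jobs that seem to stall.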

Enable

To enable JDBC ingestion and performance caching, the first two of the following parameters must be enabled.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Parameter Name	Description
webapp.connectivity.ingest.enabled	Enables JDBC ingestion. Default is `true`.
feature.jdbcIngestionCaching.enabled	Enables caching of ingested JDBC data. Note `webapp.connectivity.ingest.enabled` must be set to `true` to enable JDBC caching. When disabled, no caching of JDBC data sources is performed. For more information on caching, see Configure Data Source Caching.
feature.enableLongLoading	When enabled, you can monitor the ingestion of long-loading JDBC datasets through the Import Data page. Default is `true`. Tip: After a long-loading dataset has been ingested, importing the data and loading it in the Transformer page should perform faster.
feature.enableParquetLongLoading	When enabled, you can monitor the ingestion of long-loading Parquet datasets. Default is `false`.
longloading.addToFlow	When long-loading is enabled, set this value to `true` to enable monitoring of the ingest process when large relational sources are added to a flow. Default is `true`. See Flow View Page.
longloading.addToLibrary	When long-loading is enabled, this feature enables monitoring of the ingest process when large relational sources are added to the library. Default is `true`. See Library Page.
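
If you edit trifacta-conf.json directly, it can help to see how the dotted parameter names might map onto a JSON document. The sketch below assumes that each dot in a parameter name corresponds to one level of nesting in the file; that mapping and the `nest` helper are assumptions for illustration, so verify the structure against your own file before changing it.

```python
import json
from typing import Any, Dict


def nest(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand dotted parameter names into nested dictionaries."""
    root: Dict[str, Any] = {}
    for dotted, value in flat.items():
        node = root
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return root


# The two parameters that must be enabled for ingestion with caching.
settings = {
    "webapp.connectivity.ingest.enabled": True,
    "feature.jdbcIngestionCaching.enabled": True,
}
print(json.dumps(nest(settings), indent=2))
```

The Admin Settings Page remains the recommended way to apply these changes; the sketch is only a reading aid for the file-based method.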

Configure

In the following sections, you can review the available configuration parameters for JDBC ingest.

You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see Platform Configuration Methods.

Configure Ingestion

Parameter Name	Description
batchserver.workers.ingest.max	Maximum number of ingester threads that can run on the Designer Cloud Powered by Trifacta platform at the same time.
batchserver.workers.ingest.bufferSizeBytes	Memory buffer size while copying to backend storage. A larger size for the buffer yields fewer network calls, which in rare cases may speed up ingest.
batch-job-runner.cleanup.enabled	Clean up after job, which deletes the ingested data from backend storage. Default is `true`. Note If JDBC ingestion is disabled, relational source data is not removed from platform backend storage. This feature can be disabled for debugging and should be re-enabled afterward. Note This setting rarely applies if JDBC ingest caching has been enabled.
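
To see why a larger `batchserver.workers.ingest.bufferSizeBytes` value yields fewer network calls, consider the following arithmetic sketch. It assumes one network call per filled buffer, which is a simplification of whatever the ingester actually does, and the sizes used are illustrative.

```python
def write_calls(total_bytes: int, buffer_size_bytes: int) -> int:
    """Number of buffer flushes needed to copy total_bytes (ceiling division)."""
    if buffer_size_bytes <= 0:
        raise ValueError("buffer_size_bytes must be positive")
    return (total_bytes + buffer_size_bytes - 1) // buffer_size_bytes


ONE_GIB = 1024 ** 3
ONE_MIB = 1024 ** 2

# Copying 1 GiB with a 1 MiB buffer compared with a 4 MiB buffer.
print(write_calls(ONE_GIB, ONE_MIB))
print(write_calls(ONE_GIB, 4 * ONE_MIB))
```

Quadrupling the buffer cuts the number of calls to a quarter, but the total bytes transferred are unchanged, which is consistent with the note that a larger buffer speeds up ingest only in rare cases.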

Logging

Parameter Name	Description
data-service.systemProperties.logging.level	When the logging level is set to `debug`, log messages on JDBC caching are recorded in the data service log. Note Use this setting for debug purposes only, as the log files can grow quite large. Lower the setting after the issue has been debugged. See Logging below.

Configure Long Loading

For JDBC sources, you can enable long-loading datasets to be ingested asynchronously. Use this feature to resume working in the application while the loading process completes.

Parameter	Description
`feature.enableLongLoading`	When enabled, the Designer Cloud application loads long-loading datasets asynchronously. You can monitor the ingestion of long-loading JDBC datasets through the Import Data page. Default is true. Note After a long-loading dataset has been ingested, importing the data and loading it in the Transformer page should perform faster.
`feature.parquetLongLoading.enabled`	When enabled, the Designer Cloud application loads large Parquet files asynchronously. You can monitor the ingestion of long-loading Parquet datasets. The default is false.
`feature.parquetLongLoading.sampleLoadTransformMaxBytes`	The maximum number of bytes for a sample load transform job from a long-loading Parquet data source.
`feature.parquetLongLoading.limitTransformMaxBytes`	Maximum number of bytes for a transform job for a long-loading Parquet data source.
`longloading.addToFlow`	When long-loading is enabled, set this value to true to enable monitoring of the ingest process when a user adds large relational sources to a flow. The default is true.
`longloading.addToLibrary`	When long-loading is enabled, this feature enables monitoring of the ingest process when a user adds large relational sources to the library. The default is true.

Monitoring Progress

You can use the following methods to track progress of ingestion jobs.

Through application: In the Job History page, you can track progress of all jobs, including ingestion. Where there are errors, you can download logs for further review.
- See Job History Page.
- See Logging below.
Through APIs:
- You can track status of jobType=ingest jobs through the API endpoints.
- From the above endpoint, get the ingest jobId to track progress.
- See https://api.trifacta.com/ee/9.7/index.html#operation/getJobGroup
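
The API steps above could be scripted along the following lines. This is a sketch under stated assumptions, not a definitive client: the `/v4/jobGroups/{id}` path, the `embed=jobs` query parameter, bearer-token authentication, and the `jobs.data`, `jobType`, `status`, and `id` response fields should all be checked against the getJobGroup reference linked above.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_job_group(base_url: str, job_group_id: int, token: str) -> Dict[str, Any]:
    """Fetch one job group with its jobs embedded (path and auth are assumptions)."""
    url = f"{base_url}/v4/jobGroups/{job_group_id}?embed=jobs"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def ingest_jobs(job_group: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the id and status of every job whose jobType is 'ingest'."""
    jobs = job_group.get("jobs", {}).get("data", [])
    return [
        {"id": job.get("id"), "status": job.get("status")}
        for job in jobs
        if job.get("jobType") == "ingest"
    ]
```

A caller would pass the result of `fetch_job_group` to `ingest_jobs` and then poll using the returned ingest job ids until each status indicates completion or failure.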

Logging

During and after an ingest job, you can download the job logs through the Job History page. Logs include:

- All details, including errors
- Progress on ingest transfer
- Record ingestion

See Job History Page.
