Private Data Processing

Private data processing involves running an Alteryx Analytics Cloud (AAC) data processing cluster inside your own virtual private cloud (VPC) or virtual network (Vnet) in AWS, Azure, or Google Cloud Platform. This combination of your infrastructure, together with Alteryx-managed cloud resources and software, is commonly referred to as a private compute plane or a private data processing environment.

Set Up Private Data Processing for Your Cloud Provider

Use these guides to set up private data processing for your cloud service...

Shared Responsibility Model

For private data processing, Alteryx Analytics Cloud requires clear boundaries of ownership. The shared responsibility matrix represents these boundaries.

(Image: shared responsibility matrix)

Alteryx provides the specification for the Account-wide resources and the VPC. For private data processing, you are responsible for the implementation of this spec.

With the account and VPC available, Alteryx actively manages the cloud resource infrastructure and deployed software within the VPC.

| Resource | Customer | Alteryx |
| --- | --- | --- |
| Account/Subscription/Project-Wide Resources: Account/Subscription/Project Details, IAM Credentials, IAM Roles, IAM Policies | Implementation | Specification |
| Cloud Networking: VPC/Vnet Infrastructure (Subnets, Routing, Endpoints) | Implementation | Specification |
| Cloud Resources: Object Storage; IAM Roles and Policies; Kubernetes; Compute (Virtual Machines); Secret Manager; Managed SQL; Redis; Shared File System; Spark Processing | — | Provisioning and management |
| Software: Kubernetes On-demand Jobs; Kubernetes Long-running Services; Virtual Machines | — | Deployment and maintenance |

Account/Subscription/Project-Wide Resources

At the highest level, Alteryx requires a set of permissions to run a private data plane. However, you will own the AWS account, Azure subscription, or GCP project and the corresponding IAM credentials and IAM policies.

Virtual Private Cloud

At the next level down, Alteryx defines a specification for the VPC or Vnet. This includes the definition of a number of subnets, CIDR blocks, route tables, and endpoints.

You must implement the VPC or Vnet according to this spec.
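As an illustration, the subnet and CIDR portion of the spec can be sanity-checked before you implement it. This sketch uses Python's standard `ipaddress` module; the CIDR values are hypothetical placeholders, not the actual Alteryx spec.

```python
import ipaddress

def validate_subnet_plan(vpc_cidr, subnet_cidrs):
    """Check that every subnet fits inside the VPC CIDR and none overlap."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(c) for c in subnet_cidrs]
    errors = []
    for s in subnets:
        if not s.subnet_of(vpc):
            errors.append(f"{s} is outside VPC range {vpc}")
    for i, a in enumerate(subnets):
        for b in subnets[i + 1:]:
            if a.overlaps(b):
                errors.append(f"{a} overlaps {b}")
    return errors

# Hypothetical CIDR blocks -- the real values come from the Alteryx spec.
print(validate_subnet_plan("10.0.0.0/16", ["10.0.0.0/20", "10.0.16.0/20"]))  # []
```

An empty list means the plan is internally consistent; any string in the result names a subnet that falls outside the VPC range or collides with a sibling.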

Cloud Resources

Once you’ve completed the setup of the AWS account and VPC, Azure subscription and Vnet, or GCP project and VPC, sign in to AAC to trigger the provisioning process that creates your private data processing cluster.

Alteryx creates and manages these cloud resources on your behalf.

Cloud Apps

After you provision the required resources, Alteryx deploys and maintains the software necessary to process your data within the private cluster.

The full list of cloud resources varies depending on which apps you enable within your private data processing environment. For more information, go to the Cloud Apps section on this page.

Cloud Resources

AAC uses automated provisioning pipelines built on Infrastructure as Code (IaC) to create and maintain these resources for you. AAC manages this with Terraform Cloud. Terraform is an IaC tool that lets you define and manage infrastructure resources through human-readable configuration files; Terraform Cloud is a SaaS product from HashiCorp. Private data processing resources are created and managed with a set of Terraform files, Terraform Cloud APIs, and private Terraform Cloud agents running on Alteryx infrastructure.
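Terraform's declarative model boils down to reconciling desired state (the configuration files) against actual state. This toy sketch illustrates that idea only; it is not how Terraform is implemented, and the resource names are hypothetical.

```python
def plan(desired, actual):
    """Toy IaC reconciliation: diff desired vs. actual resources into a plan."""
    to_create = sorted(set(desired) - set(actual))
    to_destroy = sorted(set(actual) - set(desired))
    to_update = sorted(
        name for name in set(desired) & set(actual)
        if desired[name] != actual[name]
    )
    return {"create": to_create, "update": to_update, "destroy": to_destroy}

# Hypothetical resources: one drifted, one missing, one orphaned.
desired = {"bucket": {"versioning": True}, "eks_cluster": {"nodes": 3}}
actual = {"bucket": {"versioning": False}, "old_vm": {"size": "m5.large"}}
print(plan(desired, actual))
# {'create': ['eks_cluster'], 'update': ['bucket'], 'destroy': ['old_vm']}
```

Running the plan repeatedly converges the environment on the declared configuration, which is what lets AAC keep the cluster's resources consistent over time.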

The full list of cloud resources varies depending on which apps you enable within your private data processing environment. The resources might include…

  • Object Storage: Base storage layer for files (for example, uploaded datasets, job outputs, data samples, caching, and other temporary engine files).

  • IAM Roles and Policies: Necessary permissions to provision cloud resources and deploy software.

  • Kubernetes: Runs the VM instances for some AAC services and jobs in the data plane.

  • Compute (Virtual Machines): Compute resources required to run jobs and services.

  • Secret Manager: Storage of infrastructure secrets.

  • Redis: Service-to-service messaging within the VPC.

  • Shared File System: Network attached storage.

  • Spark Processing: (If enabled) Processing of large data jobs.

The specific services used vary by public cloud provider as follows:

| Service | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Object Storage | S3 | Blob Storage | Google Storage |
| IAM Roles and Policies | IAM Roles, IAM Policies | IAM Roles, IAM Policies | IAM Roles |
| Kubernetes | EKS | AKS | GKE |
| Compute (Virtual Machines) | EC2 | Virtual Machines | Compute Instance |
| Secrets Management | Secret Manager | Key Vault | Secret Manager |
| Redis | Amazon MemoryDB | Azure Cache | Google MemoryStore |
| Shared File System | EFS | Azure Files | Google Filestore |
| Spark Processing | Serverless EMR | N/A | N/A |

Cloud Apps

AAC runs a number of jobs and services inside the private data processing environment. The exact combination of infrastructure and software depends on which AAC applications you deploy there. Each application is packaged as a module, which makes it possible to deploy only the cloud resources and software that you need for the applications you want to run.

Each application has a defined module that consists of…

  • Required permissions.

  • Required network setup that includes subnets and IP ranges.

  • Alteryx-managed cloud resources.

  • Alteryx-managed software.

For example, if you want to only deploy Designer Cloud, there are specific permissions and subnets (with IP ranges) that you are required to set up ahead of time. After you perform the setup, you can sign in to AAC and begin the deployment process.

If you want to only deploy Cloud Execution for Desktop, there is a different set of required permissions and subnets and a different box to check in AAC when you perform the deploy.

If you want to deploy both modules into the same private compute plane, you must complete both sets of setup steps, then complete the deploy step for both.
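Deploying multiple modules into one compute plane means satisfying the union of their prerequisites, which can be sketched like this. The module names, permissions, and subnet names below are hypothetical placeholders, not the real requirements.

```python
# Hypothetical module prerequisites -- illustrative only, not the real spec.
MODULES = {
    "designer_cloud": {
        "permissions": {"s3:*", "eks:*", "secretsmanager:*"},
        "subnets": {"designer-private-a", "designer-private-b"},
    },
    "cefd": {
        "permissions": {"ec2:*", "autoscaling:*"},
        "subnets": {"cefd-private-a", "cefd-private-b"},
    },
}

def combined_requirements(selected):
    """Union of prerequisites for every module deployed into one compute plane."""
    permissions, subnets = set(), set()
    for name in selected:
        permissions |= MODULES[name]["permissions"]
        subnets |= MODULES[name]["subnets"]
    return permissions, subnets

perms, subnets = combined_requirements(["designer_cloud", "cefd"])
print(sorted(subnets))
```

The point is that the modules are additive: enabling a second module never removes a prerequisite, it only adds the second module's permissions and subnets to the setup you must complete ahead of time.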

Designer Cloud Module

When you deploy the Designer Cloud module, AAC provisions these cloud resources. 

Required Services

You can find the exact service names for each cloud provider in the Cloud Resources section.

  • Object Storage

  • Kubernetes

  • Compute

  • Secret Manager

  • Redis

  • Shared File System

  • (Optional) Spark Processing

Node Groups and Types

Within the Kubernetes cluster, Alteryx provisions these compute resources for each cloud provider. These node types and priorities might change over time as the cloud provider evolves. For now, Alteryx strikes a balance between a few factors…

  • AMD machine types are less expensive than Intel machine types.

  • Some job types run best with memory-optimized or compute-optimized nodes. However, for some cloud providers, these node types are much more expensive, while the general purpose types are more affordable.

  • AWS allows Alteryx to specify a priority order of node types and provisions them as needed in priority order. Alteryx recommends this order: memory-optimized AMD machine types, then fall back to Intel machine types, then general purpose machine types.

| Node Group Type | AWS | Azure | GCP |
| --- | --- | --- | --- |
| common | t3a.2xlarge, t3.2xlarge | Standard_D2s_v3 | n2d-standard-2 |
| convert | r6a.2xlarge, r6i.2xlarge, m6a.4xlarge, m6i.4xlarge | Standard_B16as_v2 | n2d-standard-16 |
| data-system | Same as convert | Same as convert | Same as convert |
| file-system | Same as convert | Same as convert | Same as convert |
| photon | Same as convert | Same as convert | Same as convert |

The convert, data-system, file-system, and photon node groups have a minimum scale set of 1 and a maximum of 30.
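The scaling bounds above can be pictured as a simple clamp on the node count. This is a toy illustration under assumed numbers, not the actual cluster-autoscaler logic.

```python
def desired_nodes(pending_pods, pods_per_node, min_nodes=1, max_nodes=30):
    """Clamp the node count implied by pending pods into the group's bounds."""
    needed = -(-pending_pods // pods_per_node)  # ceiling division
    return max(min_nodes, min(needed, max_nodes))

print(desired_nodes(0, 8))    # 1  -> never scales below the minimum
print(desired_nodes(50, 8))   # 7
print(desired_nodes(500, 8))  # 30 -> capped at the maximum
```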

Software

Within the Kubernetes cluster, the Designer Cloud module uses both on-demand jobs and long-running services.

Kubernetes On-demand Jobs

For Kubernetes on-demand jobs, AAC retrieves a container image (from cache or from a central store) and deploys it within an ephemeral pod that lasts for the duration of the job. All executables are in Java or Python.

  • conversion-jobs: Convert datasets from 1 format to another as needed within a workflow.

  • connectivity-jobs: Connect to external data systems at runtime.

  • photon-jobs: Photon is an Alteryx in-memory prep and blend engine used at runtime for smaller datasets.

  • amp-jobs: AMP is an Alteryx in-memory prep and blend runtime engine utilized primarily in Designer Experience.

  • publish-jobs: Write processed data to the output destination specified within the workflow.
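The ephemeral-pod pattern above roughly corresponds to a Kubernetes Job whose finished pods are garbage-collected. This hypothetical manifest sketch, expressed as a Python dict, illustrates the idea; the names, image, and TTL are assumptions, not the actual AAC job definition.

```python
# A hypothetical manifest for an on-demand conversion job. The image name,
# job name, and TTL are illustrative placeholders.
def on_demand_job_manifest(job_id, image):
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"conversion-job-{job_id}"},
        "spec": {
            # Delete the finished Job (and its pod) shortly after completion,
            # so the pod only lasts for the duration of the job.
            "ttlSecondsAfterFinished": 60,
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "convert", "image": image}],
                }
            },
        },
    }

manifest = on_demand_job_manifest("42", "registry.example.com/conversion:1.0")
print(manifest["metadata"]["name"])  # conversion-job-42
```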

Kubernetes Long-running Services

Alteryx uses Argo CD to deploy and maintain long-running services in your Kubernetes cluster. Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.

Most long-running services in the cluster serve a utility function to allow Alteryx to monitor cluster health, scale the cluster up and down, and import/export secrets to and from the cloud-native key store and the Kubernetes secret store. These services are common to all modules that utilize Kubernetes and only 1 instance of these services will ever run at a time, even when you specify multiple modules that require them.

  • teleport-agent: Sets up a secure way for Alteryx SRE to connect to the cluster for troubleshooting. AAC pulls the helm chart from the https://charts.releases.teleport.dev repository. Alteryx doesn't scan this third-party image.

  • datadog-agent: Collects logs and metrics from the cluster. AAC pulls the helm chart from the https://helm.datadoghq.com repository. Alteryx doesn't scan this third-party image.

  • keda: Auto-scaling of long-running services based on custom metrics, with Kafka support. Alteryx doesn't scan this third-party image.

  • external-secrets: Import/export between AWS Secret Manager or Azure Key Vault secrets and the Kubernetes Secrets Store. Alteryx doesn't scan this third-party image.

  • cluster-autoscaler: Scale EKS, AKS, or GKE nodes based on pod demand. Alteryx doesn't scan this third-party image.

  • metrics-server: Allow EKS, AKS, or GKE to use the metrics API. Alteryx doesn't scan this third-party image.

  • kubernetes-reflector: Replication of the dockerConfigJson secret across all namespaces. Alteryx doesn't scan this third-party image.

The Designer Cloud module also deploys long-running services that service specific needs.

  • data-service: Connects to external data systems at design-time via the JDBC API. Alteryx developed this service. Snyk scans the image for vulnerabilities.

Cloud Execution for Desktop Module

When you deploy the Cloud Execution for Desktop module, AAC provisions these cloud resources. 

Required Services

The Cloud Execution for Desktop module doesn't utilize Kubernetes. Instead, the module deploys a machine image that contains all the necessary software to execute Designer Desktop workflows. As such, the module only uses the compute service from each cloud provider. The exact service names for each cloud provider are in the Cloud Resources section.

  • Compute

Autoscale Groups and Node Types

Cloud Execution for Desktop deploys 2 or more virtual machines in an autoscale group.

These node types and priorities might change over time as the cloud provider evolves. For now, Alteryx strikes a balance between a few factors…

  • AMD machine types are less expensive than Intel machine types.

  • AWS allows Alteryx to specify a priority order of node types and provisions them as needed in priority order. Alteryx recommends this order: memory-optimized AMD machine types, then fall back to Intel machine types, then general purpose machine types.

| | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Node Type | m5a.4xlarge | Standard_B16as_v2 | n2d-standard-16 |

Software

On a virtual machine, the Cloud Execution for Desktop module runs a few utility services for monitoring as well as the engine workers that process Designer Desktop jobs.

  • cefd-worker: These workers run the Alteryx in-memory engine to initiate connections to data sources, process data, and publish job outputs. Jobs are containerized and run inside a container in the virtual machine.

  • consumer-service: This service consumes messages from a Kafka queue that is fed by an AAC service in the control plane. These messages are the trigger to run a workflow.

  • teleport-agent: Sets up a secure way for Alteryx SRE to connect to the cluster for troubleshooting. AAC pulls the helm chart from the https://charts.releases.teleport.dev repository. Alteryx doesn't scan this third-party image.

  • datadog-agent: Collects logs and metrics from the cluster. AAC pulls the helm chart from the https://helm.datadoghq.com repository. Alteryx doesn't scan this third-party image.
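The consumer-service pattern above (a queued message triggers one workflow run) can be sketched with an in-process queue. This is a toy stand-in for the real Kafka consumer; the message shape and function names are hypothetical.

```python
import queue

def consume_run_triggers(messages, run_workflow):
    """Toy consumer loop: each queued message triggers one workflow run."""
    q = queue.Queue()
    for m in messages:
        q.put(m)
    results = []
    while not q.empty():
        msg = q.get()  # in production this would be a Kafka message
        results.append(run_workflow(msg["workflow_id"]))
        q.task_done()
    return results

runs = consume_run_triggers(
    [{"workflow_id": "wf-1"}, {"workflow_id": "wf-2"}],
    run_workflow=lambda wf: f"started {wf}",
)
print(runs)  # ['started wf-1', 'started wf-2']
```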

Machine Learning Module

When you deploy the Machine Learning module, AAC provisions these cloud resources. 

Required Services

You can find the exact service names for each cloud provider in the Cloud Resources section.

  • Object Storage

  • Kubernetes

  • Compute

  • Secret Manager

  • Redis

  • Shared File System

  • (Optional) Spark Processing

Node Groups and Types

Within the Kubernetes cluster, Alteryx provisions these compute resources for each cloud provider. These node types and priorities might change over time as the cloud provider evolves. For now, Alteryx strikes a balance between a few factors…

  • AMD machine types are less expensive than Intel machine types.

  • Some job types run best with memory-optimized or compute-optimized nodes. However, for some cloud providers, these node types are much more expensive, while the general purpose types are more affordable.

  • AWS allows Alteryx to specify a priority order of node types and provisions them as needed in priority order. Alteryx recommends this order: memory-optimized AMD machine types, then fall back to Intel machine types, then general purpose machine types.

| Node Group Type | AWS | Azure | GCP |
| --- | --- | --- | --- |
| common | t3a.2xlarge, t3.2xlarge | Standard_D2s_v3 | n2d-standard-2 |
| automl | r6a.2xlarge, r6i.2xlarge, m6a.4xlarge, m6i.4xlarge | Standard_B16as_v2 | n2d-standard-16 |

The automl node group has a minimum scale set of 1 and a maximum of 30.

Software

Within the Kubernetes cluster, the Machine Learning module uses both on-demand jobs and long-running services.

Kubernetes On-demand Jobs

For Kubernetes on-demand jobs, AAC retrieves a container image (from cache or from a central store) and deploys it within an ephemeral pod that lasts for the duration of the job.

  • automl-jobs: Job service for model training and execution.

Kubernetes Long-running Services

Alteryx uses Argo CD to deploy and maintain long-running services in your Kubernetes cluster. Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.

Most long-running services in the cluster serve a utility function to allow Alteryx to monitor cluster health, scale the cluster up and down, and import/export secrets to and from the cloud-native key store and the Kubernetes secret store. These services are common to all modules that utilize Kubernetes and only 1 instance of these services will ever run at a time, even when you specify multiple modules that require them.

  • teleport-agent: Sets up a secure way for Alteryx SRE to connect to the cluster for troubleshooting. AAC pulls the helm chart from the https://charts.releases.teleport.dev repository. Alteryx doesn't scan this third-party image.

  • datadog-agent: Collects logs and metrics from the cluster. AAC pulls the helm chart from the https://helm.datadoghq.com repository. Alteryx doesn't scan this third-party image.

  • keda: Auto-scaling of long-running services based on custom metrics, with Kafka support. Alteryx doesn't scan this third-party image.

  • external-secrets: Import/export between AWS Secret Manager or Azure Key Vault secrets and the Kubernetes Secrets Store. Alteryx doesn't scan this third-party image.

  • cluster-autoscaler: Scale EKS, AKS, or GKE nodes based on pod demand. Alteryx doesn't scan this third-party image.

  • metrics-server: Allow EKS, AKS, or GKE to use the metrics API. Alteryx doesn't scan this third-party image.

  • kubernetes-reflector: Replication of the dockerConfigJson secret across all namespaces. Alteryx doesn't scan this third-party image.

Business Continuity

Private data processing environments are available in regions that have at least 3 availability zones. This allows the private data processing environment to run in 2 availability zones and fail over to the third.

Backups for the private object storage are your responsibility.

Depending on the job type, data processing jobs run either in an ephemeral pod in a Kubernetes cluster or in a container in a virtual machine. If an outage affects an actively running job, it is likely the job will fail and you will need to rerun it.

Supported Regions

To run a private data processing environment in a particular region, AAC has these requirements…

  1. The region must have 3 or more availability zones.

  2. The region must provide the necessary cloud resources as described in the Cloud Resources section.

  3. The region must provide the necessary node types as described in the Cloud Apps section.
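The three requirements above can be expressed as a simple predicate. The region facts passed in are hypothetical; check your cloud provider's documentation for real values.

```python
def region_supported(region):
    """A region qualifies when it meets all three requirements above."""
    return (
        region["availability_zones"] >= 3
        and region["has_required_services"]
        and region["has_required_node_types"]
    )

# Hypothetical region facts -- not real provider data.
print(region_supported(
    {"availability_zones": 3, "has_required_services": True,
     "has_required_node_types": True}
))  # True
```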

Here are the available regions for each cloud provider:

| Global Region | Region | AWS | Azure | GCP |
| --- | --- | --- | --- | --- |
| Africa | Johannesburg, South Africa |  | southafricanorth |  |
| Asia Pacific | Delhi, India |  |  | asia-south2 |
| Asia Pacific | Hong Kong | ap-east-1 | eastasia | asia-east2 |
| Asia Pacific | Indonesia |  |  | asia-southeast2 |
| Asia Pacific | Mumbai, India | ap-south-1 |  | asia-south1 |
| Asia Pacific | Pune, India |  | centralindia |  |
| Asia Pacific | Osaka, Japan |  |  | asia-northeast2 |
| Asia Pacific | Seoul, South Korea | ap-northeast-2 | koreacentral | asia-northeast3 |
| Asia Pacific | Singapore | ap-southeast-1 | southeastasia | asia-southeast1 |
| Asia Pacific | Sydney, Australia | ap-southeast-2 | australiaeast | australia-southeast1 |
| Asia Pacific | Taiwan |  |  | asia-east1 |
| Asia Pacific | Tokyo, Japan | ap-northeast-1 | japaneast | asia-northeast1 |
| Europe | Belgium |  |  | europe-west1 |
| Europe | Berlin, Germany |  |  | europe-west10 |
| Europe | Finland |  |  | europe-north1 |
| Europe | Frankfurt, Germany | eu-central-1 | germanywestcentral | europe-west3 |
| Europe | Gävle, Sweden |  | swedencentral |  |
| Europe | Ireland | eu-west-1 | northeurope |  |
| Europe | London, United Kingdom | eu-west-2 | uksouth | europe-west2 |
| Europe | Madrid, Spain |  |  | europe-southwest1 |
| Europe | Milan, Italy |  |  | europe-west8 |
| Europe | Netherlands |  | westeurope | europe-west4 |
| Europe | Oslo, Norway |  | norwayeast |  |
| Europe | Paris, France | eu-west-3 | francecentral | europe-west9 |
| Europe | Stockholm, Sweden | eu-north-1 |  |  |
| Europe | Turin, Italy |  |  | europe-west12 |
| Europe | Warsaw, Poland |  | polandcentral | europe-central2 |
| Europe | Zurich, Switzerland |  | switzerlandnorth |  |
| Middle East | Qatar |  | qatarcentral |  |
| Middle East | United Arab Emirates |  | uaenorth |  |
| North America | Arizona |  | westus3 |  |
| North America | California |  |  | us-west2 |
| North America | Iowa |  | centralus | us-central1 |
| North America | Montreal, Canada | ca-central-1 |  | northamerica-northeast1 |
| North America | Toronto, Canada |  | canadacentral |  |
| North America | Nevada |  |  | us-west4 |
| North America | North Virginia | us-east-1 |  |  |
| North America | Ohio | us-east-2 |  | us-east5 |
| North America | Oregon | us-west-2 |  | us-west1 |
| North America | South Carolina |  |  | us-east1 |
| North America | Texas |  | southcentralus |  |
| North America | Utah |  |  | us-west3 |
| North America | Virginia |  | eastus, eastus2 | us-east4 |
| North America | Washington |  | westus2 |  |
| South America | São Paulo, Brazil | sa-east-1 | brazilsouth | southamerica-east1 |