
Metadata

Metadata is a set of data used to describe the input or output data of a Python SDK plugin. Plugins use Python Arrow data types to exchange data with Alteryx Designer. The metadata is used to map this data during the conversion to or from the internal Designer data format (Alteryx Multi-threaded Processing or AMP).

Metadata Description

  • name: This is the same as the column name in a dataset.

  • type: The internal AMP Designer type.

  • size: The size of the data, in bytes.

  • scale: Used by only 1 type: fixeddecimal.

  • source: This is a string that describes the origin of the data (for example, if data is read from a file in Designer, this is set to the file path of this input file).

  • description: This is an all-purpose string field.

Why Use Metadata?

It's not mandatory to use metadata in Python plugins. The simplest method is typically to create a Python Arrow schema directly, as the engine derives the necessary types when the data is sent to Designer. But some plugins might need to export output data types to Alteryx Designer at plugin creation to build the workflow. In this case, metadata has to be sent to Designer through the __init__ method.

Also, sometimes you might need to fine-tune the data type used by Designer. Metadata is the only way to be sure to map a precise type of data to Designer because Python Arrow types might not allow for that. You might also want to modify fields like source or description to send some specific information through the Designer pipeline. The plugin can also read input metadata for its own purposes. For example, the source field could hold a file path that the plugin opens.

Usage

You can use metadata in different places.

__init__ Method

This method is called at plugin creation when Designer requests the output data types from the plugin before the workflow runs. You can send metadata to Designer at this time.

Example

In this example, the data schema is a table with 2 columns of types int16 and string.

def __init__(self, provider: AMPProviderV2):
    provider.push_outgoing_metadata("Output", create_schema({
        "volts": {
            "type": FieldType.int16
        },
        "device": {
            "type": FieldType.string
        }
    }))

on_record_batch Method

This method is called each time Designer sends data through the plugin. You can extract metadata from the input data for information and modify output data metadata. However, we don't recommend modifying the type, size, and scale as unexpected results can occur in Designer. You can modify the source and description.
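The recommendation above can be enforced with a small guard applied to a metadata update before it is passed to set_metadata. The following is a minimal sketch in plain Python and is not part of the SDK: the helper name keep_safe_metadata and the SAFE_KEYS constant are hypothetical, and the only thing taken from this page is that source and description are the fields that are safe to change.

```python
# Hypothetical helper, not part of ayx_python_sdk.
# Only "source" and "description" are treated as safe to modify.
SAFE_KEYS = {"source", "description"}


def keep_safe_metadata(col_meta):
    """Return a filtered copy of col_meta; the input dictionary is not modified."""
    filtered = {}
    for column, fields in col_meta.items():
        safe = {key: value for key, value in fields.items() if key in SAFE_KEYS}
        if safe:  # skip columns with nothing left to update
            filtered[column] = safe
    return filtered


requested = {
    "volts": {"description": "tension", "size": 4},
    "device": {"type": 9},
}
print(keep_safe_metadata(requested))
# {'volts': {'description': 'tension'}}
```

The filtered dictionary has the same shape as the col_meta argument shown in the set_metadata examples on this page.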

on_complete Method

On input plugins, a complete schema created with metadata can be exported to Designer for a precise data schema.

Use Cases

Define a Schema at __init__

In this example, we specify type int16 for column volts and type string for column device.

def __init__(self, provider: AMPProviderV2):
    provider.push_outgoing_metadata("Output", create_schema({
        "volts": {
            "type": FieldType.int16
        },
        "device": {
            "type": FieldType.string
        }
    }))

Add Optional Information to the Data Flow

In this case, description metadata is added to the batch.

def on_record_batch(self, batch: "pa.Table", anchor: Anchor) -> None:
    batch = set_metadata(batch, {
        "volts": {
            "description": "Define the tension of the current"
        }
    })

    self.provider.write_to_anchor("Output", batch)

Get Information about the Incoming Data

This example shows how to use the source metadata. In this case, the source metadata is a file path, but it can be anything that describes the address of the source of the data.

def on_record_batch(self, batch: "pa.Table", anchor: Anchor) -> None:
    meta = get_metadata(batch, "volts")
    filepath = meta["source"]
    with open(Path(filepath)) as f:
        content = f.read()
        #...additional operations...
    
    self.provider.write_to_anchor("Output", batch)

Specifications

For V2 plugins, metadata is stored in PyArrow tables as a Python dictionary in each column schema. But Arrow doesn’t offer easy methods to access this metadata. The ayx_python_sdk.core.utils module provides functions to ease this access.

set_metadata

set_metadata(table, col_meta={}, schema=None)

This function updates column metadata from the given PyArrow table with a dictionary of column metadata or directly with a PyArrow Schema. It returns the new table with the new metadata. The input table remains unchanged.

Example

This example modifies the metadata of 2 columns, volts and ampere.

def on_record_batch(self, batch: "pa.Table", anchor: Anchor) -> None:
    batch = set_metadata(batch, {
        "volts": {
            "description": "tension"
        },
        "ampere": {
            "source": "https://something.com"
        }
    })

    self.provider.write_to_anchor("Output", batch)

get_metadata

get_metadata(table, col_name)

This function returns metadata from the input PyArrow table: the metadata of 1 column if col_name is given, or the metadata of all columns otherwise.

  • If col_name is given, only the dictionary of metadata for this column is returned.

  • If col_name is not given, it returns a dictionary of column names and their corresponding metadata dictionary.

Examples

def on_record_batch(self, batch: "pa.Table", anchor: Anchor) -> None:
    meta_for_column_volts = get_metadata(batch, "volts")

In the above example, get_metadata returns the metadata for 1 column in the following format:

meta_for_column_volts = {
    "type": 3,
    "size": 2,
    "scale": 0,
    "source": "",
    "description": ""
}

Note

You can see in the above example that a number represents the type. Please refer to the Types section below for the mapping between type IDs and type names.

This example returns all metadata:

def on_record_batch(self, batch: "pa.Table", anchor: Anchor) -> None:
    all_meta = get_metadata(batch)

The result dictionary looks like this:

all_meta = {
    "volts": {
        "type": 3,
        "size": 2,
        "scale": 0,
        "source": "",
        "description": ""
    },
    "device": {
        "type": 9,
        "size": 64,
        "scale": 0,
        "source": "",
        "description": ""
    }
}

Refer to this sequence diagram for a visual representation of the metadata lifecycle:

workflow-metadata-lifecycle.png

You can find a more complete example at Metadata Plugin Example.

Types

The ayx_python_sdk.core.field module provides a definition of the Designer types with the class FieldType.

It defines a mapping between type names and a number:

FieldType.bool = 1
FieldType.byte = 2
FieldType.int16 = 3
FieldType.int32 = 4
FieldType.int64 = 5
FieldType.fixeddecimal = 6
FieldType.float = 7
FieldType.double = 8
FieldType.string = 9
FieldType.wstring = 10
FieldType.v_string = 11
FieldType.v_wstring = 12
FieldType.date = 13
FieldType.time = 14
FieldType.datetime = 15
FieldType.blob = 16
FieldType.spatialobj = 17
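Because get_metadata reports the type as a number, a reverse lookup can make logs and error messages easier to read. The sketch below copies the list above into a plain dictionary so that it runs without the SDK. TYPE_NAMES and type_name are hypothetical names used for illustration only. A real plugin would presumably rely on FieldType from ayx_python_sdk.core.field instead.

```python
# Stand-in copy of the ID-to-name list on this page (not the SDK's FieldType class).
TYPE_NAMES = {
    1: "bool", 2: "byte", 3: "int16", 4: "int32", 5: "int64",
    6: "fixeddecimal", 7: "float", 8: "double", 9: "string",
    10: "wstring", 11: "v_string", 12: "v_wstring", 13: "date",
    14: "time", 15: "datetime", 16: "blob", 17: "spatialobj",
}


def type_name(type_id):
    """Translate a numeric type ID into its name, or flag it as unknown."""
    return TYPE_NAMES.get(type_id, f"unknown({type_id})")


# Metadata dictionary in the format returned for 1 column.
meta = {"type": 3, "size": 2, "scale": 0, "source": "", "description": ""}
print(type_name(meta["type"]))  # int16
```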

If only the type is specified at schema creation, the other metadata fields are automatically filled ("size", "scale", and "source").

For string types, "size" defaults to the maximum allowed size, but you can specify a size manually. For example, if you set 4 as the size for a string, all strings in the corresponding column are truncated to 4 characters.
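The effect of a manual string size can be pictured with plain Python slicing. This is only an illustration of the truncation described above, not the code Designer runs, and truncate_to_size is a hypothetical helper name.

```python
def truncate_to_size(values, size):
    """Keep at most `size` characters of each string, as a size-limited column would."""
    return [value[:size] for value in values]


print(truncate_to_size(["sensor-A", "B12", "probe"], 4))
# ['sens', 'B12', 'prob']
```

Strings shorter than the size pass through unchanged, so only values longer than the declared size lose data.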

Special Types

  • The type FieldType.blob is not supported yet.

  • The type FieldType.spatialobj supports spatial objects using the text format WKT. To use it, the "source" metadata must be "WKT". This value is set automatically, so do not modify the source metadata field for such columns.

  • The type FieldType.fixeddecimal relies on "size" and "scale" metadata items to specify the size of the integer part (size) and the size of the fractional part (scale).
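To make the size and scale description for fixeddecimal concrete, the sketch below checks whether a decimal value fits a given pair. It takes the wording above at face value (size limits the digits of the integer part, scale limits the digits of the fractional part), which is an assumption about how Designer interprets these fields. The function fits_fixeddecimal is hypothetical and uses only the standard decimal module.

```python
from decimal import Decimal


def fits_fixeddecimal(value, size, scale):
    """Check a finite decimal string against integer-digit and fraction-digit limits.

    Assumption: size = maximum integer digits, scale = maximum fractional digits.
    """
    _, digits, exponent = Decimal(value).as_tuple()
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return integer_digits <= size and fractional_digits <= scale


print(fits_fixeddecimal("123.45", 3, 2))  # True
print(fits_fixeddecimal("1234.5", 3, 2))  # False: 4 integer digits
print(fits_fixeddecimal("1.234", 3, 2))   # False: 3 fractional digits
```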