Named Entity Recognition

Version: 2022.1
Last modified: October 31, 2022

Use the Named Entity Recognition tool to identify entities, like people, places, and things, in text. The tool leverages the named entity recognition capabilities in the spaCy package. You can use the predefined set of entities or your own custom entities. 

This tool is part of Alteryx Intelligence Suite. Intelligence Suite requires a separate license and an add-on installer for Designer. After you install Designer, install Intelligence Suite and start your free trial.

Language support

The Named Entity Recognition tool supports English, French, German, Italian, Portuguese, and Spanish.

Tool Components

The Named Entity Recognition tool has 4 anchors.

  • D input anchor: Connect the text data that contains the entities you want to identify.
  • E input anchor (optional): Connect the data with the custom entities you want to identify. This data must contain the custom entity names and labels you want to use to train the model.
  • D output anchor: Output new columns of data that display information about the entities in your data.
  • M output anchor: Output the model object downstream for use with new data. The model object is compatible with the Predict tool.

Default Model Configuration

Configure the Tool

  1. Drag the tool onto the canvas.
  2. Connect the D input anchor to the text data that contains the entities you want to identify.
  3. Select the Language of the text data.
  4. Select the Column with Text.
  5. Run the workflow.

Default English Entity List

  • PERSON: Fictional and non-fictional people.
  • NORP: Nationality, religion, or political group.
  • FAC: Facilities such as buildings, airports, highways, and bridges.
  • ORG: Organizations such as companies, agencies, and institutions.
  • GPE: Geographical entities such as countries, cities, and states.
  • LOC: Non-GPE locations such as mountain ranges, bodies of water, and continents.
  • PRODUCT: Products such as vehicles and foods. Excludes services.
  • EVENT: Events such as named hurricanes, wars, and sports events.
  • WORK_OF_ART: Works of art such as books, songs, and movies.
  • LAW: Named documents made into laws.
  • LANGUAGE: Named languages.
  • DATE: Absolute or relative dates or periods.
  • TIME: Times smaller than a day.
  • PERCENT: Percentage, includes "%" and the word "percent."
  • MONEY: Monetary value, includes the unit.
  • QUANTITY: Measurements such as height, weight, and distance.
  • ORDINAL: Ordinal entities such as first, second, and third.
  • CARDINAL: Numerals that don't fall under another numerical category.

You can find the default entity lists for the other languages in the spaCy documentation.

Custom Model Configuration

If you want to use your own custom entities to train the model, select Train with New Entities. Your source content must contain at least 20 instances of each custom entity. Connect your custom entities to the E input anchor.

Custom Entity List Format

You can use the Text Input tool to pass your own custom entities to the E input anchor. The tool uses your entity list to train a new model. The entity list has two columns, one for the entity and one for its label, as in these examples:

Entity               Label
Riesling             GRAPE
Sauvignon Blanc      GRAPE
Pinot Noir           GRAPE
Syrah                GRAPE
Cabernet Sauvignon   GRAPE
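
Before you train, it can help to check that the source text meets the 20-instance minimum. The following is a minimal Python sketch of such a check, run outside Designer. It is not part of the tool: the function names are hypothetical, and it assumes that an "instance" means a whole-word match of the entity text, which might differ from how the tool counts.

```python
import re
from collections import Counter

MIN_INSTANCES = 20  # minimum stated on this page


def count_entity_mentions(texts, entities, case_sensitive=False):
    """Count whole-word occurrences of each entity string across all texts."""
    flags = 0 if case_sensitive else re.IGNORECASE
    counts = Counter()
    for entity in entities:
        # Lookarounds keep "Syrah" from matching inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(entity) + r"(?!\w)", flags)
        counts[entity] = sum(len(pattern.findall(text)) for text in texts)
    return counts


def underrepresented(texts, entity_rows, minimum=MIN_INSTANCES, case_sensitive=False):
    """Return {entity: count} for entities that appear fewer than `minimum` times.

    `entity_rows` is a list of (entity, label) pairs, like the table above.
    """
    entities = [entity for entity, _label in entity_rows]
    counts = count_entity_mentions(texts, entities, case_sensitive)
    return {entity: n for entity, n in counts.items() if n < minimum}


# Hand-written sample data for illustration only.
sample_texts = ["Riesling and riesling pair well."] * 10 + ["Syrah"] * 3
sample_rows = [("Riesling", "GRAPE"), ("Syrah", "GRAPE")]
print(underrepresented(sample_texts, sample_rows))
```

Any entity the check reports needs more examples in the text connected to the D input anchor before training.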

Configure the Tool

  1. Drag the tool onto the canvas.
  2. Connect the D input anchor to the text data that contains the entities you want to identify.
  3. Connect the E input anchor to your custom entity list.
  4. Select the Language of the text data connected to the D input anchor.
  5. Select the Column with Text from the text data connected to the D input anchor.
  6. Select Train with New Entities.
  7. Select the Column with Entities from the custom entity list connected to the E input anchor.
  8. Select the Column with Labels from the custom entity list connected to the E input anchor.
  9. Select the Case Sensitive check box if you want the model to treat uppercase and lowercase letters as different.
  10. (Optional) Configure the Train Model section. Refer to the following section for details.
  11. Run the workflow.

Train Model

Epochs

An epoch is a single pass (forward and backward) of all data in a training set through a neural network. Epochs are related to iterations, but they are not the same: an iteration is a single pass of one batch of the training set, so one epoch consists of as many iterations as there are batches.

Increasing the number of epochs allows the model to learn from the training set for a longer time. But doing that also increases the computational expense.

You can increase the number of epochs to help reduce error in the model. But at some point, the amount of error reduction might not be worth the added computational expense. Also, increasing the number of epochs too much can cause problems of overfitting, while not using enough epochs can cause problems of underfitting.

By default, the tool uses 10 epochs.
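
The relationship between epochs and iterations comes down to simple arithmetic. The sketch below uses the default values from this page (10 epochs, batch size 32). The function names and the example count of 1,000 training records are made up for illustration.

```python
import math


def iterations_per_epoch(n_examples, batch_size):
    """Number of batches needed to pass every example through the network once."""
    return math.ceil(n_examples / batch_size)


def total_iterations(n_examples, batch_size=32, epochs=10):
    """Total iterations for a full training run with no early stopping."""
    return epochs * iterations_per_epoch(n_examples, batch_size)


# 1,000 examples / 32 per batch -> 32 batches per epoch (the last one partial).
print(iterations_per_epoch(1000, 32))  # 32
print(total_iterations(1000))          # 320
```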

Early Stopping

Early stopping is a method that tells an iterative machine learning method, like the convolutional neural network used in the Named Entity Recognition tool, when to stop learning. Named Entity Recognition uses F1 as the metric for early stopping.

Early stopping is helpful when your model has problems of overfitting. Overfitting occurs when your model learns by memorizing the answers, rather than identifying the underlying patterns in your data. You can also use early stopping to prevent the algorithm from running through unnecessary epochs.

Use early stopping if you're concerned that your model might overfit your data or that additional epochs won't improve your model.

By default, the tool uses early stopping.
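
The general idea can be sketched in a few lines of Python. This is an illustration of one common early-stopping rule, not the tool's actual implementation: the `patience` parameter and the F1 values are hypothetical, and this page does not document the stopping rule that the tool uses internally.

```python
def train_with_early_stopping(f1_per_epoch, max_epochs=10, patience=2):
    """Stop once validation F1 has not improved for `patience` epochs in a row.

    `f1_per_epoch` stands in for the F1 score measured after each epoch.
    Returns (epochs_run, best_epoch, best_f1).
    """
    best_f1 = float("-inf")
    best_epoch = 0
    epochs_without_gain = 0
    epochs_run = 0
    for epoch, f1 in enumerate(f1_per_epoch[:max_epochs], start=1):
        epochs_run = epoch
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # further epochs are unlikely to help
    return epochs_run, best_epoch, best_f1


# Made-up scores: F1 peaks at epoch 3, then stalls.
scores = [0.60, 0.70, 0.74, 0.73, 0.74, 0.72, 0.71]
print(train_with_early_stopping(scores))  # (5, 3, 0.74)
```

In this example, training stops after epoch 5 instead of running all 7 available epochs, and the best model is the one from epoch 3.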

Batch Size

A batch is a subset of the entire training dataset.

Decreasing the batch size allows you to stagger how much data passes through a neural network at any given time. Doing that allows you to train models without taking up as much memory as you would if passing all data through a neural network at once. Sometimes batching can speed up training. But breaking your data into batches might also increase error in the model.

Separate your data into batches when your machine is unable to process all the data at once, or if you want to reduce training time.

By default, the tool uses a batch size of 32.
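
As a rough illustration of what batching means, the Python sketch below splits a training set into batches of at most 32 items. The function name is hypothetical, and the tool might build its batches differently (for example, by shuffling first).

```python
def make_batches(examples, batch_size=32):
    """Yield consecutive slices of `examples`, each with at most `batch_size` items."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# 70 placeholder examples split into batches of 32.
batch_sizes = [len(batch) for batch in make_batches(list(range(70)))]
print(batch_sizes)  # [32, 32, 6]
```

Only one batch is held in memory by the network at a time, which is why a smaller batch size lowers memory use.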

Output

The D output anchor adds 2 columns to the output:

  • entities: This column contains JSON with a list of the entities found in the text. Each entity has these fields:
    • entity: Entity found by the model.
    • label: The entity label.
    • character_index: The index of the first character of the entity in the body of text. The index starts at 0.
    • word_index: The index of the word in the body of text. The index starts at 0.
    • entity_length: Character length of the entity.
  • entity_diagram: This column contains your text with labeled entities and is visible with the Browse tool.

The M output anchor contains a model object. You can save the model object and use it on new data with the Predict tool.
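
To show how the index fields relate to the text, here is a small Python sketch. The JSON below is hand-written for illustration and is not actual tool output. Its overall shape (a top-level "entities" list) is an assumption, as is the reading that character_index plus entity_length marks the end of the entity.

```python
import json

text = "Michael Jordan played in Chicago."

# Hand-written sample that uses the field names listed above (assumed layout).
entities_json = json.dumps({
    "entities": [
        {"entity": "Michael Jordan", "label": "PERSON",
         "character_index": 0, "word_index": 0, "entity_length": 14},
        {"entity": "Chicago", "label": "GPE",
         "character_index": 25, "word_index": 4, "entity_length": 7},
    ]
})


def check_offsets(text, entities_json):
    """Return (entity, label, slice_matches) for each entity record."""
    results = []
    for record in json.loads(entities_json)["entities"]:
        start = record["character_index"]
        end = start + record["entity_length"]
        # Slicing the text with the two fields should reproduce the entity.
        results.append((record["entity"], record["label"],
                        text[start:end] == record["entity"]))
    return results


for row in check_offsets(text, entities_json):
    print(row)
```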

How to Parse JSON Output

To transform the JSON output to tabular data, use a combination of the JSON Parse, Text To Columns, and Cross Tab tools, as in this example flow:

  1. Pass the Named Entity Recognition tool output to the JSON Parse tool input.
  2. Select the entities column under JSON Field.
  3. Select Output values into single string field.
  4. Pass the JSON Parse tool output to the Text To Columns tool input.
  5. Select the JSON name column under Column to split and set Delimiters to a period (.).
  6. Select Split to columns and set Number of columns to 3.
  7. Pass the Text To Columns tool output to the Cross Tab tool input.
  8. Configure the Cross Tab tool:
    1. Group data by these values: Select the column that contains your original text data and the second split JSON name column (by default this is JSON_Name2).
    2. Change Column Headers: Select the third split JSON name column (by default this is JSON_Name3).
    3. Values for New Columns: Select JSON_ValueString.
    4. Method for Aggregating Values: Select Concatenate.
  9. Run your workflow. The Cross Tab tool output now contains the tabular form of the Named Entity Recognition tool output.
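
If you process the JSON outside Designer, the same reshaping can be sketched in Python. This sketch assumes that the flattened JSON names have three period-separated parts, such as entities.0.label, which is inferred from the three-column split in the steps above. The function names and sample data are hypothetical.

```python
import json


def flatten(value, prefix=""):
    """Yield (dotted_name, string_value) pairs, similar to JSON Parse output."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}.{index}" if prefix else str(index))
    else:
        yield prefix, str(value)


def cross_tab(pairs):
    """Split each name into 3 parts, group by the 2nd part, pivot on the 3rd."""
    rows = {}
    for name, value in pairs:
        _root, index, field = name.split(".", 2)
        rows.setdefault(index, {})[field] = value
    return [rows[index] for index in sorted(rows, key=int)]


# Hand-written sample in the assumed layout.
sample = json.loads(
    '{"entities": [{"entity": "Riesling", "label": "GRAPE"},'
    ' {"entity": "Paris", "label": "GPE"}]}'
)
table = cross_tab(flatten(sample))
print(table)
```

Each dictionary in the result corresponds to one row of the Cross Tab output, with one key per entity field.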

FAQ

How is the hierarchy determined when I use a custom entity and label list?

The algorithm prioritizes your custom list first.

Are the entities normalized? For example, can the NER tool recognize that Michael Jordan and Air Jordan are the same person?

By default, no. NER can't recognize Michael Jordan and Air Jordan as the same person out of the box. However, you can train a new model to do this by passing a custom entity and label list. Note that the NER tool is not a substitute for find and replace: the algorithm might also pick up other nicknames for Michael Jordan based on associations in the source data.

Does NER support mixed language?

No, NER supports only the language you specify. For example, if you specify English, NER looks only for English text in the source data. If your source data also contains text in another language that NER supports (for example, French), you can create a separate NER process on your canvas for the French text and join the results at the end.
