Named Entity Recognition

Version: 2022.1
Last modified: May 16, 2022

Use the Named Entity Recognition tool to identify entities, like people, places, and things, in text. The tool leverages the named entity recognition capabilities in the spaCy package. You can use our predefined set of entities or your own custom entities. 

Tool Components

The Named Entity Recognition tool has 4 anchors.

  • The D input anchor: Connect the text data you want to identify entities in.
  • The E input anchor (optional): Connect the data with the custom entities you want to identify. This data has to contain the custom entity names and the labels you want to use to train the model to identify the custom entities.
  • The D output anchor: Output new columns of data that display information about the entities in your data.
  • The M output anchor: Output the model object so you can reuse it later.

Configure the Tool

To use this tool:

  1. Drag the tool onto the canvas.
  2. Connect the D input anchor to text data with entities you want to identify.
  3. Select the Language of the text data.
  4. Select the Column with Text.
  5. Run the workflow.

Advanced Configuration

If you want to use your own entities to train the model, select Train with New Entities.

To train with new entities, provide them in data connected to the E input anchor.

Match Entities

  1. Select the Column with Entities you want to identify in your data. These entities are custom entities.
  2. Select the Column with Labels the tool can use while training the model to identify your custom entities.
  3. Check the box if you want your model to be Case Sensitive.
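
As a rough illustration of what matching custom entities against text involves, the sketch below turns entity and label pairs into labeled character spans. It is not the tool's implementation: the function name `find_entity_spans`, the `case_sensitive` flag, and the sample sentence are assumptions made for this example, and it uses plain regular expressions rather than a trained model.

```python
import re

def find_entity_spans(text, entities, case_sensitive=False):
    """Return sorted (start, end, label) spans for each custom entity found in text.

    `entities` is a list of (entity_name, label) pairs, mirroring the
    Column with Entities and Column with Labels described above.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    spans = []
    for name, label in entities:
        # Match the entity name as whole words only.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

# Hypothetical sample data for illustration.
entities = [("Michael Jordan", "Person A"), ("MJ", "Person A")]
text = "Michael Jordan retired in 2003. Fans still call him MJ."
spans = find_entity_spans(text, entities)
print(spans)
```

With case sensitivity turned on, a lowercase mention such as "mj" would no longer match the entity "MJ".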

Train Model

Epochs

An epoch is a single pass (forward and backward) of all data in a training set through a neural network. Epochs are related to iterations, but they are not the same. An iteration is a single pass of all data in one batch of a training set.
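
The relationship between epochs and iterations can be made concrete with a little arithmetic. The sketch below assumes a hypothetical training set of 1,000 examples and uses the tool's documented defaults of 10 epochs and a batch size of 32; the function names are illustrative only.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """One iteration processes one batch, so an epoch needs ceil(N / batch_size) iterations."""
    return math.ceil(num_examples / batch_size)

def total_iterations(num_examples, batch_size, epochs):
    """Total number of iterations across all epochs."""
    return iterations_per_epoch(num_examples, batch_size) * epochs

# Hypothetical training set size; batch size and epochs are the tool defaults.
print(iterations_per_epoch(1000, 32))   # 32 (the last batch holds only 8 examples)
print(total_iterations(1000, 32, 10))   # 320
```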

Increasing the number of epochs allows the model to learn from the training set for a longer time. But doing that also increases the computational expense.

You can increase the number of epochs to help reduce error in the model. But at some point, the amount of error reduction might not be worth the added computational expense. Also, increasing the number of epochs too much can cause problems of overfitting, while not using enough epochs can cause problems of underfitting.

By default, the tool uses 10 epochs.

Early Stopping

Early stopping is a method that tells an iterative machine learning method, like the convolutional neural network used in the Named Entity Recognition tool, when to stop learning. Named Entity Recognition uses F1 as the metric for early stopping.
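
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it over exact-match entity spans. Treating an entity as correct only when its start, end, and label all match is an assumption of this example, as is the sample data; the tool's internal scoring may differ.

```python
def f1_score(true_spans, predicted_spans):
    """Compute F1 over entity spans, counting only exact (start, end, label) matches."""
    true_set, pred_set = set(true_spans), set(predicted_spans)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)  # share of predictions that are correct
    recall = true_positives / len(true_set)     # share of true entities that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: 2 of 3 predictions are correct, 2 of 3 true entities found.
true_spans = [(0, 14, "PERSON"), (52, 54, "PERSON"), (60, 67, "GPE")]
predicted_spans = [(0, 14, "PERSON"), (60, 67, "GPE"), (70, 75, "ORG")]
score = f1_score(true_spans, predicted_spans)
print(round(score, 3))  # 0.667
```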

Early stopping is helpful when your model has problems of overfitting. Overfitting occurs when your model learns by memorizing the answers, rather than identifying the underlying patterns in your data. You can also use early stopping to prevent the algorithm from running through unnecessary epochs.

Use early stopping if you're concerned that your model might overfit your data or that additional epochs won't improve your model.

By default, the tool uses early stopping.
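
A minimal sketch of the early stopping idea follows. It assumes a "patience" rule: training stops once F1 has not improved for a set number of consecutive epochs. The `patience` parameter, its value, and the per-epoch F1 scores are hypothetical and are not taken from the tool.

```python
def train_with_early_stopping(f1_per_epoch, max_epochs=10, patience=2):
    """Simulate training and return (epochs_run, best_epoch, best_f1).

    `f1_per_epoch` stands in for the validation F1 measured after each epoch.
    Training stops when F1 has not improved for `patience` epochs in a row.
    """
    best_f1 = float("-inf")
    best_epoch = 0
    epochs_run = 0
    for epoch, f1 in enumerate(f1_per_epoch[:max_epochs], start=1):
        epochs_run = epoch
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop early
    return epochs_run, best_epoch, best_f1

# Made-up scores: F1 peaks at epoch 3 and then stops improving.
scores = [0.61, 0.70, 0.74, 0.73, 0.74, 0.72, 0.71, 0.70, 0.69, 0.68]
result = train_with_early_stopping(scores)
print(result)  # (5, 3, 0.74)
```

In this simulated run, training ends after 5 of the 10 allowed epochs, which avoids 5 epochs that would not have improved the model.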

Batch Size

A batch is a subset of the entire training dataset.

Decreasing the batch size allows you to stagger how much data passes through a neural network at any given time. Doing that allows you to train models without taking up as much memory as you would if passing all data through a neural network at once. Sometimes batching can speed up training. But breaking your data into batches might also increase error in the model.

Separate your data into batches when your machine is unable to process all the data at once, or if you want to reduce training time.

By default, the tool uses a batch size of 32.
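
To show what splitting a training set into batches looks like, here is a short sketch. The helper name `make_batches` and the 100 sample documents are assumptions for illustration; the default of 32 matches the tool's documented batch size.

```python
def make_batches(examples, batch_size=32):
    """Yield consecutive slices of `examples`, each with at most `batch_size` items."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

# Hypothetical training set of 100 short documents.
docs = [f"document {i}" for i in range(100)]
batch_sizes = [len(batch) for batch in make_batches(docs)]
print(batch_sizes)  # [32, 32, 32, 4]
```

Only one batch is held in memory for a training step at a time, which is why a smaller batch size lowers memory use.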

FAQ

Does the NER tool allow custom entities? For example, I’d like to add “Michael Jordan” as an entity labeled “Person A.”

Yes, you can pass a custom entity and label list through the E input anchor. To do so, the source data must contain at least 20 instances of each custom entity.
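
Before training, it can help to check whether each custom entity appears often enough in your source text. The sketch below counts whole-word mentions and lists the entities that fall short of 20. The function names and sample data are hypothetical, and the tool may count instances differently.

```python
import re
from collections import Counter

MIN_INSTANCES = 20  # minimum instances per custom entity, as stated above

def count_entity_mentions(texts, entity_names, case_sensitive=False):
    """Count whole-word mentions of each entity name across all texts."""
    flags = 0 if case_sensitive else re.IGNORECASE
    counts = Counter({name: 0 for name in entity_names})
    for name in entity_names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", flags)
        for text in texts:
            counts[name] += len(pattern.findall(text))
    return counts

def underrepresented(counts, minimum=MIN_INSTANCES):
    """Return the entity names that appear fewer than `minimum` times."""
    return sorted(name for name, n in counts.items() if n < minimum)

# Hypothetical source data: 20 mentions of "MJ" but only 5 of "Michael Jordan".
texts = ["MJ scored again."] * 20 + ["Michael Jordan retired."] * 5
counts = count_entity_mentions(texts, ["MJ", "Michael Jordan"])
print(underrepresented(counts))  # ['Michael Jordan']
```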

Does the input data require a specific format?

The input data should include 1 column for the entity and 1 column for the label. Example:

Entity            Label
Michael Jordan    Person A
Air Jordan        Person A
His Airness       Person A
MJ                Person A
Money             Person A

How is the hierarchy determined when I use a custom entity and label list?

The algorithm prioritizes your custom list.

Are the entities normalized? For example, can the NER tool recognize that Michael Jordan and Air Jordan are the same person?

By default, no. NER can't recognize Michael Jordan and Air Jordan as the same person out of the box. However, you can train a new model to do this by passing a custom entity and label list. Note that the NER tool is not a substitute for find and replace: the algorithm might pick up other nicknames for Michael Jordan based on associations in the source data.

Which languages does NER support?

NER is available in the following languages: English, French, German, Italian, Portuguese, and Spanish. At this time, NER is not available in Chinese or Japanese.

Does NER support mixed language?

No, NER supports only the language you specify. For example, if you specify English, NER looks only for English text in the source data. If your source data contains text in another language that NER supports (for example, French), you can add a second NER process to your canvas for the French text and join the results at the end.
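
The split-and-join approach for mixed-language data can be sketched in code. This example assumes each row already carries a language tag, and the recognizer functions are simple stand-ins for separate per-language NER processes. All names here are hypothetical.

```python
from collections import defaultdict

def run_per_language(rows, recognizers):
    """Route each row to the recognizer for its language, then join results by row id.

    `rows` is a list of dicts with 'id', 'language', and 'text' keys.
    `recognizers` maps a language name to a callable that returns a list of entities.
    """
    by_language = defaultdict(list)
    for row in rows:
        by_language[row["language"]].append(row)

    results = []
    for language, group in by_language.items():
        recognize = recognizers[language]
        for row in group:
            results.append({
                "id": row["id"],
                "language": language,
                "entities": recognize(row["text"]),
            })
    # Restore the original row order after processing each language separately.
    return sorted(results, key=lambda r: r["id"])

def capitalized_words(text):
    """Stand-in recognizer: treats capitalized words as entities (not real NER)."""
    return [word.strip(".,") for word in text.split() if word[0].isupper()]

rows = [
    {"id": 1, "language": "English", "text": "she moved to Paris."},
    {"id": 2, "language": "French", "text": "il habite à Lyon."},
    {"id": 3, "language": "English", "text": "he works in London."},
]
recognizers = {"English": capitalized_words, "French": capitalized_words}
joined = run_per_language(rows, recognizers)
print([r["entities"] for r in joined])  # [['Paris'], ['Lyon'], ['London']]
```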
