Named Entity Recognition
Use the Named Entity Recognition tool to identify entities, like people, places, and things, in text. The tool leverages the named entity recognition capabilities in the spaCy package. You can use our predefined set of entities or your own custom entities.
Tool Components
The Named Entity Recognition tool has 4 anchors.
- The D input anchor: Connect the text data you want to identify entities in.
- The E input anchor (optional): Connect the data with the custom entities you want to identify. This data has to contain the custom entity names and the labels you want to use to train the model to identify the custom entities.
- The D output anchor: Output new columns of data that display information about the entities in your data.
- The M output anchor: Output the model object so you can reuse it later.
Configure the Tool
To use this tool...
- Drag the tool onto the canvas.
- Connect the D input anchor to text data with entities you want to identify.
- Select the Language of the text data.
- Select the Column with Text.
- Run the workflow.
Advanced Configuration
If you want to use your own entities to train the model, select Train with New Entities.
To train with new entities, provide them in data connected to the E input anchor.
Match Entities
- Select the Column with Entities you want to identify in your data. These entities are custom entities.
- Select the Column with Labels the tool can use while training the model to identify your custom entities.
- Check the box if you want your model to be Case Sensitive.
Train Model
An epoch is a single pass (forward and backward) of all data in a training set through a neural network. Epochs are related to iterations, but they are not the same: an iteration is a single pass of the data in one batch of the training set, so one epoch is made up of as many iterations as there are batches.
Increasing the number of epochs allows the model to learn from the training set for a longer time. But doing that also increases the computational expense.
You can increase the number of epochs to help reduce error in the model. But at some point, the amount of error reduction might not be worth the added computational expense. Also, increasing the number of epochs too much can cause problems of overfitting, while not using enough epochs can cause problems of underfitting.
By default, the tool uses 10 epochs.
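To see how epochs, iterations, and batch size relate, the following minimal Python sketch works through the arithmetic. The function names are hypothetical and are not part of the tool. The example assumes a training set of 1,000 rows, a batch size of 32, and 10 epochs.

```python
import math


def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to pass every training example through once."""
    if num_examples < 0 or batch_size <= 0:
        raise ValueError("num_examples must be >= 0 and batch_size must be > 0")
    # The last batch may be smaller than batch_size, so round up.
    return math.ceil(num_examples / batch_size)


def total_iterations(num_examples: int, batch_size: int, epochs: int) -> int:
    """Total number of batch passes over the whole training run."""
    return iterations_per_epoch(num_examples, batch_size) * epochs


# Assumed example: 1,000 training rows, batch size 32, 10 epochs.
per_epoch = iterations_per_epoch(1000, 32)  # 31 full batches + 1 batch of 8 rows = 32
total = total_iterations(1000, 32, 10)      # 32 iterations x 10 epochs = 320
print(per_epoch, total)
```

Doubling the number of epochs doubles the total number of iterations, which is why more epochs mean more computational expense.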
Early stopping tells an iterative machine learning algorithm, like the convolutional neural network used in the Named Entity Recognition tool, when to stop learning. Named Entity Recognition uses F1 as the metric for early stopping.
Early stopping is helpful when your model has problems of overfitting. Overfitting occurs when your model learns by memorizing the answers, rather than identifying the underlying patterns in your data. You can also use early stopping to prevent the algorithm from running through unnecessary epochs.
Use early stopping if you're concerned that your model might overfit your data or that additional epochs won't improve your model.
By default, the tool uses early stopping.
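The idea behind early stopping on F1 can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the function names, the `patience` parameter (how many epochs without improvement to tolerate), and the validation scores are all assumptions made for the example.

```python
from typing import Iterable, Tuple


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 0.0 if denominator == 0 else 2 * true_positives / denominator


def early_stopping(f1_per_epoch: Iterable[float], patience: int = 2) -> Tuple[int, int, float]:
    """Stop once F1 has not improved for `patience` consecutive epochs.

    Returns (epochs_run, best_epoch, best_f1). `patience` is a hypothetical
    setting chosen for this sketch.
    """
    best_f1 = float("-inf")
    best_epoch = 0
    epochs_run = 0
    for epoch, f1 in enumerate(f1_per_epoch, start=1):
        epochs_run = epoch
        if f1 > best_f1:
            # New best score: remember it and keep training.
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            break
    return epochs_run, best_epoch, best_f1


# Made-up validation F1 scores for 10 epochs.
validation_f1 = [0.60, 0.72, 0.78, 0.77, 0.78, 0.76, 0.75, 0.74, 0.74, 0.73]
result = early_stopping(validation_f1, patience=2)
print(result)  # (5, 3, 0.78): training stops after epoch 5, best model is from epoch 3
```

In this made-up run, early stopping skips the last 5 epochs, where the score only gets worse.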
A batch is a subset of the entire training dataset.
Decreasing the batch size allows you to stagger how much data passes through a neural network at any given time. Doing that allows you to train models without taking up as much memory as you would if passing all data through a neural network at once. Sometimes batching can speed up training. But breaking your data into batches might also increase error in the model.
Separate your data into batches when your machine is unable to process all the data at once, or if you want to reduce training time.
By default, the tool uses a batch size of 32.
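As a rough illustration of batching, the sketch below splits a list of rows into batches of 32 so that only one batch needs to be processed at a time. The `make_batches` helper and the sample data are hypothetical and only demonstrate the concept.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(rows: Sequence[T], batch_size: int = 32) -> Iterator[List[T]]:
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be greater than 0")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# 70 made-up text rows split with a batch size of 32.
texts = [f"document {i}" for i in range(70)]
batch_lengths = [len(batch) for batch in make_batches(texts)]
print(batch_lengths)  # [32, 32, 6]
```

Because the helper is a generator, each batch is created only when it is requested, which is the memory saving that batching is meant to provide.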
FAQ
Yes, you can pass a custom entity and label list through the E input anchor. To do so, the source data must contain at least 20 instances of each custom entity.
The input data should include 1 column for the entity and 1 column for the label. Example:
Entity | Label |
Michael Jordan | Person A |
Air Jordan | Person A |
His Airness | Person A |
MJ | Person A |
Money | Person A |
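Before you connect a custom list, you might want to check that each custom entity appears at least 20 times in your source text. The Python sketch below is one way to do that outside the tool. It assumes that an "instance" is a whole-phrase match in the text, which might differ from how the tool counts. The function names and sample data are hypothetical.

```python
import re
from typing import Dict, Iterable, List

MIN_INSTANCES = 20  # minimum number of instances per custom entity


def count_entity_mentions(
    texts: Iterable[str], entities: Iterable[str], case_sensitive: bool = False
) -> Dict[str, int]:
    """Count whole-phrase occurrences of each entity across all texts."""
    texts = list(texts)
    flags = 0 if case_sensitive else re.IGNORECASE
    counts: Dict[str, int] = {}
    for entity in entities:
        # Lookarounds keep "MJ" from matching inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(entity) + r"(?!\w)", flags)
        counts[entity] = sum(len(pattern.findall(text)) for text in texts)
    return counts


def entities_below_minimum(counts: Dict[str, int], minimum: int = MIN_INSTANCES) -> List[str]:
    """Return the entities that appear fewer than `minimum` times."""
    return sorted(entity for entity, n in counts.items() if n < minimum)


# Made-up source text and a custom list shaped like the example table.
source_texts = ["Michael Jordan scored 40 points. MJ was unstoppable."] * 20
source_texts.append("Fans still call him mj.")
custom_list = [("Michael Jordan", "Person A"), ("MJ", "Person A"), ("His Airness", "Person A")]
entity_names = [entity for entity, _label in custom_list]

counts = count_entity_mentions(source_texts, entity_names)
print(counts)                          # {'Michael Jordan': 20, 'MJ': 21, 'His Airness': 0}
print(entities_below_minimum(counts))  # ['His Airness']
```

Set `case_sensitive=True` in the sketch to count only exact-case matches, similar in spirit to the Case Sensitive option.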
The algorithm prioritizes your custom list first.
No. Out of the box, NER can't recognize Michael Jordan and Air Jordan as the same person. However, you can train a new model to do this by passing a custom entity and label list. Note that the NER tool is not a substitute for find and replace: the algorithm might pick up other nicknames for Michael Jordan based on associations in the source data.
NER is available in the following languages: English, French, German, Italian, Portuguese, and Spanish. At this time, NER is not available in Chinese or Japanese.
No, NER supports only the language you specify. For example, if you specify English, NER only looks for English text in the source data. If your source data contains text in another language that NER supports (for example, French), you can add another NER process to your canvas for the French text and join the results at the end.
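The split-and-join pattern for mixed-language data can be sketched in Python as follows. This is a conceptual illustration, not how the tool works internally. It assumes that each row already carries a language code, and the two recognizer functions are trivial stand-ins for separate per-language NER processes.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str, str]  # (row_id, language_code, text)
Recognizer = Callable[[str], List[str]]


def run_per_language(rows: List[Row], recognizers: Dict[str, Recognizer]) -> Dict[int, List[str]]:
    """Route each row to the recognizer for its language, then join by row_id."""
    by_language: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for row_id, language, text in rows:
        by_language[language].append((row_id, text))

    joined: Dict[int, List[str]] = {}
    for language, items in by_language.items():
        recognize = recognizers.get(language)
        for row_id, text in items:
            # Rows in a language without a recognizer get no entities.
            joined[row_id] = recognize(text) if recognize else []
    return dict(sorted(joined.items()))


# Stand-in recognizers (hypothetical): they only look up a few known names.
def english_stub(text: str) -> List[str]:
    return [word for word in text.split() if word in {"Paris", "Bob"}]


def french_stub(text: str) -> List[str]:
    return [word for word in text.split() if word in {"Paris"}]


rows = [(1, "en", "Paris is nice"), (2, "fr", "Paris est belle"), (3, "en", "Hello Bob")]
joined = run_per_language(rows, {"en": english_stub, "fr": french_stub})
print(joined)  # {1: ['Paris'], 2: ['Paris'], 3: ['Bob']}
```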