Use the Part-of-Speech Tagger tool to identify parts of speech like nouns, verbs, and adjectives from text. Part-of-speech tagging is a common processing step to cleanse, prepare, and enhance data for Natural Language Processing applications. The Part-of-Speech Tagger tool leverages the part-of-speech capabilities in the spaCy package. Part-of-speech tagging accuracy for English is about 97%, and varies slightly for the other supported languages.
This tool is part of Alteryx Intelligence Suite. Intelligence Suite requires a separate license and add-on installer to Designer. After you install Designer, install Intelligence Suite and start your free trial.
The Part-of-Speech Tagger tool supports English, French, German, Italian, Portuguese, and Spanish. The part-of-speech output tags are only available in English.
The Part-of-Speech Tagger tool has 2 anchors:
- Input anchor: Use the input anchor to connect the text data you want to analyze.
- Output anchor: Use the output anchor to pass the tagged text data downstream.
Configure the Tool
- Add a Part-of-Speech Tagger tool to the canvas.
- Use the anchors to connect the Part-of-Speech Tagger tool to the text data you want to use in the workflow.
- Select the Language of the text data.
- Select the Column with Text you want to analyze.
- Run the workflow.
The Part-of-Speech Tagger tool outputs the incoming columns in addition to 2 columns:
- part_of_speech_tags: This column contains a JSON output with a list of part-of-speech tags and descriptions. Each token (word) in a corpus (where each row in the input text column contains a corpus) contains the values listed below within the JSON output.
- text: The tagged word.
- part_of_speech: The coarse-grained part of speech tag.
- part_of_speech_description: The coarse-grained part of speech tag description.
- fine_grained_tag: The fine-grained part of speech tag.
- fine_grained_tag_description: The fine-grained part of speech tag description.
- dependency: The part of speech dependency.
- dependency_description: The part of speech dependency description.
- character_index: The index of the 1st character of the word in the corpus. The index starts at 0.
- word_index: The index of the word in the corpus. The index starts at 0.
- text_length: The length of the word.
- dependency_diagram: This column contains an HTML object of the displaCy tagger dependency diagram that is viewable via the Browse tool.
How to Parse JSON Output
- Pass the Part-of-Speech Tagger tool output to the JSON Parse tool input.
- Select the part-of-speech column under JSON Field.
- Select Output values into single string field.
- Pass the JSON Parse tool output to the Text To Columns input.
- Select the JSON name column under Column to split and set Delimiters to a period (.).
- Select Split to columns and set Number of columns to 3.
- Pass the Text to Columns tool output to the Cross Tab tool input.
- Cross Tab tool configuration:
- Group data by these values: Select the column name containing your original text data and the second split JSON name column (by default this is JSON_Name2).
- Change Column Headers: Select the third split JSON name column (by default this is JSON_Name3).
- Values for New Columns: Select the JSON_ValueString.
- Method for Aggregating Values: Select Concatenate.
- Run your workflow. The Cross Tab tool output now contains the tabular form of the Part-of-Speech Tagger tool output.
Below is a sample dependency diagram for the sentence, "This is a sentence." The coarse-grained part-of-speech tag populates below each word. The description for the coarse-grained part-of-speech tag is in the JSON output under "part_of_speech_description." Each arrow indicates the syntactic dependency between two words. The description for each dependency is in the JSON output under "dependency_description."
Coarse-grained part-of-speech tag descriptions for the dependency diagram above:
- AUX: Auxiliary
- DET: Determiner
- NOUN: Noun
Dependency descriptions for the dependency diagram above:
- nsubj: Nominal Subject
- attr: Attribute
- det: Determiner
The diagram is a visual to help the user see the part-of-speech tags. The diagram also depicts how words are associated. At this stage, the dependencies are only part of the visual and not included in the output.
At this time, the Part-of-Speech Tagger doesn’t work with the Reporting tools. For example, you can’t save the dependency diagram as an image.
The model is cached on the first run and therefore the first run will be slower. For the same text, the workflows will be faster on subsequent runs. Note, the cache does expire and it's possible that the cycle may start over again.