PDF to Text
Use the PDF to Text tool to extract text from your PDF files. PDF files might contain a mix of text characters and images of text. Images of text require optical character recognition (OCR) to extract the text characters. The PDF to Text tool can extract text characters directly from PDF files. The tool can also apply OCR to extract text from images that contain text. For scanned documents that are images (for example, JPG, PNG, and BMP files), use the Image to Text tool.
If you select Read Text Content Only, the PDF to Text tool doesn't have a language restriction.
If you select Read Text and Image Content or Risk Score for Text Encoded as Graphics, the tool supports Arabic, English, French, German, Italian, Japanese, Portuguese, Simplified Chinese, and Spanish.
The PDF to Text tool has 3 anchors (2 inputs and 1 output):
- D input anchor: (Optional) Use the D input anchor to connect a list of PDF file paths or a list of directories that contain PDF files. There are multiple ways to connect your list of file paths or directories:
- T input anchor: (Optional) Use the T input anchor to connect annotations from the Image Template tool. Identify regions for text extraction with string and table annotations. Crop images for downstream processing with image annotations.
- Output anchor: Use the output anchor to pass the extracted text data downstream.
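As one illustration of preparing input for the D input anchor, the following Python sketch collects PDF file paths from a folder and writes them to a single-column CSV file that an upstream tool could read. The function names, the FilePath column name, and the CSV hand-off are assumptions made for this example. They are not part of the PDF to Text tool.

```python
import csv
from pathlib import Path


def list_pdf_paths(folder, recursive=False):
    """Return sorted absolute paths of the PDF files found in `folder`."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        str(path.resolve())
        for path in Path(folder).glob(pattern)
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def write_path_list(paths, csv_path, column="FilePath"):
    """Write the paths to a one-column CSV. The column name is hypothetical."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        writer.writerows([p] for p in paths)
```

A list built this way contains one file path per record, which matches the kind of column you select when the D input anchor is connected.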
Configure the Tool
- Add a PDF to Text tool to the canvas.
- (Optional) Use the D input anchor to pass a list of PDF file paths or a list of directories that contain PDF files to the PDF to Text tool.
- (Optional) Use the T input anchor to pass annotations from the Image Template tool.
- If you've connected the D input anchor, select the column that contains the file paths.
- If you haven't connected the D input anchor, enter the PDF file path. To read all PDFs in a folder, edit the path to point to that folder instead.
- Select one of the Text Extraction Options based on the content contained in the PDF file.
- Select your Output Options.
- Run the workflow.
The PDF to Text tool doesn't support page selection. To select specific pages, filter the output with a Filter tool.
Text Extraction Options
Read Text and Image Content
PDF files might contain a mix of text characters and images of text. Images of text require optical character recognition (OCR) to extract the text characters. For files with images of text, use Read Text and Image Content to directly read text characters and apply OCR to the images of text. The addition of OCR provides complete coverage of all text in your file.
Read Text Content Only
Read text characters directly from your PDF file. Direct extraction of text characters is up to 10x faster than OCR and is generally more accurate.
Use Risk Score for Text Encoded as Graphics to provide guidance on whether OCR is necessary to extract all the text on the page. This option is up to 2x faster than OCR. Use Output Image of Page Graphics to include an image of the page graphics in the tool output.
If a page risk score is medium or high, use the Image tool to examine the graphics content of the page. If the PDF to Text tool missed important text in the graphics, then run the page again with the Read Text and Image Content option.
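The triage described above can be sketched in Python. The record layout (dictionaries with file, page, and risk_score keys) and the score labels are assumptions for illustration. Check the actual column names and values in your tool output.

```python
def pages_needing_ocr(records, flagged=("medium", "high")):
    """Return (file, page) pairs whose risk score suggests text may sit in graphics."""
    return [
        (record["file"], record["page"])
        for record in records
        if str(record["risk_score"]).strip().lower() in flagged
    ]


# Hypothetical sample records; field names and values are not taken from the tool.
sample = [
    {"file": "report.pdf", "page": 1, "risk_score": "Low"},
    {"file": "report.pdf", "page": 2, "risk_score": "Medium"},
    {"file": "report.pdf", "page": 3, "risk_score": "High"},
]

for file_name, page in pages_needing_ocr(sample):
    print(f"Review graphics and consider OCR: {file_name}, page {page}")
```

The pages returned are candidates for a visual check first. Only the ones where text is actually missing need a second run with the Read Text and Image Content option.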
Output Options
- String: One record per page. Single string for all the text on the page. Includes line return characters.
- Lines: One record per line of text. Single string for the line of text.
- Pipe-delimited Table: One record per page. Pipe-delimited table for all the text on the page.
- Alteryx Table: One record per line of text. Columns include subdivided text based on horizontal spatial overlap within the text.
If you select more than one format, the output includes each format in separate rows.
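For processing outside of a workflow, a Pipe-delimited Table string can be split into rows and cells. This Python sketch assumes each table row sits on its own line and cells are separated by the | character. The exact layout the tool emits may differ, and the function name is hypothetical.

```python
def parse_pipe_table(page_text):
    """Split a pipe-delimited page string into a list of rows of trimmed cells."""
    rows = []
    for line in page_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows
        rows.append([cell.strip() for cell in line.split("|")])
    return rows


# Hypothetical page content, used only to show the shape of the result.
example_page = "Item|Qty\nBolt| 4\n\nNut|10"
table = parse_pipe_table(example_page)
```

Each inner list holds the cells of one row, so rows with different cell counts are kept as they are and not padded.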
T Input Anchor (Optional)
The PDF to Text tool output changes when you use the T input anchor.
- An additional output column identifies the markup region for each record.
- String and table regions are output in all the output formats you select.
- The PDF to Text tool crops image regions and outputs them as image Blob files. View the image Blob files with the Image tool.