2-6 Extract Text from PDFs and Office Documents

This example uses the ExtractTextProcessor, which is not included with NiFi but was developed by Hortonworks. It uses Apache Tika to extract text from a wide variety of document formats.

The processor can output either HTML (XHTML) or plain text. I recommend the HTML option because it also converts the text to UTF-8. In my testing, the text option produced output in a mix of character sets, UTF-8 and Windows-1252, and the Windows-1252 output failed on ingest.
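To illustrate why mixed character sets can break an ingest step, here is a minimal Python sketch. It is not part of the processor or of NiFi. It assumes the downstream system expects UTF-8 and that any non-UTF-8 input is Windows-1252; the function name `to_utf8` and the sample string are made up for the example.

```python
# Sample bytes as a Windows-1252 producer would write them (curly quotes, accented e).
sample_text = "“Café menu”"
cp1252_bytes = sample_text.encode("cp1252")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes as UTF-8.

    Assumption: the input is either valid UTF-8 or Windows-1252.
    Bytes that Windows-1252 leaves undefined (for example 0x81) will
    still raise UnicodeDecodeError in the fallback branch.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("cp1252")
    return text.encode("utf-8")


# The Windows-1252 bytes are not valid UTF-8, which is the kind of
# input a strict UTF-8 ingest would reject.
print(is_valid_utf8(cp1252_bytes))
# After normalization the same content is valid UTF-8.
print(is_valid_utf8(to_utf8(cp1252_bytes)))
```

Choosing the HTML output avoids the need for a fallback like this, since in my testing that output was already UTF-8.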

For the purposes of this test, I downloaded the pre-built NAR file from here. Drop the NAR file into the NiFi lib directory and restart NiFi. All dependencies, including Tika, are bundled in the NAR.
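If you prefer to script the copy step, something like the following Python sketch would do it. The helper name `install_nar` and any paths you pass to it are hypothetical, and the only assumption about the layout is that the NiFi install directory contains a `lib` subdirectory. It does not restart NiFi; do that afterwards the way you normally would.

```python
import shutil
from pathlib import Path


def install_nar(nar_file: Path, nifi_home: Path) -> Path:
    """Copy a NAR file into <nifi_home>/lib and return the new path.

    Hypothetical helper: nifi_home is wherever NiFi is installed on
    your machine, and nar_file is the NAR you downloaded.
    """
    lib_dir = nifi_home / "lib"
    if not lib_dir.is_dir():
        raise FileNotFoundError(f"no lib directory under {nifi_home}")
    if nar_file.suffix != ".nar":
        raise ValueError(f"{nar_file.name} does not look like a NAR file")
    # copy2 keeps the file's timestamps; the original download is left in place.
    return Path(shutil.copy2(nar_file, lib_dir))
```

The checks are only there to fail early if the NiFi path is wrong or the wrong file was picked up.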