
3-3 Invoke HTML Tidy on HTML Content

When you need to extract information from HTML content that is not well-formed XHTML, it is common to use HTML Tidy to 'tidy' the HTML and transform it into well-formed XML. If your NiFi data flow has to process HTML(5) and you need the flexibility of XPath and/or XSLT to extract what you need, then, unfortunately, NiFi does not currently come with an HTML Tidy processor. More's the pity, but you can work around this by using the ExecuteStreamCommand processor to invoke HTML Tidy as you would from the command line. That said, this isn't as straightforward as you might imagine.

The main problem is that the ExecuteStreamCommand processor does not appear to pass arguments properly to the command being executed; in this case, Tidy doesn't understand them. The simplest way to solve this is to create a shell script that contains the command arguments required to convert HTML5 into XHTML. The attached shell script sets the output format to XHTML and also declares the HTML5 elements that Tidy does not understand.

Download Shell Script
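
For readers without the attachment, a wrapper along the following lines illustrates the idea. It is a minimal sketch and not the attached script itself: the file name tidy-xhtml.sh, the TIDY_BIN variable, the html_to_xhtml function and the tag lists are all assumptions made for illustration, and it presumes an older HTML Tidy build that needs HTML5 elements declared explicitly.

```shell
#!/bin/sh
# Hypothetical stand-in for the attached script (assumed name: tidy-xhtml.sh).
# Assumption: an HTML Tidy binary named "tidy" is on the PATH.
TIDY_BIN="${TIDY_BIN:-tidy}"

# Illustrative, not exhaustive, lists of HTML5 elements to declare.
BLOCK_TAGS="article,aside,figure,figcaption,footer,header,main,nav,section"
INLINE_TAGS="mark,time"

# Read HTML on STDIN and write XHTML to STDOUT.
html_to_xhtml() {
    "$TIDY_BIN" -quiet -asxhtml -numeric \
        --show-warnings no \
        --force-output yes \
        --new-blocklevel-tags "$BLOCK_TAGS" \
        --new-inline-tags "$INLINE_TAGS"
    status=$?
    # Tidy exits with 1 for warnings and 2 for errors; only errors fail here.
    [ "$status" -le 1 ]
}

# Run only when executed under the assumed file name, so the function can
# also be sourced and exercised on its own.
if [ "${0##*/}" = "tidy-xhtml.sh" ]; then
    html_to_xhtml
fi
```

Outside NiFi, a script like this can be checked with something like `./tidy-xhtml.sh < page.html > page.xhtml`, where page.html is a placeholder. Treating Tidy's warning exit status as success is a design choice of this sketch, so that a page with minor markup problems is still handled as a successful run.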

The ExecuteStreamCommand processor directs the content of the FlowFile to STDIN and takes its resulting content from STDOUT. When HTML Tidy is invoked without a source or destination filename, it accepts input from STDIN and sends its result to STDOUT.

Download Template

  • GetHTTP - starts a flow on a timer
    • Scheduling
      • Run Schedule: 1 day (the processor will activate, and run only once, each time it is started).
    • Properties
      • URL: source URL for the HTML page.
      • Filename: filename for the HTML page being retrieved.
  • ExecuteStreamCommand ("Invoke HTML Tidy")
    • Properties
      • Command Path: file path to the attached shell script.
      • As stated above, command arguments are not accepted correctly by Tidy, so we use a shell script to invoke the command with the desired arguments.
  • UpdateAttribute ("Update filename extension to .xhtml")
    • Properties
      • filename: ${filename:substringBefore('.')}.xhtml
      • Just a simple bit of filename munging so that the resulting file has the correct file extension (.xhtml).
  • PutFile - writes the resulting FlowFile's content to the file system.
    • Properties
      • Directory: path to the destination directory for the output result.
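
The filename expression in the UpdateAttribute step keeps only the text before the first '.' and then appends .xhtml. A rough shell analogue, offered as a sketch for trying out the behaviour outside NiFi, is shown here; to_xhtml_name and the sample filenames are hypothetical.

```shell
#!/bin/sh
# Shell analogue of the NiFi expression ${filename:substringBefore('.')}.xhtml
to_xhtml_name() {
    # "${1%%.*}" strips the longest suffix that begins with '.', which leaves
    # the text before the first dot (or the whole name if there is no dot).
    printf '%s.xhtml\n' "${1%%.*}"
}

to_xhtml_name "index.html"        # prints index.xhtml
to_xhtml_name "report.2015.html"  # prints report.xhtml
to_xhtml_name "README"            # prints README.xhtml
```

Note that a name containing more than one dot loses everything after the first one, which is fine for simple page names but may matter if the source filenames carry dates or version numbers.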