
2-3-1 Augment XML content with data from a Web Service

Thanks to Philip Fennell for contributing this recipe

In the following example we take a location name (place name) from an XHTML fragment and send a request to GeoNames to look up that location and return its latitude and longitude. These values are then included in the result document along with the original information from the XHTML fragment.

To set the context while keeping the example concise, it has been taken from a larger NiFi Flow. That flow takes a tabular list of sporting events from an HTML web page, extracts the parent table and then the table's rows, and attempts to look up each event's location in GeoNames. The final XML document takes other information about the event from each table row and merges in the location (lat/long) data, producing a set of individual event documents for ingestion into MarkLogic. At first glance this seems a simple task, but unless you understand how NiFi handles augmenting source data with information from external services, it is less obvious than you might expect.

The key takeaway from this example is that you can direct the result of some processor steps into a Flow File attribute instead of letting it replace the Flow File's content.
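To make that distinction concrete, here is a minimal Python model of the idea. It is not NiFi code: `FlowFile`, `replace_content` and `put_in_attribute` are hypothetical names invented for this illustration, and the payloads are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy stand-in for a NiFi Flow File: a payload plus named attributes."""
    content: str
    attributes: dict = field(default_factory=dict)


def replace_content(flow_file: FlowFile, result: str) -> FlowFile:
    # The result of the step overwrites the payload; attributes are kept.
    return FlowFile(content=result, attributes=dict(flow_file.attributes))


def put_in_attribute(flow_file: FlowFile, name: str, result: str) -> FlowFile:
    # The payload is kept; the result is stored under a named attribute.
    attributes = dict(flow_file.attributes)
    attributes[name] = result
    return FlowFile(content=flow_file.content, attributes=attributes)


source = FlowFile(content="<tr><td>Bristol, United Kingdom</td></tr>")
response = '{"geonames": []}'  # placeholder for a web service reply

overwritten = replace_content(source, response)
augmented = put_in_attribute(source, "GeoNamesResponse", response)

print(overwritten.content == source.content)  # False: the table row is gone
print(augmented.content == source.content)    # True: the table row survives
```

With the second approach, later steps can read both the original row and the service response from the same Flow File.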

To simplify the example further, a single XHTML table row fragment (attached, and shown below) is retrieved from the file system.

                                    
<?xml version="1.0" encoding="UTF-8"?>
<tr title="Organizer:Tom Judges:Tom, Dick ,Harry, Medic:Dr. Mop" xmlns="http://www.w3.org/1999/xhtml">
  <td>2018-03-24</td>
  <td>
    <a href="showevent.php?idCompetitions=2019">Bristol Blue 2018</a>
  </td>
  <td>Bristol, United Kingdom</td>
  <td>Pool Competition</td>
  <td>&#160;DYN</td>
  <td>&#160;DNF</td>
  <td>&#160;STA</td>
  <td/>
  <td/>
  <td/>
</tr>
                                    
          
  • Download Template
  • Processors:
    • GetFile – starts a flow on a timer
      • Scheduling
        • Run Schedule: 1 day - (will activate, and run only once, each time the processor is started).
      • Properties
    • EvaluateXPath ("Extract Location Name from XHTML")
      • Properties
        • Destination: flowfile-attribute
      • Custom attributes:
        • EventID: substring-after(/*:tr/*:td[2]/*:a/@href, 'idCompetitions=')
        • URLEncodedLocation: encode-for-uri(replace(/*:tr/*:td[3], '\n', ' '))
    • InvokeHTTP ("Search GeoNames for Location")
      • Properties
        • HTTP Method: GET
        • Remote URL: http://api.geonames.org/search?username=geonames-username&style=short&type=json&maxRows=1&q=${URLEncodedLocation}
        • * Put Response Body in Attribute: GeoNamesResponse
        • Please note that you will need to sign up for a GeoNames account/username in order to use this service effectively.
        • * => This is the key property in this processing step: it sends the Web Service's response into the named custom attribute, so you have access to both the response and the source XHTML table row fragment that you will need later. If it is not set, this step returns only the Web Service response and the original event information is no longer available downstream.
    • UpdateAttribute ("Add Lat/Long as Attributes")
      • Properties
        • filename: ${EventID}.xml
      • Custom properties:
        • Latitude: ${GeoNamesResponse:jsonPath('$.geonames[0].lat')}
        • Longitude: ${GeoNamesResponse:jsonPath('$.geonames[0].lng')}
        • Here we use the NiFi Expression Language first to update the filename attribute that originates from the GetFile processor; this gives the resulting document a unique name.
        • Secondly, we get the lat/long information from the GeoNamesResponse Flow File attribute by evaluating a JSONPath expression against the Web Service's response document and storing the results in the Latitude and Longitude attributes.
    • TransformXML ("Create XML Document With All Data")
      • Properties
        • XSLT file name: path to the attached XSLT transform.
      • Custom Properties:
        • LATITUDE: ${Latitude}
        • LONGITUDE: ${Longitude}
        • These custom properties are important to the transform: they bind the Flow File attributes to the XSLT transform's external parameters, so that the lat/long values can be accessed from within the transform.
    • PutFile – writes the resulting Flow File's content to the file system.
      • Properties
        • Directory: path to the destination directory for the output result.
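As a rough way to check the expressions used in the EvaluateXPath, InvokeHTTP and UpdateAttribute steps outside NiFi, the same logic can be approximated in Python. This is a sketch under assumptions: the function names are hypothetical, the row is trimmed to the cells the flow reads, no HTTP call is made, and `SAMPLE_RESPONSE` is an illustrative payload shaped to match the JSONPath expressions rather than a captured GeoNames reply.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import quote

XHTML = "{http://www.w3.org/1999/xhtml}"

# Trimmed copy of the table row: only the cells the flow reads.
ROW = """<tr xmlns="http://www.w3.org/1999/xhtml">
  <td>2018-03-24</td>
  <td>
    <a href="showevent.php?idCompetitions=2019">Bristol Blue 2018</a>
  </td>
  <td>Bristol, United Kingdom</td>
</tr>"""

# Illustrative response (assumption), not a captured GeoNames reply.
SAMPLE_RESPONSE = (
    '{"geonames": [{"lat": "51.45523", "lng": "-2.59665", "geonameId": 2654675}]}'
)


def extract_attributes(row_xml: str) -> dict:
    """Approximate the two EvaluateXPath expressions."""
    cells = ET.fromstring(row_xml).findall(f"{XHTML}td")
    href = cells[1].find(f"{XHTML}a").get("href")
    location = (cells[2].text or "").replace("\n", " ")
    return {
        # substring-after(@href, 'idCompetitions=')
        "EventID": href.partition("idCompetitions=")[2],
        # encode-for-uri(...): percent-encode all but unreserved characters
        "URLEncodedLocation": quote(location, safe=""),
    }


def search_url(attributes: dict, username: str = "geonames-username") -> str:
    """Expand ${URLEncodedLocation} into the Remote URL."""
    return (
        "http://api.geonames.org/search"
        f"?username={username}&style=short&type=json&maxRows=1"
        f"&q={attributes['URLEncodedLocation']}"
    )


def add_lat_long(attributes: dict) -> dict:
    """Approximate UpdateAttribute: new filename plus $.geonames[0].lat/.lng."""
    first = json.loads(attributes["GeoNamesResponse"])["geonames"][0]
    return {
        **attributes,
        "filename": f"{attributes['EventID']}.xml",
        "Latitude": first["lat"],
        "Longitude": first["lng"],
    }


attrs = extract_attributes(ROW)
print(search_url(attrs))
attrs["GeoNamesResponse"] = SAMPLE_RESPONSE  # stands in for the live HTTP call
attrs = add_lat_long(attrs)
print(attrs["filename"], attrs["Latitude"], attrs["Longitude"])
```

The sketch stops at the attributes. The merged XML shown next is produced by the TransformXML step in the flow, not by this code.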
              
<?xml version="1.0" encoding="UTF-8"?>
<event id="2019" date="2018-03-24" type="Pool Competition">
   <source href="http://www.apnearanking.se/showevent.php?idCompetitions=2019"/>
   <name>Bristol Blue 2018 - incorporating the UK BFA National Pool Championships</name>
   <location lat="51.45523" lng="-2.59665" geonameId="2654675">
      <name>Bristol, United Kingdom</name>
   </location>
   <organiser>Tom</organiser>
   <judges>
      <judge>Tom</judge>
      <judge>Dick</judge>
      <judge>Harry</judge>
   </judges>
   <medic>Dr. Mop</medic>
   <disciplines dyn="true" dnf="true" sta="true"/>
</event>
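The attached XSLT is not reproduced here, so the following Python sketch only illustrates the general shape of the merge step: values passed in as parameters (as LATITUDE and LONGITUDE are in TransformXML) are combined with content read from the row. The function name `build_event` is hypothetical, the row is trimmed, and only a subset of the elements in the document above is produced.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Trimmed copy of the table row used throughout this recipe.
ROW = """<tr xmlns="http://www.w3.org/1999/xhtml">
  <td>2018-03-24</td>
  <td>
    <a href="showevent.php?idCompetitions=2019">Bristol Blue 2018</a>
  </td>
  <td>Bristol, United Kingdom</td>
  <td>Pool Competition</td>
</tr>"""


def build_event(row_xml: str, event_id: str, latitude: str, longitude: str) -> ET.Element:
    """Merge row content with externally supplied lat/long values."""
    cells = ET.fromstring(row_xml).findall(f"{XHTML}td")
    event = ET.Element("event", {
        "id": event_id,
        "date": cells[0].text.strip(),
        "type": cells[3].text.strip(),
    })
    # Event name comes from the link text in the second cell.
    ET.SubElement(event, "name").text = cells[1].find(f"{XHTML}a").text.strip()
    # Lat/long arrive as parameters; they are not present in the row itself.
    location = ET.SubElement(event, "location", {"lat": latitude, "lng": longitude})
    ET.SubElement(location, "name").text = cells[2].text.strip()
    return event


event = build_event(ROW, "2019", "51.45523", "-2.59665")
print(ET.tostring(event, encoding="unicode"))
```

Passing the coordinates in as arguments mirrors the way the flow keeps the service response in attributes and leaves the Flow File content untouched until the transform.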