2-8 Loading Content and Metadata from an mlcp Archive File

This example uses the MergeContent processor to join the content and metadata files from an mlcp archive (zip) file into a single flowfile. The key to the merge is the "marklogic.uri" attribute, which is set early in the flow for both the content and metadata files. This attribute is set as the "Correlation Attribute Name" in MergeContent so that the two files are grouped into one "bin" and merged. Before the merge, the EvaluateXPath processor parses the metadata file into attributes. ReplaceText then removes the metadata XML from the flowfile content so that the two flowfiles merge cleanly.

MergeContent will only merge content from a single relationship queue, so a funnel is used to aggregate the metadata and content back into a single queue. I set the size of the queue between the funnel and MergeContent (25,000) to be larger than the number of documents in my test archive. Similarly, MergeContent's "Maximum Number of Bins" should be larger than the number of documents in your archive. If you are processing more than one archive at a time, these numbers might need to be even larger.
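
To make the binning idea concrete, here is a minimal Python sketch of a correlation-based merge. It is not NiFi code and not how MergeContent is implemented; the dictionary shape used for a flowfile and the function name merge_by_correlation are assumptions made for illustration. It also ignores bin limits and bin age, and simply merges everything it is handed in one pass.

```python
from collections import defaultdict


def merge_by_correlation(flowfiles, correlation_attr="marklogic.uri"):
    """Group flowfiles by a correlation attribute and merge each group.

    Each flowfile is modeled (hypothetically) as a dict:
    {"attributes": {name: value}, "content": bytes}
    """
    # One "bin" per distinct value of the correlation attribute.
    bins = defaultdict(list)
    for ff in flowfiles:
        bins[ff["attributes"][correlation_attr]].append(ff)

    merged = []
    for members in bins.values():
        # Simplified model of "Keep All Unique Attributes": keep an
        # attribute unless two members disagree about its value.
        attributes = {}
        conflicting = set()
        for ff in members:
            for name, value in ff["attributes"].items():
                if name in conflicting:
                    continue
                if name in attributes and attributes[name] != value:
                    del attributes[name]
                    conflicting.add(name)
                else:
                    attributes[name] = value

        # The metadata flowfile's content was emptied beforehand, so the
        # concatenation is just the document content.
        content = b"".join(ff["content"] for ff in members)
        merged.append({"attributes": attributes, "content": content})
    return merged
```

In this model the content flowfile and the emptied metadata flowfile share one "marklogic.uri" value, so they land in the same bin and come out as one flowfile carrying the document bytes plus the marklogic.* attributes.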

Download Template
Processors:
  • GetFile – reads files from a watched directory
    • Properties
      • Input Directory: /some/path
  • UnpackContent – decompresses the archive file
    • Properties
      • Packaging Format: zip
    • Settings
      • Automatically Terminate Relationships: failure, original
  • UpdateAttribute – sets the marklogic.uri attribute used to correlate content and metadata
    • Properties
      • marklogic.uri: ${ path:replaceAll('\\\\', '/'):append('/'):append( ${ filename:replaceAll('\.metadata$', '') } ) }
  • RouteOnAttribute
    • Properties
      • isMetadata: ${filename:endsWith(".metadata")}
      • isNotMetadata: ${filename:endsWith(".metadata"):not()}
  • EvaluateXPath
    • Properties
      • Destination: flowfile-attribute
      • marklogic.collections: string-join(/*[local-name()='com.marklogic.contentpump.DocumentMetadata']/*[local-name()='collectionsList']/*[local-name()='string'], ',')
      • marklogic.permissions: string-join(/*[local-name()='com.marklogic.contentpump.DocumentMetadata']/*[local-name()='permissionsList']/*[local-name()='string'], ',')
      • marklogic.format: string(/*[local-name()='com.marklogic.contentpump.DocumentMetadata']/*[local-name()='format']/*[local-name()='name'])
      • marklogic.quality: string(/*[local-name()='com.marklogic.contentpump.DocumentMetadata']/*[local-name()='quality'])
    • Settings
      • Automatically Terminate Relationships: failure
  • ReplaceText
    • Properties
      • Search Value: (?s)^(.*)$
      • Replace Value: (leave empty, check "Set empty string")
      • Replacement Strategy: Always Replace
    • Settings
      • Automatically Terminate Relationships: failure
  • MergeContent
    • Properties
      • Attribute Strategy: Keep All Unique Attributes
      • Correlation Attribute Name: marklogic.uri
      • Max Bin Age: 1000 days
      • Maximum Number of Bins: 25000
    • Settings
      • Automatically Terminate Relationships: failure, original
  • InvokeHTTP – HTTP PUT to MarkLogic REST API /LATEST/documents
    • Properties
      • HTTP Method: PUT
      • Remote URL: http://localhost:8000/LATEST/documents?uri=${marklogic.uri}&collection=${marklogic.collections}&format=${marklogic.format}&quality=${marklogic.quality}
      • Basic Authentication Username: youruser
      • Basic Authentication Password: yourpassword
    • Settings
      • Check all five checkboxes under "Automatically Terminate Relationships"
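
For readers who want to check the expressions outside NiFi, the per-file steps above (UpdateAttribute, RouteOnAttribute, EvaluateXPath) can be approximated in plain Python. This is a rough sketch, not part of the template: the function names are made up for illustration, and the XML handling assumes only the element names that appear in the XPath expressions above. Namespaces are ignored, as the local-name() tests do.

```python
import re
import xml.etree.ElementTree as ET

ROOT_NAME = "com.marklogic.contentpump.DocumentMetadata"


def marklogic_uri(path, filename):
    """Mirror the UpdateAttribute expression: backslashes become '/',
    then '/' and the filename (minus a trailing '.metadata') are appended."""
    return path.replace("\\", "/") + "/" + re.sub(r"\.metadata$", "", filename)


def is_metadata(filename):
    """Mirror the RouteOnAttribute 'isMetadata' test."""
    return filename.endswith(".metadata")


def _local(tag):
    # Drop a '{namespace}' prefix, like local-name() in the XPath.
    return tag.rsplit("}", 1)[-1]


def _children(elem, name):
    return [child for child in elem if _local(child.tag) == name]


def parse_metadata(xml_text):
    """Approximate the four EvaluateXPath properties as a dict of attributes."""
    root = ET.fromstring(xml_text)
    if _local(root.tag) != ROOT_NAME:
        raise ValueError("unexpected root element: " + root.tag)

    def joined_strings(list_name):
        # string-join(<list>/string, ',')
        return ",".join(
            "".join(item.itertext())
            for lst in _children(root, list_name)
            for item in _children(lst, "string")
        )

    def first_text(*names):
        # string(<path>): string value of the first matching node, or "".
        nodes = [root]
        for name in names:
            nodes = [c for node in nodes for c in _children(node, name)]
        return "".join(nodes[0].itertext()) if nodes else ""

    return {
        "marklogic.collections": joined_strings("collectionsList"),
        "marklogic.permissions": joined_strings("permissionsList"),
        "marklogic.format": first_text("format", "name"),
        "marklogic.quality": first_text("quality"),
    }
```

Because marklogic_uri strips the ".metadata" suffix, a document and its metadata file produce the same URI, which is what lets MergeContent pair them up later in the flow.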