1-9 Generate Documents from CSV Files

Method 1

This example uses SplitRecord, backed by a CSVReader and an AvroRecordSetWriter, to break a large CSV file into batches of records before converting each record to JSON. EvaluateJsonPath then copies selected JSON values into FlowFile attributes, which UpdateAttribute uses to construct the document URI that PutMarkLogic writes to.

  • Download Template
  • Controller Services:
    • CSVReader
      • Schema Access Strategy: Use String Fields From Header
      • Schema Name: InferredSchema
      • Schema Text: ${inferred.avro.schema}
      • CSV Format: Microsoft Excel
      • Treat First Line as Header: true
    • AvroRecordSetWriter
      • Schema Write Strategy: Embed Avro Schema
      • Schema Access Strategy: Inherit Record Schema
      • Schema Name: ${schema.name}
      • Schema Text: ${avro.schema}
  • Processors:
    • GetFile – reads files from a watched directory
      • Properties
        • Input Directory: /some/path
    • SplitRecord
      • Properties
        • Record Reader: CSVReader
        • Record Writer: AvroRecordSetWriter
        • Records Per Split: 10000
      • Settings
        • Automatically Terminate Relationships: failure, original
    • SplitAvro
      • Properties
        • (all default)
      • Settings
        • Automatically Terminate Relationships: failure, original
    • ConvertAvroToJson
      • Properties
        • (all default)
      • Settings
        • Automatically Terminate Relationships: failure
    • EvaluateJsonPath – stores values from the JSON in FlowFile attributes
      • Properties
        • Destination: flowfile-attribute
        • date.of.stop: $.Date_Of_Stop (custom property)
        • geolocation: $.Geolocation (custom property)
        • time.of.stop: $.Time_Of_Stop (custom property)
      • Settings
        • Automatically Terminate Relationships: failure, unmatched
    • UpdateAttribute – builds the MarkLogic URI
      • Properties
        • ml.uri: /${date.of.stop}/${time.of.stop:replaceAll(":", "_")}/${geolocation:replaceAll(" ", "_"):replaceAll("\(", ""):replaceAll("\)", ""):replaceAll(",", "")}/${uuid}.json
    • PutMarkLogic
      • Properties
        • DatabaseClient Service: (your MarkLogic DatabaseClient Service)
        • URI Attribute Name: ml.uri
      • Settings
        • Automatically Terminate Relationships: failure, success
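
The EvaluateJsonPath and UpdateAttribute steps above can be approximated outside NiFi to check what URI a given record would produce. The following is a minimal Python sketch, not NiFi code: the function name `build_ml_uri` and the sample values are hypothetical, and Python's `re.sub` stands in for the Expression Language `replaceAll` function, which behaves the same way for these simple patterns.

```python
import json
import re


def build_ml_uri(record_json: str, flowfile_uuid: str) -> str:
    """Mimic the ml.uri expression for one JSON record (illustrative only)."""
    record = json.loads(record_json)

    # EvaluateJsonPath: copy three JSON values into "attributes".
    date_of_stop = record["Date_Of_Stop"]
    time_of_stop = record["Time_Of_Stop"]
    geolocation = record["Geolocation"]

    # UpdateAttribute: the same chain of replaceAll calls as ml.uri.
    time_part = re.sub(r":", "_", time_of_stop)
    geo_part = re.sub(r" ", "_", geolocation)
    geo_part = re.sub(r"\(", "", geo_part)
    geo_part = re.sub(r"\)", "", geo_part)
    geo_part = re.sub(r",", "", geo_part)

    return f"/{date_of_stop}/{time_part}/{geo_part}/{flowfile_uuid}.json"


# Hypothetical record and UUID, chosen only to exercise every replacement.
sample = json.dumps({
    "Date_Of_Stop": "2013-09-24",
    "Time_Of_Stop": "17:11:00",
    "Geolocation": "(39.12, -77.04)",
})
print(build_ml_uri(sample, "1234"))
# prints /2013-09-24/17_11_00/39.12_-77.04/1234.json
```

In the real flow the last path segment comes from the FlowFile's own `uuid` attribute, so each record gets a unique URI even when the date, time, and location are identical.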

Method 2

This example demonstrates how to generate JSON documents from CSV files. It uses the same input data and URI structure as the corresponding use case in the MLCP Guide. The URI construction steps assume that every CSV row has a "last" name value. To handle multiple CSV formats, apply the strategies in this earlier example after ConvertAvroToJson.

Ideally there would be a single processor to convert from CSV to JSON or XML. Instead we have to convert via the intermediate Avro format.

Note: This method works fine on smaller CSV files. For CSVs with more than roughly 100K rows, use the SplitRecord approach from Method 1 above.
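
To picture what the SplitRecord approach does differently, the following Python sketch reads a CSV lazily and hands back fixed-size batches of rows, loosely analogous to the Records Per Split setting. It is a rough stand-in under stated assumptions, not NiFi's implementation: the function name `split_records` and the sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def split_records(csv_text: str, records_per_split: int):
    """Yield lists of row dicts, at most records_per_split rows per list."""
    reader = csv.DictReader(io.StringIO(csv_text))  # first line is the header
    while True:
        batch = list(islice(reader, records_per_split))
        if not batch:
            break
        yield batch


# Hypothetical five-row CSV split into batches of two.
data = "id,last\n1,Ames\n2,Boyd\n3,Cole\n4,Dunn\n5,Else\n"
sizes = [len(batch) for batch in split_records(data, 2)]
print(sizes)
# prints [2, 2, 1]
```

Each batch can then be processed on its own, so downstream steps never have to handle every row of a very large file at once.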

  • Download Template
  • Processors:
    • GetFile – reads files from a watched directory
      • Properties
        • Input Directory: /some/path
    • InferAvroSchema
      • Properties
        • Schema Output Destination: flowfile-attribute
        • Input Content Type: csv
        • Get CSV Header Definition From Data: true
        • Avro Record Name: MyCSV
      • Settings
        • Automatically Terminate Relationships: failure, original, unsupported content
    • ConvertCSVToAvro
      • Properties
        • Record Schema: ${inferred.avro.schema}
      • Settings
        • Automatically Terminate Relationships: failure, incompatible
    • SplitAvro
      • Properties
        • (all default)
      • Settings
        • Automatically Terminate Relationships: failure, original
    • ConvertAvroToJson
      • Properties
        • (all default)
      • Settings
        • Automatically Terminate Relationships: failure
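
The net effect of the processors above is one JSON document per CSV row, with field names taken from the header line. A minimal Python sketch of that end result follows. It skips the intermediate Avro step entirely and keeps every value as a string, so it illustrates only the shape of the output, not the types InferAvroSchema would infer. The function name `csv_to_json_documents` and the sample data are hypothetical.

```python
import csv
import io
import json


def csv_to_json_documents(csv_text: str) -> list[str]:
    """Return one JSON string per CSV row, keyed by the header names."""
    reader = csv.DictReader(io.StringIO(csv_text))  # header row supplies keys
    return [json.dumps(row) for row in reader]


# Hypothetical two-row input with a "last" column.
docs = csv_to_json_documents("first,last\nAda,Lovelace\nAlan,Turing\n")
for doc in docs:
    print(doc)
# prints {"first": "Ada", "last": "Lovelace"}
# prints {"first": "Alan", "last": "Turing"}
```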