Harmonizing the Product Data
Now that we have modeled the Product entity, we can use the Data Hub Framework’s code scaffolding to create boilerplate for harmonizing our data. Recall from earlier that the Data Hub Framework can generate code from the Entity Services model definition.
- Click on the Flows tab in the top navigation bar
- Click on the + icon next to Harmonize Flows
- Type Harmonize Products into the Harmonize Flow Name field
- Click the CREATE button
This time we want to use the default option, Create Structure from Entity Definition. This means that the Data Hub Framework will create boilerplate code based on our Entity model. The code will pre-populate the fields we need to add.
Click on the Harmonize Products flow. You can run the harmonize flow from the Flow Info tab. The other tabs allow you to edit the source code for the generated plugins. Take note that there are five plugins for harmonize flows: collector, content, headers, triples, writer.
There are five plugins because harmonize flows typically run as batch jobs (although not always). The Data Hub Framework first invokes the collector inside of MarkLogic. The collector returns a list of strings. The Data Hub Framework then breaks those strings into parallel batches and sends each one to the content, headers, triples, and writer plugins as a transaction.
- collector: returns a list of strings to operate on
- content: returns data to put into the content section of the envelope
- headers: returns data to put into the headers section of the envelope
- triples: returns data to put into the triples section of the envelope
- writer: receives the final envelope and writes it to the database. You can do whatever you like in the writer. The default code inserts the envelope into the database, but you could push the envelope onto a message bus or send a tweet if you like.
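The plugin sequence described above can be sketched in plain JavaScript. This is only an illustration of the control flow, not the framework's generated code: the plugin bodies, the sample documents, the toBatches and runHarmonize helpers, and the exact envelope shape are all assumptions made for the example, and the batches run one after another here rather than in parallel.

```javascript
// Illustrative stand-ins only; none of this is the framework's generated code.
// Made-up staging documents, keyed by URI.
const staging = {
  '/products/1.json': { sku: 'A-100', title: 'Widget' },
  '/products/2.json': { SKU: 'B-200', title: 'Gadget' },
  '/products/3.json': { sku: 'C-300', title: 'Gizmo' },
};
const finalDb = {}; // stand-in for the final database

const plugins = {
  // collector: returns the list of strings (here, URIs) to operate on
  collector: (options) => Object.keys(staging),
  // content: returns data for the content section of the envelope
  content: (id, options) => ({ ...staging[id] }),
  // headers: returns data for the headers section of the envelope
  headers: (id, content, options) => ({ flow: options.flow }),
  // triples: returns data for the triples section of the envelope
  triples: (id, content, headers, options) => [],
  // writer: receives the final envelope and stores it
  writer: (id, envelope, options) => { finalDb[id] = envelope; },
};

// Hypothetical helper: split the collected ids into fixed-size batches.
function toBatches(ids, batchSize) {
  const batches = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    batches.push(ids.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical driver: collector first, then the other four plugins per id.
function runHarmonize(options, batchSize) {
  const batches = toBatches(plugins.collector(options), batchSize);
  for (const batch of batches) {
    for (const id of batch) {
      const content = plugins.content(id, options);
      const headers = plugins.headers(id, content, options);
      const triples = plugins.triples(id, content, headers, options);
      // Assumed envelope shape, for illustration.
      const envelope = { envelope: { headers, triples, content } };
      plugins.writer(id, envelope, options);
    }
  }
  return batches.length;
}

const batchCount = runHarmonize(
  { entity: 'Product', flow: 'Harmonize Products', flowType: 'harmonize' },
  2
);
console.log(batchCount, Object.keys(finalDb));
```

Because the writer is just another function that receives the envelope, swapping its body is all it would take to send the envelope somewhere other than the database.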
Click on the Collector tab.
This collector code is returning a list of URIs, one for every Product document in the staging database. We are using URIs because we intend to create one harmonized document for every ingested staging document.
The code you see uses cts.uris to get values from the URI lexicon. We pass in cts.collectionQuery as the third parameter to constrain our results to only the URIs for documents in the Product collection. We pass options.entity to cts.collectionQuery as the collection name. The Data Hub Framework passes in options from Java to the plugins.
The default options passed in to the plugin are:
- entity: the name of the entity this plugin belongs to
- flow: the name of the flow this plugin belongs to
- flowType: the type of flow being run (input or harmonize)
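As a rough sketch of how the collector and the default options fit together, the example below calls a collector with an options object for this flow. So that it can run outside MarkLogic, it defines a small stub of the built-in cts object; the stub, the sample URIs, and the options values are assumptions for illustration. Inside MarkLogic, cts is already available and cts.uris returns a Sequence rather than an array.

```javascript
// Made-up staging documents and the collections they belong to.
const stagingDocs = [
  { uri: '/products/board1.json', collections: ['Product'] },
  { uri: '/products/board2.json', collections: ['Product'] },
  { uri: '/orders/order1.json', collections: ['Order'] },
];

// Minimal stub of MarkLogic's built-in `cts` object (assumption for this sketch).
// It mimics only the two calls used below.
const cts = {
  collectionQuery: (name) => ({ collection: name }),
  uris: (start, options, query) =>
    stagingDocs
      .filter((doc) => doc.collections.includes(query.collection))
      .map((doc) => doc.uri),
};

// The collector: one URI for every document in the collection named after the entity.
// The query is the third argument to cts.uris.
function collect(options) {
  return cts.uris(null, null, cts.collectionQuery(options.entity));
}

// Example values for the default options listed above.
const options = {
  entity: 'Product',
  flow: 'Harmonize Products',
  flowType: 'harmonize',
};

const uris = collect(options);
console.log(uris);
```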
Click on the Content tab.
The content code receives an id as the first parameter. In this flow, the id is the URI of a staging Product document. The id can be anything: a URI, a relational row id, a Twitter handle, a random number. It’s up to you to decide how to use that id to harmonize your data.
The only modification we need to make to this file is to change the way we look up the sku.
This change will use either sku or SKU, depending on which one is found. This handles the case of the same value arriving under two different field names.
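A minimal sketch of that lookup, assuming the staging document has already been read into a plain object called source. The lookupSku helper name and the sample values are hypothetical, and the generated content.sjs may structure this differently.

```javascript
// Hypothetical helper: return whichever of the two field names is present.
// `source` is assumed to be the staging document as a plain object.
function lookupSku(source) {
  // `||` falls through to SKU when sku is missing (or otherwise falsy).
  return source.sku || source.SKU;
}

// Made-up sample documents, one for each field name.
const fromLowercase = lookupSku({ sku: 'A-100', title: 'Widget' });
const fromUppercase = lookupSku({ SKU: 'B-200', title: 'Gadget' });
console.log(fromLowercase, fromUppercase);
```

Note that || also skips values such as an empty string, which is usually the desired behavior for an identifier like a SKU.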
Here is the final content.sjs file:
After making the code change, click SAVE.
Now click on the Flow Info tab.
Let’s run the flow. Click the RUN HARMONIZE button to start it.
Check out the Harmonized Products
After running the input flow, we verified that the job finished. Let’s do the same for the harmonize flow.
- Click on the Jobs tab.
- Make sure the job finished.
Now let’s explore our harmonized data.
- Click on the Browse tab.
- Change Database to FINAL.
- Click Search.
You should see harmonized documents in the search results.
Click on a result to see the raw data.
Congratulations! You just loaded and harmonized your Product data. Up next is doing the same thing for the Order data.