Getting Started Tutorial 1.x

The MarkLogic Data Hub Framework is free and open source under the Apache 2 License and is supported by the community of developers who build and contribute to it. Please note that Data Hub Framework is not a supported MarkLogic product.

This tutorial is for version 1.x of the Data Hub Framework which was designed to work with MarkLogic 8. If you need to take advantage of MarkLogic 9 features, go to the 2.x Getting Started Tutorial

Be advised that this tutorial was created some time ago and has not yet been updated to reflect the most recent UI screens.

Building an HR Hub

This tutorial will walk you through setting up a very simple hub for HR data. Your company Global Corp has provided you with CSV dumps of 3 tables. Additionally you are receiving JSON data files from a recently acquired company Acme Tech. You are responsible for loading data from both systems into your data hub to make them accessible to internal systems.

In a Hurry?

The finished version of this tutorial is available for you to download and play with. Finished HR Hub Example

QuickStart

This tutorial uses QuickStart, a simple User Interface that you can run locally to start working with the Data Hub quickly. With QuickStart you will have a working hub in a matter of minutes. No need to worry about deployment strategies or configuration details. Simply run the war and point it at your MarkLogic installation.

Prerequisites

Before you can run the hub you will need to have some some software installed.

1 - Download and Install MarkLogic

Follow the official instructions for installing MarkLogic.

2 - Download the QuickStart War

  • Create a folder for this hub project and cd into it.
mkdir data-hub
cd data-hub
  • Download the quick-start-*.war from the releases page and place it in the folder you just created.

3 - Download the Sample Data

  • Create a folder to hold your input data
mkdir input

Your directory should look like this:

Directory Tree

4 - Run the QuickStart

  • Open a terminal window in the data-hub directory
  • Run the War
java -jar quick-start-*.war

If you need to run on a different port then add the –server.port option

java -jar quick-start-*.war --server.port=9000

5 - Login to the Hub

After opening the QuickStart Application you must step through a wizard to properly configure the Hub.

Browse to the directory where your hub where live. Hub Directory

Initialize your Data Hub Project Directory. Hub Directory

Click Next. Hub Directory

Choose the Local Environment. Hub Directory

Login to the Hub with your MarkLogic credentials Hub Directory

Install the Hub into MarkLogic Hub Directory

Click Finished Hub Directory

6 - Create Your First Entity

Entities are the business objects that you will be working with in the hub. Start by defining a new Employee Entity. Click the New Entity button. Now fill out the popup with your entity name. If you are using the example code we provide then make sure to name this “Employee”.

New Entity

You have just created an Entity.

7 - Create Your First Flows

Input Flows are responsible for getting data into the Hub staging area.

Harmonize Flows are responsible for batch transformation of data from staging to final.

First you will create an Input and Harmonize flow for Acme Tech. Start by clicking the Employee Entity to expand it. Now click the + button next to the Input Flows then fill out the form.

Name the flow “load-acme-tech” and leave the default values for Plugin Type and Data Format.

New Acme Tech Flows

Now click the + button next to the Harmonize Flows then fill out the form. Name the flow “harmonize-acme-tech” and leave the default values for Plugin Type and Data Format.

New Acme Tech Flows

** It’s important to keep the names the same if you plan on using our code examples. The code examples are performing collection queries tied to the names of these flows.

Next you will want to create an Input and Harmonize flow for Global Corp. Start by clicking the + button next to Input Flows. Then fill out the form. Name the flow “load-global-corp” and leave the default values for Plugin Type and Data Format.

New Acme Tech Flows

Now click the + button next to the Harmonize Flows then fill out the form. Name the flow “harmonize-global-corp” and leave the default values for Plugin Type and Data Format.

New Acme Tech Flows

The Quick Start application automatically deploys your flows to the server for you any time you make a change. You can manually force a redeploy by pressing the Redeploy Button at the bottom left of the screen.

8 - Ingest Acme Tech Data

Now that your entity is created you want to ingest some data. QuickStart uses the MarkLogic Content Pump to ingest data.

  1. Click on the load-acme-tech input flow to highlight it.
  2. Now click on the “Run Flow” button in the right pane. You will see a screen that allows you to configure your MLCP run. Click Acme Tech

  3. Point the browser box to the input/AcmeTech directory. Browse to Input Folder
  4. Expand the General Options section.
  5. Make sure the Documents Data Format is selected. Choose Documents
  6. Now scroll down and press the “Run Import” button. Run Acme Tech

QuickStart will kick off a MarkLogic Content Pump job to ingest the Json documents. You can monitor the progress of the job by navigating to the Jobs tab and clicking on the “Show Console Output” button next to the Job. View Acme Tech Job Output

9 - Ingest Global Corp Data

Now you need to load the data for Global Corp.

  1. Click on the load-global-corp input flow to highlight it.
  2. Now click on the “Run Flow” button in the right pane. You will see a screen that allows you to configure your MLCP run. Click Global Corp
  3. Point the browser box to the input/GlobalCorp directory. Browse to Input Folder
  4. Expand the General Options section.
  5. Choose the Delimited Text Data Format. Choose Delimited Text
  6. Change Document Type to json. Choose Json
  7. Expand the Delimited Text Options section.
  8. Check the Generate URI? option. Click Generate URI
  9. Now scroll down and press the “Run Import” button. Click Global Corp

QuickStart will kick off a MarkLogic Content Pump job to ingest the Json documents. You can monitor the progress of the job by navigating to the Jobs tab and clicking on the “Show Console Output” button next to the Job.

Monitor Job Progress

10 - Prep for Harmonize

All of our data is loaded into the staging area. While it’s possible to harmonize the data right now it’s not very useful. The out of the box harmonize plugins will simply copy the staging data to the final data area.

We are going to enhance the data a bit so that it can be more easily searched and accessed. To do this we will identify some commonalities between our two data sets and choose a few fields to extract into the header section of our final envelopes.

For this tutorial we will pull out 3 headers:

  • employee id
  • hire date
  • salary

Because we are dealing with two separate data sources we will put the logic for each source into its own flow.

Acme Tech Collector

For Acme Tech we want to return a list of URIs. Since the Acme Tech data came to us as JSON documents, there is only one document for every employee.

Use your favorite text editor to open the data-hub/plugins/entities/Employee/harmonize/harmonize-acme-tech/collector/collector.sjs file. Replace its contents with this:

Acme Tech header plugin

Use your favorite text editor to open the data-hub/plugins/entities/Employee/harmonize/harmonize-acme-tech/headers/headers.sjs file. Replace its contents with this:

Global Corp Collector

The collector is a plugin that provides a list of items to the Harmonize flow to be acted upon. By default the out of the box collector will return all document URIs in the system. We need to change this. For Global Corp we want to return a list of Employee IDs. This allows us to iterate over each employee ID and create an employee document per ID.

Use your favorite text editor to open the data-hub/plugins/entities/Employee/harmonize/harmonize-global-corp/collector/collector.sjs file. Replace its contents with this:

Global Corp Content Plugin

For Global corp we are going to use the harmonize step to recreate employee records for every employee ID that is in our staging area. Recall that for the collector we are returning employee IDs instead of URIs. Open up your favorite text editor to the data-hub/plugins/entities/Employee/harmonize/harmonize-global-corp/content/content.sjs file. Replace its contents with this:

Global Corp Header Plugin

Use your favorite text editor to open the data-hub/plugins/entities/Employee/harmonize/harmonize-global-corp/headers/headers.sjs file. Replace its contents with this:

11 - Harmonize the data

You ingested your data. You created plugins that will extract common fields into the headers. You edited the collectors to only operate on certain data. Now you are ready to harmonize. Simply press the Run button next to both harmonize flows.

Running Acme Tech Run Harmonize Flow

Running Global Corp Run Harmonize Flow

As with the Input Flows you can see the job status in the Jobs tab.

12 - Consume the Data

Now you can access your data via several REST endpoints. Your harmonized data is available on the Final HTTP server. The defaul port is 8011. A full list of REST endpoints is available here: https://docs.marklogic.com/REST/client

Open the Staging Search Endpoint against your local instance.

Open the Final Search Endpoint against your local instance.

Picture here is the Final Search endpoint. Rest Search

13 - Wrapping Up

Congratulations! You just created a Data Hub.

  • You loaded JSON and CSV files.
  • You harmonized your data by extracting common header fields.
  • Your data is now fully accessible via the MarkLogic REST API