How Does it Work?

Smart Mastering consists of configuration-driven matching and merging. The How To Use page shows how to access this functionality; this page describes the matching and merging processes that make up Smart Mastering.

The Smart Mastering process focuses on one document at a time, identifying any other documents that appear to represent the same entity (Matching) and then combining the values in those documents to create a new document (Merging). Matching and merging are run across either harmonized XML or harmonized JSON documents.

Collections

Smart Mastering Core uses collections to separate types of content. Applications should use the constants defined in constants.xqy for the names of the collections. The most important collection is $CONTENT-COLL, which contains the current set of entities that an application should use. When a set of documents is merged, those documents are moved out of $CONTENT-COLL and into $ARCHIVED-COLL. The generated merged document is placed in both $CONTENT-COLL and $MERGED-COLL. Documents can also be unmerged, in which case the merged document is moved to $ARCHIVED-COLL.
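
The collection changes described above can be modeled in a few lines of plain JavaScript. This is only an illustrative sketch, not Smart Mastering code: the collection name strings are hypothetical stand-ins for the values in constants.xqy, the function names are invented here, and the unmerge model covers only what this page states.

```javascript
// Hypothetical stand-ins for the collection names defined in constants.xqy;
// the real values are not given on this page.
const CONTENT_COLL = "content";
const ARCHIVED_COLL = "archived";
const MERGED_COLL = "merged";

// Move one document out of the content collection and into the archive.
function archive(collections) {
  const kept = collections.filter((c) => c !== CONTENT_COLL);
  return [...new Set([...kept, ARCHIVED_COLL])];
}

// Model of a merge: every source document is archived, and the new merged
// document is placed in both the content and merged collections.
function applyMerge(collectionsByUri, sourceUris, mergedUri) {
  const next = new Map(collectionsByUri);
  for (const uri of sourceUris) {
    next.set(uri, archive(next.get(uri) || []));
  }
  next.set(mergedUri, [CONTENT_COLL, MERGED_COLL]);
  return next;
}

// Model of an unmerge, limited to what this page states: the merged document
// is archived. Keeping its other collections is an assumption of this sketch,
// and the fate of the source documents is not modeled.
function applyUnmerge(collectionsByUri, mergedUri) {
  const next = new Map(collectionsByUri);
  next.set(mergedUri, archive(next.get(mergedUri) || []));
  return next;
}
```

Both functions return a new map rather than changing their input, which makes it easy to compare the state before and after a merge.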

Matching

The matching process begins with a document, which we’ll call the “original” document. This document may be selected because it’s just been inserted into the database, or because a process is cycling through all content.

Matching Process

  1. The original document gets inserted into the database and the matching process begins.
  2. The matcher uses the match configuration and the original document to determine the properties and values that will be used for matching.
  3. The property values are turned into a query that optionally gets combined with a user-provided filtering query used to restrict matches to a set, such as a specific entity type or collection.
  4. The matcher runs the combined query to identify potential matches for the original document.

The query will be run once, generating a score-ordered sequence of potential matches, each of which is labeled according to a threshold of match probability. A match response will look like this:

<result uri="/source/3/doc3.json" index="1" score="79" threshold="Definitive Match" action="merge">
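
To make the labeling step concrete, here is a minimal sketch of how a score might be compared against a list of thresholds. The threshold values, the property names (above, label, action), and the rule that a score must exceed a cut-off are assumptions made for illustration; the match configuration defines the real thresholds.

```javascript
// Hypothetical threshold list; the labels, cut-offs, and actions are example
// values, and the property names are assumptions of this sketch.
const thresholds = [
  { above: 30, label: "Possible Match" },
  { above: 50, label: "Likely Match", action: "notify" },
  { above: 75, label: "Definitive Match", action: "merge" },
];

// Return the highest threshold that the score exceeds, or null if none.
function labelScore(score, thresholdList) {
  const passed = thresholdList
    .filter((t) => score > t.above)
    .sort((a, b) => b.above - a.above);
  return passed.length > 0 ? passed[0] : null;
}

// Drop candidates below every threshold, order the rest by descending score,
// and attach a 1-based index, the threshold label, and the action (if any).
function labelMatches(candidates, thresholdList) {
  return candidates
    .map((c) => ({ ...c, hit: labelScore(c.score, thresholdList) }))
    .filter((c) => c.hit !== null)
    .sort((a, b) => b.score - a.score)
    .map((c, i) => ({
      uri: c.uri,
      index: i + 1,
      score: c.score,
      threshold: c.hit.label,
      action: c.hit.action || null,
    }));
}
```

With these example thresholds, a candidate with a score of 79 is labeled "Definitive Match" with the action "merge", in line with the sample result above.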

Smart Mastering expects the documents it works with to be either all XML or all JSON, not a mix of the two. If mastering runs on mixed content, the behavior is undefined.

Matching Algorithms

By default, the matcher looks for documents whose property values match the original document by using a cts:element-value-query, which requires exactly the same value that the original document has. In some cases, you might want more flexibility in defining what counts as a match. For those cases, you can specify an algorithm.

A matching algorithm function takes the value(s) of a single property from the original document, along with the match configuration, and uses that to generate a query to find other documents that have relevant values for the same property. This function will be run once for each original document. This is a normal XQuery or SJS function, so it can make database queries, call out to third-party services, or do anything else needed to generate that query.

To see an example of a matching algorithm function, see zip.xqy.
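
As a rough illustration of the shape of such a function, the following plain JavaScript sketch handles a hypothetical postal code property. It is not a transcription of zip.xqy: the function name, the options.weight setting, and the returned plain-object query description are all invented here, whereas a real algorithm would return a cts query.

```javascript
// Hypothetical algorithm: given the postal-code values of one property on the
// original document, describe a query that also accepts the short or long
// form of each code. The returned object is a plain description invented for
// this sketch; a real algorithm would build a cts query instead.
function postalCodeMatch(values, propertyName, options = {}) {
  const weight = options.weight ?? 1; // assumed option name
  const exact = (value) => ({ type: "value", property: propertyName, value, weight });
  const alternatives = [];
  for (const raw of values) {
    const code = String(raw).trim();
    alternatives.push(exact(code)); // the exact value always matches
    if (/^\d{5}$/.test(code)) {
      // Five digits: also accept any nine-digit code that starts with it.
      alternatives.push({
        type: "wildcard",
        property: propertyName,
        pattern: `${code}-*`,
        weight,
      });
    } else if (/^\d{5}-\d{4}$/.test(code)) {
      // Nine digits: also accept the five-digit prefix.
      alternatives.push(exact(code.slice(0, 5)));
    }
  }
  return { type: "or", queries: alternatives };
}
```

Values that look like neither form fall back to an exact match, which mirrors the default behavior described above.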

Merging

Merging takes a set of documents and creates a new document to represent the combination. The structure of the document will match the originals, which are assumed to have been harmonized. The merge configuration controls how property values from the input documents are preserved in the new document.

When two or more documents get merged, they are removed from the $CONTENT-COLL. A new document is added to the $CONTENT-COLL and to the $MERGED-COLL. Smart Mastering Core will record that these documents were combined into the new one, including the source for each property value in the merged document. This allows an application to observe the history of a document and its properties, as well as to undo a merge.

Merging Algorithms

There is a standard algorithm available to combine properties, which is described on the Merging Options page.

Smart Mastering Core also supports custom merge algorithms. A custom merge algorithm is a function that takes the xs:QName of an XML element or the name of a JSON property, the values from the input documents, and the merge configuration for that property (see the Merging Options page). The function returns an ordered sequence of property values; the algorithm determines both the length of the sequence and its ordering. Note that the algorithm is not limited to gathering or choosing among values from the input documents; it may also aggregate those values.
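
The two plain JavaScript functions below sketch what a selecting algorithm and an aggregating algorithm could look like. They are illustrations only: the item shape ({ value, source, dateTime }), the options.maxValues setting, and the "aggregate" source label are assumptions, and the real function signature is defined by the library.

```javascript
// Hypothetical selecting algorithm: keep the newest distinct values, up to
// options.maxValues. propertyName is part of the assumed signature and is
// unused here. Timestamps are assumed to be ISO 8601 strings in one time
// zone, so that comparing them as strings orders them by time.
function mostRecentDistinct(propertyName, items, options = {}) {
  const maxValues = options.maxValues ?? Infinity;
  const newestFirst = [...items].sort((a, b) =>
    a.dateTime < b.dateTime ? 1 : a.dateTime > b.dateTime ? -1 : 0
  );
  const seen = new Set();
  const result = [];
  for (const item of newestFirst) {
    if (result.length >= maxValues) break;
    const key = JSON.stringify(item.value);
    if (seen.has(key)) continue; // duplicate value; the newer copy is kept
    seen.add(key);
    result.push(item);
  }
  return result;
}

// Hypothetical aggregating algorithm: return one computed value that does
// not appear in any of the inputs.
function averageValue(propertyName, items) {
  if (items.length === 0) return [];
  const total = items.reduce((sum, item) => sum + Number(item.value), 0);
  return [{ value: total / items.length, source: "aggregate" }];
}
```

The first function selects and orders existing values, while the second shows that the returned sequence does not have to be drawn from the input documents.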

To see examples of custom algorithms, see the unit tests in the merging test suite.

Auditing

By default, auditing events are automatically stored for merging and unmerging. These events are stored as PROV-O triples and PROV-XML documents.

NOTE: For auditing to work, you must have a schemas database assigned to your content database.

Data Model

The Smart Mastering library expects data to use the Entity Services envelope structure. This means that the root of a document will have envelope as the local name, with http://marklogic.com/entity-services as the namespace for XML.

Underneath the envelope root, Entity Services envelopes have four children (each in the http://marklogic.com/entity-services namespace when using XML):

  • headers: metadata about the document
  • triples: semantic information that is materialized in the document
  • instance: harmonized properties
  • attachments: source data in its original form

The standard merge algorithm’s recency and source-preference sorting capabilities both rely on identifying the sources. Smart Mastering expects to find the sources under /es:envelope/es:headers/sm:sources/sm:source for XML or /envelope/headers/sources/source for JSON data. For example, in XML:

<envelope xmlns="http://marklogic.com/entity-services"
    xmlns:sm="http://marklogic.com/smart-mastering">
  <headers>
    <sm:id>bbc806e4-ff00-4585-9d46-877edbc3248e</sm:id>
    <sm:sources>
      <sm:source>
        <sm:name>SOURCE1</sm:name>
      </sm:source>
    </sm:sources>
  </headers>
  <!-- es:triples, es:instance, and es:attachments not shown -->
</envelope>

and in JSON:

{
  "envelope": {
    "headers": {
      "sources": [
        {
          "name": "MMIS"
        }
      ]
    },
    // triples, instance, and attachments properties not shown
  }
}
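
As a small illustration of why the sources need to be easy to find, the sketch below reads source names from a JSON envelope shaped like the sample above and orders them by a preference list. Both function names are hypothetical, and the ranking rule is an assumption rather than the library's documented behavior.

```javascript
// Collect source names from a JSON envelope shaped like the sample above.
// Accepting a single object in place of an array is an assumption of this
// sketch, not documented behavior.
function sourceNames(doc) {
  const sources = doc?.envelope?.headers?.sources;
  if (sources === undefined || sources === null) return [];
  const list = Array.isArray(sources) ? sources : [sources];
  return list
    .map((s) => (s ? s.name : undefined))
    .filter((name) => typeof name === "string" && name.length > 0);
}

// Hypothetical source-preference ordering: names earlier in preferredOrder
// come first; unlisted names come last and keep their relative order.
function rankBySourcePreference(names, preferredOrder) {
  const rank = (name) => {
    const i = preferredOrder.indexOf(name);
    return i === -1 ? preferredOrder.length : i;
  };
  return [...names].sort((a, b) => rank(a) - rank(b));
}
```

A document without a headers or sources property yields an empty list here, which a caller could treat as "source unknown".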

For more information about sorting by timestamps, see the Timestamps section of the Merging Options page.

For more information about the Envelope pattern, see What is an Envelope Document in the Entity Services Developer’s Guide.