How Does it Work?
Smart Mastering consists of configuration-driven matching and merging. While the How To Use page will show you how to access this functionality, this page describes the matching and merging processes that make up Smart Mastering.
The Smart Mastering process focuses on one document at a time, identifying any other documents that appear to represent the same entity (Matching) and then combining the values in those documents to create a new document (Merging). Matching and merging are run across either harmonized XML or harmonized JSON documents.
Collections
Smart Mastering Core uses collections to separate types of content. Applications should use constants.xqy for the names of the collections. The most important collection is $CONTENT-COLL, which contains the current set of entities that should be used by an application. When a set of documents gets merged, they are moved out of that collection and into the $ARCHIVED-COLL collection. The generated merged document will be in the $CONTENT-COLL and $MERGED-COLL collections. Documents can also be unmerged, in which case the merged document will go to the $ARCHIVED-COLL.
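The collection movements described above can be modeled with a small in-memory sketch. This is plain JavaScript rather than Smart Mastering code: the collection name strings, the `insertDoc`, `merge`, and `unmerge` helpers, and the `Map` used as a stand-in for the database are all hypothetical. What happens to the source documents on unmerge is also an assumption, since the text only says where the merged document goes.

```javascript
// Placeholder names; the real values come from constants.xqy.
const CONTENT_COLL = "content";
const MERGED_COLL = "merged";
const ARCHIVED_COLL = "archived";

// Stand-in for the database: URI -> Set of collection names.
const collections = new Map();

function insertDoc(uri, colls) {
  collections.set(uri, new Set(colls));
}

// Merging: source documents leave the content collection and are archived;
// the new merged document joins the content and merged collections.
function merge(sourceUris, mergedUri) {
  for (const uri of sourceUris) {
    const colls = collections.get(uri);
    colls.delete(CONTENT_COLL);
    colls.add(ARCHIVED_COLL);
  }
  insertDoc(mergedUri, [CONTENT_COLL, MERGED_COLL]);
}

// Unmerging: the merged document is archived.
// ASSUMPTION: it also leaves the content collection, and the source
// documents return to it. The text above does not state either point.
function unmerge(mergedUri, sourceUris) {
  const merged = collections.get(mergedUri);
  merged.delete(CONTENT_COLL);
  merged.add(ARCHIVED_COLL);
  for (const uri of sourceUris) {
    const colls = collections.get(uri);
    colls.delete(ARCHIVED_COLL);
    colls.add(CONTENT_COLL);
  }
}
```

An application that only queries the content collection would therefore see either the source documents or the merged document, never both.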
Matching
The matching process begins with a document, which we’ll call the “original” document. This document may be selected because it’s just been inserted into the database, or because a process is cycling through all content.
- The original document gets inserted into the database and the matching process begins.
- The matcher uses the match configuration and the original document to determine the properties and values that will be used for matching.
- The property values are turned into a query that optionally gets combined with a user-provided filtering query used to restrict matches to a set, such as a specific entity type or collection.
- The matcher runs the combined query to identify potential matches for the original document.
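The steps above can be sketched in plain JavaScript over an in-memory array. Everything here is illustrative: the configuration shape (`properties`), the `buildCriteria` and `findCandidates` helpers, and the sample documents are hypothetical, and treating a candidate as a potential match when any one property matches is an assumption. Inside MarkLogic the matcher builds and runs a database query instead.

```javascript
// Hypothetical match configuration: the properties to compare.
const matchConfig = { properties: ["lastName", "zip"] };

// Step 2: pick out the property/value pairs used for matching.
function buildCriteria(config, original) {
  return config.properties
    .filter((name) => original[name] !== undefined)
    .map((name) => ({ name, value: original[name] }));
}

// Steps 3 and 4: combine the criteria with an optional filter and run them.
// ASSUMPTION: a candidate qualifies if it matches at least one criterion.
function findCandidates(criteria, docs, original, filter = () => true) {
  return docs.filter(
    (doc) =>
      doc !== original &&
      filter(doc) &&
      criteria.some((c) => doc[c.name] === c.value)
  );
}

// Made-up sample data.
const docs = [
  { uri: "/source/1/doc1.json", type: "Person", lastName: "Smith", zip: "12345" },
  { uri: "/source/2/doc2.json", type: "Person", lastName: "Smith", zip: "54321" },
  { uri: "/source/3/doc3.json", type: "Company", lastName: "Smith", zip: "12345" },
  { uri: "/source/4/doc4.json", type: "Person", lastName: "Jones", zip: "99999" },
];

const original = docs[0];
const criteria = buildCriteria(matchConfig, original);
// The filter restricts matches to one entity type.
const candidates = findCandidates(criteria, docs, original, (d) => d.type === "Person");
console.log(candidates.map((d) => d.uri)); // ["/source/2/doc2.json"]
```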
The query will be run once, generating a score-ordered sequence of potential matches, each of which is labeled according to a threshold of match probability. A match response will look like this:
<result uri="/source/3/doc3.json" index="1" score="79" threshold="Definitive Match" action="merge">
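As a rough illustration of how a score might be turned into the `threshold` and `action` attributes seen in that response, the sketch below labels each scored candidate with the highest threshold it reaches. The threshold values, the `minScore` field, the `labelFor` and `labelMatches` helpers, and the choice to treat a cut-off as inclusive are all assumptions made for the example, not the library's configuration format.

```javascript
// Hypothetical thresholds; labels, cut-offs, and actions are illustrative.
const thresholds = [
  { minScore: 30, label: "Possible Match", action: null },
  { minScore: 50, label: "Likely Match", action: "notify" },
  { minScore: 68, label: "Definitive Match", action: "merge" },
];

// Return the highest threshold the score reaches, or null if it reaches none.
function labelFor(score, thresholdList) {
  const reached = thresholdList.filter((t) => score >= t.minScore);
  if (reached.length === 0) return null;
  return reached.reduce((best, t) => (t.minScore > best.minScore ? t : best));
}

// Order candidates by descending score and attach index, label, and action.
function labelMatches(scored, thresholdList) {
  return [...scored]
    .sort((a, b) => b.score - a.score)
    .map((m, i) => {
      const t = labelFor(m.score, thresholdList);
      return {
        uri: m.uri,
        index: i + 1,
        score: m.score,
        threshold: t ? t.label : null,
        action: t ? t.action : null,
      };
    });
}

const results = labelMatches(
  [
    { uri: "/source/2/doc2.json", score: 40 },
    { uri: "/source/3/doc3.json", score: 79 },
  ],
  thresholds
);
console.log(results[0].threshold, results[0].action); // Definitive Match merge
```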
Smart Mastering expects that the documents it is working with are either all XML or all JSON, rather than mixed. If the content that mastering runs on is mixed, then behavior is undefined.
Matching Algorithms
The default query to look for documents with property values that match the original document is a cts:element-value-query, looking for the exact same value that the original document has. In some cases, you might want to provide more flexibility in defining what a match is. For those cases, you can specify an algorithm.
A matching algorithm function takes the value(s) of a single property from the original document, along with the match configuration, and uses that to generate a query to find other documents that have relevant values for the same property. This function will be run once for each original document. This is a normal XQuery or SJS function, so it can make database queries, call out to third-party services, or do anything else needed to generate that query.
To see an example of a matching algorithm function, see zip.xqy.
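To give a feel for the shape of such a function without reproducing zip.xqy (whose contents are not shown on this page), here is a hypothetical sketch in plain JavaScript. It takes the values of one property plus a configuration object and returns a "query" that tolerates both 5-digit and ZIP+4 forms. The query is modeled as a plain object with `exact` and `prefixes` fields, and the `propertyName` option and `matches` evaluator are invented for the example; a real algorithm would return a database query instead.

```javascript
// Hypothetical matching algorithm: build a query description for ZIP codes.
function zipMatch(values, propertyConfig) {
  const exact = new Set();
  const prefixes = new Set();
  for (const raw of values) {
    const zip = String(raw).trim();
    exact.add(zip);
    const plus4 = /^(\d{5})-\d{4}$/.exec(zip);
    if (plus4) {
      // "12345-6789" should also match the plain 5-digit form.
      exact.add(plus4[1]);
    } else if (/^\d{5}$/.test(zip)) {
      // "12345" should also match any ZIP+4 that starts with it.
      prefixes.add(zip + "-");
    }
  }
  return {
    property: propertyConfig.propertyName,
    exact: [...exact],
    prefixes: [...prefixes],
  };
}

// Evaluate the modeled query against an in-memory document.
function matches(query, doc) {
  const value = doc[query.property];
  if (typeof value !== "string") return false;
  return (
    query.exact.includes(value) ||
    query.prefixes.some((p) => value.startsWith(p))
  );
}

const query = zipMatch(["12345"], { propertyName: "zip" });
console.log(matches(query, { zip: "12345-6789" })); // true
console.log(matches(query, { zip: "12346" })); // false
```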
Merging
Merging takes a set of documents and creates a new document to represent the combination. The structure of the document will match the originals, which are assumed to have been harmonized. The merge configuration controls how property values from the input documents are preserved in the new document.
When two or more documents get merged, they are removed from the $CONTENT-COLL. A new document is added to the $CONTENT-COLL and to the $MERGED-COLL. Smart Mastering Core will record that these documents were combined into the new one, including the source for each property value in the merged document. This allows an application to observe the history of a document and its properties, as well as to undo a merge.
Merging Algorithms
There is a standard algorithm available to combine properties, which is described on the Merging Options page.
Smart Mastering Core also supports custom merge algorithms. A custom merge function takes the xs:QName for an XML element or a JSON property name, values from the input documents, and the merging/merge configuration of this property (see merging options). The function returns an ordered list of property values, with the length of the sequence and the ordering defined by the algorithm. Note that the algorithm is not limited to gathering or choosing among values from the input documents; it may also aggregate those values.
To see examples of custom algorithms, see the unit tests in the merging test suite.
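As a minimal sketch of what a custom merge algorithm might look like, the plain JavaScript function below follows the three inputs described above and returns an ordered list of values, most frequent first. The shape of `propertyValues` (an array of `{ value, source }` objects), the `maxValues` option, and the function name are assumptions made for illustration; the actual signature and data shapes are defined by Smart Mastering Core.

```javascript
// Hypothetical custom merge algorithm: distinct values, most frequent first.
// `propertyName` is accepted only to mirror the described inputs.
function mostFrequentFirst(propertyName, propertyValues, propertyConfig) {
  // Count occurrences; a Map keeps first-seen order for its keys.
  const counts = new Map();
  for (const { value } of propertyValues) {
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  // Array.prototype.sort is stable, so ties keep first-seen order.
  const ordered = [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([value]) => value);
  // ASSUMPTION: an optional `maxValues` setting caps the result length.
  const max =
    propertyConfig.maxValues === undefined
      ? ordered.length
      : propertyConfig.maxValues;
  return ordered.slice(0, max);
}

const merged = mostFrequentFirst(
  "city",
  [
    { value: "NYC", source: "SOURCE1" },
    { value: "New York", source: "SOURCE2" },
    { value: "New York", source: "SOURCE3" },
  ],
  { maxValues: 1 }
);
console.log(merged); // ["New York"]
```

Both the length and the order of the returned list are decided by the algorithm, which is the point the description above makes.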
Auditing
By default, auditing events are automatically stored for merging and unmerging. These events are stored as PROV-O triples and PROV-XML documents.
NOTE: In order for auditing to work you must have a schemas database assigned to your content database.
Data Model
The Smart Mastering library expects data to use the Entity Services envelope structure. This means that the root of a document will have envelope as the local name, with http://marklogic.com/entity-services as the namespace for XML.
Underneath the envelope root, Entity Services envelopes have four children (each in the http://marklogic.com/entity-services namespace when using XML):
- headers: metadata about the document
- triples: semantic information that is materialized in the document
- instance: harmonized properties
- attachments: source data in its original form
The standard merge algorithm’s recency and source-preference sorting capabilities both rely on identifying the sources.
Smart Mastering expects to find the sources under /es:envelope/es:headers/sm:sources/sm:source for XML or /envelope/headers/sources/source for JSON data. For example, in XML:
<envelope xmlns="http://marklogic.com/entity-services"
          xmlns:sm="http://marklogic.com/smart-mastering">
  <headers>
    <sm:id>bbc806e4-ff00-4585-9d46-877edbc3248e</sm:id>
    <sm:sources>
      <sm:source>
        <sm:name>SOURCE1</sm:name>
      </sm:source>
    </sm:sources>
  </headers>
  <!-- es:triples, es:instance, and es:attachments not shown -->
</envelope>
and in JSON:
{
  "envelope": {
    "headers": {
      "sources": [
        {
          "name": "MMIS"
        }
      ]
    },
    // triples, instance, and attachments properties not shown
  }
}
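To make the JSON layout concrete, the following sketch reads source names from an envelope shaped like the JSON example above. It is plain JavaScript, the `sourceNames` helper is hypothetical, and accepting a single object in place of an array is an assumption added for robustness rather than documented behavior.

```javascript
// Collect source names from a JSON envelope shaped like the example above.
function sourceNames(doc) {
  const headers = (doc.envelope && doc.envelope.headers) || {};
  const sources = headers.sources || [];
  // ASSUMPTION: tolerate a single source object as well as an array.
  const list = Array.isArray(sources) ? sources : [sources];
  return list
    .map((s) => s.name)
    .filter((name) => typeof name === "string");
}

const envelopeDoc = {
  envelope: {
    headers: {
      sources: [{ name: "MMIS" }],
    },
  },
};
console.log(sourceNames(envelopeDoc)); // ["MMIS"]
```

A document without a sources header yields an empty list, which is a reminder that recency and source-preference sorting have nothing to work with when sources are not recorded.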
For more information about sorting by timestamps, see the Timestamps section of the Merging Options page.
For more information about the Envelope pattern, see What is an Envelope Document in the Entity Services Developer’s Guide.