Custom Match Algorithms

Smart Mastering provides out-of-the-box matching capabilities, but you may want to customize how that matching happens. The default behavior is simple: for a particular property, if two documents have exactly the same values for that property, then the property is a match. Each property is configured with a weight that contributes to a match score between two documents.
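
As a rough illustration of the default behavior, the plain-JavaScript sketch below adds up the weights of properties whose values are exactly equal in two documents. It is a conceptual model only, not how Smart Mastering is implemented; the function name `exactMatchScore`, the shape of the `weights` object, and the sample documents are all assumptions made for the example.

```javascript
// Hypothetical helper: sums the weights of properties whose values are
// exactly equal in both documents. Property names and weights are made up.
function exactMatchScore(docA, docB, weights) {
  let score = 0;
  for (const [property, weight] of Object.entries(weights)) {
    const a = docA[property];
    const b = docB[property];
    // A property counts only when both documents have it and the values are identical.
    if (a !== undefined && b !== undefined && a === b) {
      score += weight;
    }
  }
  return score;
}

const weights = { lastName: 10, zip: 5, state: 2 };
const doc1 = { lastName: "Smith", zip: "19106", state: "PA" };
const doc2 = { lastName: "Smith", zip: "19106-2320", state: "PA" };

console.log(exactMatchScore(doc1, doc2, weights)); // 12: lastName and state match, zip does not
```

Note that the two zip values do not contribute here even though they describe overlapping areas, which is the kind of case a custom algorithm can address.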

If you want to take more control over what it means for two property values to match, you can do so by implementing your own algorithm in a function.

Harmonization and Smart Mastering

Note that Smart Mastering is intended to run after harmonization, so normally the document structures will at least be the same; generally, the values should have been standardized as well.

As an example, suppose your document has a “state” property, corresponding to the US state in which a person lives. You might have some sources that use the state’s name (“Pennsylvania”), others that use the 2-letter code (“PA”), and still others that use the state’s official, full name (“Commonwealth of Pennsylvania”). Your harmonization process will normally ensure both that the state values from all sources are placed in the same property (“state-code”) and that the values are standardized to one format (“PA”). In this case, a custom algorithm is not needed to compare the property.
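
The kind of standardization described above could look like the following sketch. It is not part of Smart Mastering; the lookup table `STATE_CODES` and the function `toStateCode` are hypothetical, and a real harmonization step would need entries for every state and spelling it expects.

```javascript
// Hypothetical lookup table; keys are lowercased source values.
const STATE_CODES = new Map([
  ["pennsylvania", "PA"],
  ["commonwealth of pennsylvania", "PA"],
  ["pa", "PA"],
]);

// Returns the standardized 2-letter code, or null when the value is not recognized.
function toStateCode(rawValue) {
  const key = String(rawValue).trim().toLowerCase();
  return STATE_CODES.get(key) ?? null;
}

console.log(toStateCode("Pennsylvania"));                 // PA
console.log(toStateCode("Commonwealth of Pennsylvania")); // PA
console.log(toStateCode("PA"));                           // PA
```

Once every source value has been reduced to the same code, the default exact-value comparison is sufficient for this property.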

Customizing Matching

Sometimes data sources simply have different levels of information. Zip codes are a good example. In the United States, an address includes a zip code, which may have either five (“19106”) or nine (“19106-2320”) digits. A nine-digit zip code identifies a more precise location and is entirely contained within the area of the five-digit zip code that it starts with. The zip.xqy module implements an algorithm that awards points when the 5-digit portion of a 9-digit zip code matches a 5-digit zip code.
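
The comparison described above can be sketched in plain JavaScript as follows. This is not the code in zip.xqy; the helper names `parseZip` and `zipPoints`, the point values, and the decision to give no points to two 9-digit codes with different extensions are assumptions made for the example.

```javascript
// Splits a US zip code into its 5-digit base and optional 4-digit extension.
// Returns null for anything that is not in 5-digit or 9-digit form.
function parseZip(value) {
  const m = /^(\d{5})(?:-(\d{4}))?$/.exec(String(value).trim());
  return m ? { base: m[1], plus4: m[2] ?? null } : null;
}

// Hypothetical point values chosen for the example.
function zipPoints(zipA, zipB, points = { exact: 10, fiveDigit: 5 }) {
  const a = parseZip(zipA);
  const b = parseZip(zipB);
  if (!a || !b || a.base !== b.base) return 0;
  if (a.plus4 === b.plus4) return points.exact; // same precision, same value
  if (a.plus4 === null || b.plus4 === null) return points.fiveDigit; // 5-digit vs 9-digit
  return 0; // both 9-digit, different extensions (an assumption of this sketch)
}

console.log(zipPoints("19106", "19106-2320")); // 5
console.log(zipPoints("19106", "19106"));      // 10
console.log(zipPoints("19106", "19107"));      // 0
```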

Matching looks for candidate matches for a particular document. It does this by building a query based on configured properties and the values of those properties in the document.

JavaScript

To implement your own algorithm in JavaScript, create a function with this signature:

function zipMatch(
  expandValues,
  expandXML,
  optionsXML
)

The expandValues parameter contains the value or values from the document. The expandXML parameter is the portion of the expand element of the match options that corresponds to the target property. The optionsXML parameter contains the complete match options.

Your function must return zero or more queries. You can return no queries if your function decides that this property should not be a factor in matching (for instance, if the original document does not have a value for this property).
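
To make the contract concrete, here is a minimal sketch of a function with that signature. It runs as plain JavaScript because the query constructor is replaced by `makeValueQuery`, a hypothetical stand-in that returns a plain object; in a real module you would build actual cts queries at that point. The property name "zip", the assumption that expandValues is iterable, and the particular query shapes are illustrative guesses, not the behavior of the shipped zip algorithm.

```javascript
// Hypothetical stand-in for a real query constructor. In a MarkLogic module you
// would build an actual cts query here instead of a plain object.
function makeValueQuery(propertyName, value, options) {
  return { propertyName, value, options };
}

// Sketch of a custom match function following the signature above.
// Assumes expandValues is iterable and that the property is named "zip".
function zipMatch(expandValues, expandXML, optionsXML) {
  const queries = [];
  for (const raw of expandValues || []) {
    const value = String(raw).trim();
    if (/^\d{5}$/.test(value)) {
      // 5-digit value: also accept any 9-digit zip that starts with it.
      queries.push(makeValueQuery("zip", value + "-*", ["wildcarded"]));
    } else if (/^\d{5}-\d{4}$/.test(value)) {
      // 9-digit value: also accept the bare 5-digit zip.
      queries.push(makeValueQuery("zip", value.slice(0, 5), []));
    }
    // Any other value contributes no query.
  }
  return queries; // may be empty, in which case the property is not a factor
}

console.log(zipMatch([], null, null).length);               // 0
console.log(zipMatch(["19106-2320"], null, null)[0].value); // 19106
```

Returning an empty result for missing or unrecognized values keeps the property out of the match rather than penalizing the document for it.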

XQuery

To implement your own algorithm in XQuery, create a function with this signature:

declare function algorithms:zip-match(
  $expand-values as xs:string*,
  $expand-xml as element(matcher:expand),
  $options-xml as element(matcher:options)
) as cts:query*

The $expand-values parameter contains the value or values from the document. The $expand-xml parameter is the portion of the expand element of the match options that corresponds to the target property. The $options-xml parameter contains the complete match options.

Your function must return zero or more queries. You can return no queries (the empty sequence) if your function decides that this property should not be a factor in matching (for instance, if the original document does not have a value for this property).

Configuring Options to Use Custom Match Functions

To use your custom match functions, add them to the algorithms section of your match options. The algorithm-ref/algorithmRef used for the expand definitions refers to the name you assign in the algorithms section.

The algorithm needs name, at, function, and, for XQuery functions, namespace in order to find your custom code. The at property is the absolute path to the library module in the modules database that holds your function. namespace is the namespace of the XQuery library module. function is the actual name of the function (not including the namespace or prefix for XQuery code).

XML Options

  <algorithms>
    <algorithm 
      name="favorite-color" 
      at="/smart-mastering/match/favorite-color.xqy" 
      namespace="http://example.com/big-hub/smart-mastering/match/favorite-color"
      function="favorite-color"/>
  </algorithms>

JSON Options

    "algorithms": {
      "algorithm": [
        {
          "name": "favoriteColor",
          "at": "/smart-mastering/match/favoriteColor.sjs",
          "function": "favoriteColor" 
        }
      ]
    },
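
Because a missing key means your custom code cannot be found, it can be useful to check the algorithms section before deploying it. The sketch below is a hypothetical helper, not part of Smart Mastering; it assumes the JSON options have been parsed into an object with the `algorithms.algorithm` array shown above and checks only the three keys required for JavaScript functions.

```javascript
// Keys the text above lists as required for every algorithm entry.
const REQUIRED_KEYS = ["name", "at", "function"];

// Returns a list of human-readable problems; an empty list means the entries look complete.
function checkAlgorithms(options) {
  const entries = options?.algorithms?.algorithm ?? [];
  const problems = [];
  entries.forEach((entry, index) => {
    for (const key of REQUIRED_KEYS) {
      if (typeof entry[key] !== "string" || entry[key].length === 0) {
        problems.push(`algorithm[${index}] is missing "${key}"`);
      }
    }
  });
  return problems;
}

const options = {
  algorithms: {
    algorithm: [
      { name: "favoriteColor", at: "/smart-mastering/match/favoriteColor.sjs", function: "favoriteColor" },
    ],
  },
};

console.log(checkAlgorithms(options)); // []
```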