Table of Contents

  1. Out-of-the-Box Match Algorithms
  2. Configuring Add
    1. Standard Algorithm
  3. Configuring Expand
    1. Double Metaphone
    2. Thesaurus
    3. Zip
  4. Reductions
    1. Standard Reduction

Out-of-the-Box Match Algorithms

Smart Mastering provides the following match algorithms that you can use without having to write any code. A match algorithm generates a query that is used to identify and score potential matches. In addition to the built-in algorithms, you can write your own custom functions.

Configuring Add

The simplest way to look for matches is to configure a property under the scoring/add part of the match options (see Matching Options for a complete example).

Standard Algorithm

The standard algorithm is used for properties configured under the scoring/add of the match options. This algorithm builds a query that looks for exact matches. For example, if there is a property called “state” that holds 2-character abbreviations of US states, and the document for which Smart Mastering is looking for matches has the value “PA”, the standard algorithm will look for other documents that have the value “PA” in their state properties. The query is run with the case-insensitive option.

When configuring scoring/add properties, specify the property-name/propertyName and the weight.
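
As a rough illustration of how add weights contribute to a score, the following Python sketch mimics the behavior described above. It is not Smart Mastering's implementation: the function name add_score, the in-memory dictionaries standing in for documents, and the weight given to "state" are all assumptions made for the example. The case-insensitive comparison stands in for the case-insensitive query.

```python
def add_score(original, candidate, add_config):
    """Sum the weights of configured properties whose values match exactly,
    ignoring case (a stand-in for the case-insensitive query)."""
    score = 0
    for entry in add_config:
        name = entry["propertyName"]
        weight = int(entry["weight"])
        a = original.get(name)
        b = candidate.get(name)
        if a is not None and b is not None and a.lower() == b.lower():
            score += weight
    return score


# Mirrors the JSON form of scoring/add; the weights are hypothetical.
add_config = [
    {"propertyName": "state", "weight": "4"},
    {"propertyName": "last-name", "weight": "8"},
]

original = {"state": "PA", "last-name": "Smith"}
candidate = {"state": "pa", "last-name": "Smyth"}

# "state" matches regardless of case; "last-name" is not an exact match.
print(add_score(original, candidate, add_config))  # 4
```

Only exact matches earn points here, which is why inexact matching is configured separately under scoring/expand.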

Configuring Expand

To look for inexact matches, configure a property under the scoring/expand section of the match options. You may specify a property here that is also listed under add; in this case, the scores from both sections will be added together. For each property configured under scoring/expand, specify an algorithm and options (for algorithms that support options).

Double Metaphone

Allows matches between values that are similar in string distance. This algorithm uses a dictionary generated from current content in the database. The query generated by this algorithm is based on values drawn from this dictionary, so the dictionary should be regenerated occasionally as new values are inserted into the database. To regenerate it, re-run the setup-double-metaphone function, either manually or by re-installing the options using matcher:save-options.

Note that to generate the dictionary, there must be a range index on the XML element or JSON property where the values can be found.

To add this algorithm to your match configuration, add XML or JSON like the following, assuming that you have configured a property named “last-name”. Change the weights to work with your other properties.

<options xmlns="http://marklogic.com/smart-mastering/matcher">
  <algorithms>
    <algorithm
      name="double-metaphone"
      function="double-metaphone"
      namespace="http://marklogic.com/smart-mastering/algorithms"
      at="/com.marklogic.smart-mastering/algorithms/double-metaphone.xqy"/>
  </algorithms>
  <scoring>
    <add property-name="last-name" weight="8"/>
    <expand property-name="last-name" algorithm-ref="double-metaphone">
      <distance-threshold>20</distance-threshold>
      <dictionary>/dictionaries/last-names.xml</dictionary>
      <collation>http://marklogic.com/collation/codepoint</collation>
    </expand>
  </scoring>
</options>

{
 "options": {
   "algorithms": {
     "algorithm": [
       {
         "name": "dbl-metaphone",
         "namespace": "http://marklogic.com/smart-mastering/algorithms",
         "function": "double-metaphone",
         "at": "/com.marklogic.smart-mastering/algorithms/double-metaphone.xqy"
       }
     ]
   },
   "scoring": {
     "add": [
       { "propertyName": "last-name", "weight": "8" }
     ],
     "expand": [
       {
         "propertyName": "last-name",
         "algorithmRef": "dbl-metaphone",
         "weight": "8",
         "dictionary": "/dictionaries/last-names.xml",
         "distanceThreshold": 20,
         "collation": "http://marklogic.com/collation/codepoint"
       }
     ]
   }
 }
}

There are three configurable properties for double-metaphone:

  • dictionary: the URI of a dictionary that will be created by the setup script
  • distance-threshold: see https://docs.marklogic.com/spell:suggest for information about how the distance-threshold affects values.
  • collation: used to identify the range index used to populate the dictionaries

Thesaurus

The thesaurus algorithm looks for synonyms based on a provided thesaurus. It looks up each value present in the document being matched and builds a query from the values found in the thesaurus. The query is built with the case-insensitive option. Note that values are converted to lower case before the lookup. For instance, if the first-name property has the value “Bill”, the algorithm will look up the value “bill”. The thesaurus might have the following entry:

<thesaurus xmlns="http://marklogic.com/xdmp/thesaurus">
  <entry>
    <term>will</term>
    <synonym>
    <term>bill</term>
    </synonym>
    <synonym>
    <term>billy</term>
    </synonym>
    <synonym>
    <term>william</term>
    </synonym>
  </entry>
</thesaurus>

The algorithm will then construct a case-insensitive query with the values “william”, “will”, and “billy”.

For more information about using thesauri in MarkLogic, including the required schema, see Using the Thesaurus Functions in the Search Developer’s Guide. You can insert a thesaurus using thsr:load/thsr.load or thsr:insert/thsr.insert, which will validate that the content matches the expected schema. You can also insert a thesaurus directly into your content database, which skips the validation step. In either case, use the URI at which you insert the thesaurus document to configure the thesaurus option.

To add this algorithm to your match configuration, add XML or JSON like the following, assuming that you have configured a property named “first-name”. Change the weights to work with your other properties.

<options xmlns="http://marklogic.com/smart-mastering/matcher">
  <algorithms>
    <algorithm
      name="thesaurus"
      function="thesaurus"
      namespace="http://marklogic.com/smart-mastering/algorithms"
      at="/com.marklogic.smart-mastering/algorithms/thesaurus.xqy"/>
  </algorithms>
  <scoring>
    <add property-name="first-name" weight="8"/>
    <expand property-name="first-name" algorithm-ref="thesaurus">
      <thesaurus>/dictionaries/first-name-thesaurus.xml</thesaurus>
    </expand>
  </scoring>
</options>

{
  "options": {
    "propertyDefs": {
      "property": [
        { "namespace": "", "localname": "PersonSurName", "name": "first-name" }
      ]
    },
    "algorithms": {
      "algorithm": [
        {
          "name": "thesaurus",
          "namespace": "http://marklogic.com/smart-mastering/algorithms",
          "function": "thesaurus",
          "at": "/com.marklogic.smart-mastering/algorithms/thesaurus.xqy"
        }
      ]
    },
    "scoring": {
      "add": [
        { "propertyName": "first-name", "weight": "8" }
      ],
      "expand": [
        {
          "propertyName": "first-name",
          "algorithmRef": "thesaurus",
          "weight": "8",
          "thesaurus": "/dictionaries/first-name-thesaurus.xml"
        }
      ]
    }
  }
}

There are two configurable properties for thesaurus:

  • thesaurus: the URI of a thesaurus that values will be drawn from. You must supply a thesaurus.
  • filter: corresponds to the filter parameter to https://docs.marklogic.com/thsr:expand

Zip

Allows matches between 5- and 9-digit US ZIP codes. For each zip in the original document, this algorithm generates a query to match values that have the same first five digits.

To add this algorithm to your match configuration, add XML or JSON like the following, assuming that you have configured a property named “zip”. Change the weights to work with your other properties.

<options xmlns="http://marklogic.com/smart-mastering/matcher">
  <algorithms>
    <algorithm
      name="zip-code"
      function="zip-match"
      namespace="http://marklogic.com/smart-mastering/algorithms"
      at="/com.marklogic.smart-mastering/algorithms/zip.xqy"/>
  </algorithms>
  <scoring>
    <add property-name="zip" weight="5"/>
    <expand property-name="zip" algorithm-ref="zip-code">
      <zip origin="5" weight="3"/>
      <zip origin="9" weight="2"/>
    </expand>
  </scoring>
</options>

{
  "options": {
    "algorithms": {
      "algorithm": [
        {
          "name": "zip-code",
          "namespace": "http://marklogic.com/smart-mastering/algorithms",
          "function": "zip-match",
          "at": "/com.marklogic.smart-mastering/algorithms/zip.xqy"
        }
      ]
    },
    "scoring": {
      "add": [
        { "propertyName": "zip", "weight": "5" }
      ],
      "expand": [
        {
          "propertyName": "zip",
          "algorithmRef": "zip-code",
          "zip": [
            { "origin": 5, "weight": 3 },
            { "origin": 9, "weight": 2 }
          ]
        }
      ]
    }
  }
}

Effect:

  • If the original document has a 5-digit zip:
    • A potential match with the same 5-digit zip will get 5 points (from the add).
    • A potential match with a 9-digit zip that starts with the same five digits will get 3 points (from the expand).
  • If the original document has a 9-digit zip:
    • A potential match with the same 9-digit zip will get 5 points.
    • A potential match with a 5-digit zip that matches the first five digits of the original document will get 2 points.
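
The expand portion of this behavior can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in zip.xqy: the function name zip_expand_weight is hypothetical, and stripping non-digit characters (so that a hyphenated ZIP+4 counts as nine digits) is an assumption about how values are normalized. The function returns only the weight contributed by the expand configuration; points from add are scored separately.

```python
import re


def zip_expand_weight(original_zip, candidate_zip, zip_weights):
    """Return the expand weight for a 5-digit/9-digit ZIP pair that shares
    the same first five digits, or 0 if the expand rule does not apply."""
    orig = re.sub(r"\D", "", original_zip)   # keep digits only
    cand = re.sub(r"\D", "", candidate_zip)
    if len(orig) not in (5, 9) or len(cand) not in (5, 9):
        return 0
    if len(orig) == len(cand):
        return 0  # same-length values are left to the exact-match add rule
    if orig[:5] != cand[:5]:
        return 0
    # The weight is chosen by the length of the ORIGINAL document's zip.
    return zip_weights.get(len(orig), 0)


# Mirrors <zip origin="5" weight="3"/> and <zip origin="9" weight="2"/>.
zip_weights = {5: 3, 9: 2}

print(zip_expand_weight("19103", "19103-1234", zip_weights))  # 3
print(zip_expand_weight("19103-1234", "19103", zip_weights))  # 2
```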

Reductions

In some cases, a combination of matching properties may suggest a match when there shouldn’t be one. Consider two relatives living together. When matched, two Person records have the same family name, same street address, city, and zip code. That might be enough points to trigger a match even though the two given names differ.

The reduce element gives a way to back off the scores in such cases. The algorithm-ref/algorithmRef must match the name of an algorithm element under algorithms. The weight attribute will be subtracted from the score if the algorithm matches.

Standard Reduction

To use the standard reduction algorithm, add XML or JSON to your match options like the following, assuming you have configured properties called “last-name” and “addr1”. The specified weight reduction will be applied if the listed properties all match.

Note that this algorithm requires that the match function calculates a list of which matches were scored. This feature is optional when calling match functions directly and disabled when calling match-and-merge.

<options xmlns="http://marklogic.com/smart-mastering/matcher">
  <algorithms>
    <algorithm name="std-reduce" function="standard-reduction"/>
  </algorithms>
  <scoring>
    <reduce algorithm-ref="std-reduce" weight="4">
      <all-match>
        <property>last-name</property>
        <property>addr1</property>
      </all-match>
    </reduce>
  </scoring>
</options>

{
  "options": {
    "algorithms": {
      "algorithm": [
        { "name": "std-reduce", "function": "standard-reduction" }
      ]
    },
    "scoring": {
      "reduce": [
        {
          "algorithmRef": "std-reduce",
          "weight": "4",
          "allMatch": { "property": ["last-name", "addr1"] }
        }
      ]
    }
  }
}
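
The arithmetic of the standard reduction can be sketched in Python as follows. This is a simplified illustration, not the library's implementation: the function name apply_standard_reduction, the starting score of 20, and the representation of the scored matches as a list of property names are assumptions made for the example. The list of matched properties plays the role of the list of scored matches that the match function must calculate.

```python
def apply_standard_reduction(score, matched_properties, reduce_config):
    """Subtract each reduction's weight when ALL of its listed properties
    appear among the properties that matched."""
    matched = set(matched_properties)
    for entry in reduce_config:
        required = entry["allMatch"]["property"]
        if all(name in matched for name in required):
            score -= int(entry["weight"])
    return score


# Mirrors the JSON form of scoring/reduce shown above.
reduce_config = [
    {
        "algorithmRef": "std-reduce",
        "weight": "4",
        "allMatch": {"property": ["last-name", "addr1"]},
    }
]

# Both listed properties matched, so the weight is subtracted.
print(apply_standard_reduction(20, ["last-name", "addr1", "zip"], reduce_config))  # 16
# "addr1" did not match, so the score is unchanged.
print(apply_standard_reduction(20, ["last-name", "zip"], reduce_config))  # 20
```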