Out-of-the-Box Match Algorithms
Smart Mastering provides the following match algorithms that you can use without having to write any code. A match algorithm generates a query that is used to identify and score potential matches. In addition to the built-in algorithms, you can write your own custom functions.
Configuring Add
The simplest way to look for matches is to configure a property under the scoring/add part of the match options (see Matching Options for a complete example).
Standard Algorithm
The standard algorithm is used for properties configured under the scoring/add section of the match options. This algorithm builds a query that looks for exact matches. For example, if there is a property called “state” that holds 2-character abbreviations of US states, and the document for which Smart Mastering is looking for matches has the value “PA”, the standard algorithm will look for other documents that have the value “PA” in their state properties. The query is run with the case-insensitive option.
When configuring scoring/add properties, specify the property-name/propertyName and the weight.
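The exact-match behavior described above can be pictured with a small sketch. This is only an illustration in Python, not Smart Mastering's implementation (the real algorithm builds a MarkLogic query); the function name `standard_score`, the dictionary-based documents, and the integer weight are assumptions made for the example.

```python
def standard_score(source, candidate, add_rules):
    """Sum the weights of properties whose values match exactly, ignoring case.

    Hypothetical model: documents are plain dicts and weights are integers.
    """
    total = 0
    for rule in add_rules:
        name = rule["propertyName"]
        src = source.get(name)
        cand = candidate.get(name)
        # Case-insensitive comparison mirrors the case-insensitive query option.
        if src is not None and cand is not None and src.casefold() == cand.casefold():
            total += rule["weight"]
    return total


add_rules = [{"propertyName": "state", "weight": 8}]  # weight chosen for illustration
source = {"state": "PA"}

print(standard_score(source, {"state": "pa"}, add_rules))  # 8
print(standard_score(source, {"state": "NJ"}, add_rules))  # 0
```

A candidate that lacks the property contributes nothing for that rule in this sketch.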
Configuring Expand
To look for inexact matches, configure a property under the scoring/expand section of the match options. You may specify a property here that is also listed under add; in this case, the scores from both sections will be added together. For each property configured under scoring/expand, specify an algorithm and options (for algorithms that support options).
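As a rough sketch of how the add and expand scores combine for one property, the following Python models each expand algorithm as a simple predicate. That is an assumption for illustration only: the real algorithms generate queries, and `same_prefix`, the `same-prefix` name, and the weights are hypothetical.

```python
def exact_match(source_value, candidate_value):
    """Case-insensitive exact comparison, standing in for the add section."""
    return source_value.lower() == candidate_value.lower()


def same_prefix(source_value, candidate_value, length=3):
    """Hypothetical inexact algorithm: the first few characters agree."""
    return source_value.lower()[:length] == candidate_value.lower()[:length]


def score(source, candidate, add_rules, expand_rules, algorithms):
    """Add the weight of every add rule and every expand rule that matches."""
    total = 0
    for rule in add_rules:
        name = rule["propertyName"]
        if name in source and name in candidate and exact_match(source[name], candidate[name]):
            total += rule["weight"]
    for rule in expand_rules:
        name = rule["propertyName"]
        matcher = algorithms[rule["algorithmRef"]]  # look up the algorithm by its name
        if name in source and name in candidate and matcher(source[name], candidate[name]):
            total += rule["weight"]
    return total


algorithms = {"same-prefix": same_prefix}
add_rules = [{"propertyName": "last-name", "weight": 8}]
expand_rules = [{"propertyName": "last-name", "algorithmRef": "same-prefix", "weight": 4}]
source = {"last-name": "Johnson"}

print(score(source, {"last-name": "johnson"}, add_rules, expand_rules, algorithms))   # 12
print(score(source, {"last-name": "Johnston"}, add_rules, expand_rules, algorithms))  # 4
print(score(source, {"last-name": "Smith"}, add_rules, expand_rules, algorithms))     # 0
```

The first candidate satisfies both sections, so its weights are summed; the second satisfies only the inexact rule.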
Double Metaphone
Allow matches that are similar in string distance. This algorithm uses a dictionary generated from current content in the database. The query generated by this algorithm is based on values drawn from this dictionary, so the dictionary should be regenerated occasionally as new values are inserted into the database. To regenerate it, re-run the setup-double-metaphone function, either manually or by re-installing the options using matcher:save-options.
Note that to generate the dictionary, there must be a range index on the XML element or JSON property where the values can be found.
To add this algorithm to your match configuration, add XML or JSON like the following, assuming that you have configured a property named “last-name”. Change the weights to work with your other properties.
<options xmlns="http://marklogic.com/smart-mastering/matcher">
<algorithms>
<algorithm
name="double-metaphone"
function="double-metaphone"
namespace="http://marklogic.com/smart-mastering/algorithms"
at="/com.marklogic.smart-mastering/algorithms/double-metaphone.xqy"/>
</algorithms>
<scoring>
<add property-name="last-name" weight="8"/>
<expand property-name="last-name" algorithm-ref="double-metaphone">
<distance-threshold>20</distance-threshold>
<dictionary>/dictionaries/last-names.xml</dictionary>
<collation>http://marklogic.com/collation/codepoint</collation>
</expand>
</scoring>
</options>
{
"options": {
"algorithms": {
"algorithm": [
{
"name": "dbl-metaphone",
"namespace": "http://marklogic.com/smart-mastering/algorithms",
"function": "double-metaphone",
"at": "/com.marklogic.smart-mastering/algorithms/double-metaphone.xqy"
}
]
},
"scoring": {
"add": [
{ "propertyName": "last-name", "weight": "8" }
],
"expand": [
{
"propertyName": "last-name",
"algorithmRef": "dbl-metaphone",
"weight": "8",
"dictionary": "/dictionaries/last-names.xml",
"distanceThreshold": 20,
"collation": "http://marklogic.com/collation/codepoint"
}
]
}
}
}
There are three configurable properties for double-metaphone:
- dictionary: the URI of a dictionary that will be created by the setup script
- distance-threshold: see https://docs.marklogic.com/spell:suggest for information about how the distance-threshold affects values.
- collation: used to identify the range index used to populate the dictionaries
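To see why the dictionary needs to be regenerated, here is a minimal Python sketch of dictionary-based suggestion. It is not the double metaphone algorithm and does not use MarkLogic's distance measure: plain Levenshtein edit distance is swapped in as a stand-in, and the threshold of 1 is on that stand-in's scale, not the scale used by distance-threshold. The function names and sample names are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_dictionary(values):
    """Snapshot of the distinct values currently present (lower-cased)."""
    return sorted({v.lower() for v in values})


def suggest(dictionary, value, distance_threshold):
    """Dictionary entries within the threshold of the given value."""
    value = value.lower()
    return [w for w in dictionary if levenshtein(value, w) <= distance_threshold]


last_names = ["Smith", "Smyth", "Jones"]
dictionary = build_dictionary(last_names)
print(suggest(dictionary, "Smith", 1))  # ['smith', 'smyth']

# A new value arrives, but the old dictionary does not know about it yet.
last_names.append("Smithe")
print(suggest(dictionary, "Smith", 1))  # ['smith', 'smyth']

# After regenerating the dictionary, the new value can be suggested.
dictionary = build_dictionary(last_names)
print(suggest(dictionary, "Smith", 1))  # ['smith', 'smithe', 'smyth']
```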
Thesaurus
The thesaurus algorithm will look for synonyms based on a provided thesaurus. It will look up the value(s) present in the document that matching is being run on and build a query based on the values found in the thesaurus. The query is built with the case-insensitive option. Note that the lookup of values will be done after converting the original value to lower-case. For instance, if the first-name property has the value “Bill”, the algorithm will look up the value “bill”. The thesaurus might have the following entry:
<thesaurus xmlns="http://marklogic.com/xdmp/thesaurus">
<entry>
<term>bill</term>
<synonym>
<term>will</term>
</synonym>
<synonym>
<term>billy</term>
</synonym>
<synonym>
<term>william</term>
</synonym>
</entry>
</thesaurus>
The algorithm will then construct a case-insensitive query with the values “william”, “will”, and “billy”.
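The lookup step can be sketched in a few lines of Python. This is an illustration under assumptions, not the algorithm's code: the thesaurus is modeled as a plain dictionary keyed by lower-case term, and the helper names are hypothetical.

```python
# Hypothetical in-memory thesaurus: lower-case term -> list of synonyms.
thesaurus = {"bill": ["william", "will", "billy"]}


def synonyms_for(thesaurus, value):
    """Look up the lower-cased value and return its synonyms (empty if none)."""
    return list(thesaurus.get(value.lower(), []))


def matches_synonym(thesaurus, source_value, candidate_value):
    """Case-insensitive check of a candidate value against the synonyms."""
    return candidate_value.lower() in synonyms_for(thesaurus, source_value)


print(synonyms_for(thesaurus, "Bill"))                 # ['william', 'will', 'billy']
print(matches_synonym(thesaurus, "Bill", "William"))   # True
print(matches_synonym(thesaurus, "Bill", "Robert"))    # False
```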
For more information about using thesauri in MarkLogic, including the required schema, see Using the Thesaurus Functions in the Search Developer’s Guide. You can insert a thesaurus using thsr:load/thsr.load or thsr:insert/thsr.insert, which will validate that the content matches the expected schema. You can also directly insert a thesaurus into your content database, which skips the validation step. In either case, use the URI at which you insert the thesaurus document to configure the thesaurus option.
To add this algorithm to your match configuration, add XML or JSON like the following, assuming that you have configured a property named “first-name”. Change the weights to work with your other properties.
<options xmlns="http://marklogic.com/smart-mastering/matcher">
<algorithms>
<algorithm
name="thesaurus"
function="thesaurus"
namespace="http://marklogic.com/smart-mastering/algorithms"
at="/com.marklogic.smart-mastering/algorithms/thesaurus.xqy"/>
</algorithms>
<scoring>
<add property-name="first-name" weight="8"/>
<expand property-name="first-name" algorithm-ref="thesaurus">
<thesaurus>/dictionaries/first-name-thesaurus.xml</thesaurus>
</expand>
</scoring>
</options>
{
"options": {
"propertyDefs": {
"property": [
{ "namespace": "", "localname": "PersonSurName", "name": "first-name" }
]
},
"algorithms": {
"algorithm": [
{
"name": "thesaurus",
"namespace": "http://marklogic.com/smart-mastering/algorithms",
"function": "thesaurus",
"at": "/com.marklogic.smart-mastering/algorithms/thesaurus.xqy"
}
]
},
"scoring": {
"add": [
{ "propertyName": "first-name", "weight": "8" }
],
"expand": [
{
"propertyName": "first-name",
"algorithmRef": "thesaurus",
"weight": "8",
"thesaurus": "/dictionaries/first-name-thesaurus.xml"
}
]
}
}
}
There are two configurable properties for thesaurus:
- thesaurus: the URI of a thesaurus that values will be drawn from. You must supply a thesaurus.
- filter: corresponds to the filter parameter to https://docs.marklogic.com/thsr:expand
Zip
Allow matches between 5- and 9-digit US ZIP codes. For each zip in the original document, this algorithm generates a query to match values that have the same first five digits.
To add this algorithm to your match configuration, add XML or JSON like the following, assuming that you have configured a property named “zip”. Change the weights to work with your other properties.
<options xmlns="http://marklogic.com/smart-mastering/matcher">
<algorithms>
<algorithm
name="zip-code"
function="zip-match"
namespace="http://marklogic.com/smart-mastering/algorithms"
at="/com.marklogic.smart-mastering/algorithms/zip.xqy"/>
</algorithms>
<scoring>
<add property-name="zip" weight="5"/>
<expand property-name="zip" algorithm-ref="zip-code">
<zip origin="5" weight="3"/>
<zip origin="9" weight="2"/>
</expand>
</scoring>
</options>
{
"options": {
"algorithms": {
"algorithm": [
{
"name": "zip-code",
"namespace": "http://marklogic.com/smart-mastering/algorithms",
"function": "zip-match",
"at": "/com.marklogic.smart-mastering/algorithms/zip.xqy"
}
]
},
"scoring": {
"add": [
{ "propertyName": "zip", "weight": "5" }
],
"expand": [
{
"propertyName": "zip",
"algorithmRef": "zip-code",
"zip": [
{ "origin": 5, "weight": 3 },
{ "origin": 9, "weight": 2 }
]
}
]
}
}
}
Effect:
- If the original document has a 5-digit zip:
  - A potential match with the same 5-digit zip will get 5 points (from the add).
  - A potential match with a 9-digit zip that starts with the same five digits will get 3 points (from the expand).
- If the original document has a 9-digit zip:
  - A potential match with the same 9-digit zip will get 5 points.
  - A potential match with a 5-digit zip that matches the first five digits of the original document will get 2 points.
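The contribution of the expand section alone can be sketched as follows. This Python is an illustration under assumptions, not the zip-match function itself: the helper names are hypothetical, values are normalized by keeping only digits (so "12345-6789" and "123456789" are treated alike), and the weights mirror the origin 5 and origin 9 settings in the sample configuration. Points from the add section are scored separately and are not included.

```python
def zip_digits(value):
    """Keep only the digits, so '12345-6789' becomes '123456789'."""
    return "".join(ch for ch in value if ch.isdigit())


def zip_expand_weight(origin_zip, candidate_zip, zip_weights):
    """Weight contributed by the expand section for a 5-vs-9 digit match.

    zip_weights maps the origin length (5 or 9) to its configured weight.
    """
    origin = zip_digits(origin_zip)
    candidate = zip_digits(candidate_zip)
    if len(origin) not in (5, 9) or len(candidate) not in (5, 9):
        return 0
    if len(origin) == len(candidate):
        return 0  # same-length values are left to the exact match under add
    if origin[:5] != candidate[:5]:
        return 0
    return zip_weights.get(len(origin), 0)


zip_weights = {5: 3, 9: 2}  # from <zip origin="5" weight="3"/> and <zip origin="9" weight="2"/>

print(zip_expand_weight("12345", "12345-6789", zip_weights))  # 3
print(zip_expand_weight("12345-6789", "12345", zip_weights))  # 2
print(zip_expand_weight("12345", "12345", zip_weights))       # 0
print(zip_expand_weight("12345", "54321-6789", zip_weights))  # 0
```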
Reductions
In some cases, a combination of matching properties may suggest a match when there shouldn’t be one. Consider two relatives living together. Their two Person records have the same family name, street address, city, and zip code. That might be enough points to trigger a match even though the two given names differ.
The reduce element gives a way to back off the scores in such cases. The algorithm-ref/algorithmRef must match the name of an algorithm element under algorithms. The weight attribute will be subtracted from the score if the algorithm matches.
Standard Reduction
To use the standard reduction algorithm, add XML or JSON to your match options like the following, assuming you have configured properties called “last-name” and “addr1”. The specified weight reduction will be applied if the listed properties all match.
Note that this algorithm requires that the match function calculates a list of which matches were scored. This feature is optional when calling match functions directly and disabled when calling match-and-merge.
<options xmlns="http://marklogic.com/smart-mastering/matcher">
<algorithms>
<algorithm name="std-reduce" function="standard-reduction"/>
</algorithms>
<scoring>
<reduce algorithm-ref="std-reduce" weight="4">
<all-match>
<property>last-name</property>
<property>addr1</property>
</all-match>
</reduce>
</scoring>
</options>
{
"options": {
"algorithms": {
"algorithm": [
{ "name": "std-reduce", "function": "standard-reduction" }
]
},
"scoring": {
"reduce": [
{
"algorithmRef": "std-reduce",
"weight": "4",
"allMatch": { "property": ["last-name", "addr1"] }
}
]
}
}
}
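The effect of a standard reduction on a score can be sketched like this. It is a simplified Python model, not the standard-reduction function: the starting score, the set of matched property names, and the integer weight are assumptions for the example.

```python
def apply_reductions(score, matched_properties, reduce_rules):
    """Subtract a rule's weight when every property it lists has matched."""
    for rule in reduce_rules:
        required = rule["allMatch"]["property"]
        if all(name in matched_properties for name in required):
            score -= rule["weight"]
    return score


reduce_rules = [
    {"algorithmRef": "std-reduce", "weight": 4,
     "allMatch": {"property": ["last-name", "addr1"]}}
]

# Both listed properties matched, so the reduction applies.
print(apply_reductions(20, {"last-name", "addr1", "city"}, reduce_rules))  # 16

# Only one of the listed properties matched, so the score is unchanged.
print(apply_reductions(20, {"last-name", "city"}, reduce_rules))  # 20
```

The set of matched properties here plays the role of the list of scored matches that the match function must calculate for this algorithm to work.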