Fuzzy Matching
Fuzzy matches can be explicitly searched for using the LIFTI query syntax, or implied as the default for searches by configuring the index.
LIFTI uses Levenshtein distance to perform fuzzy matches between a search term and tokens in the index. The distance between two words is the number of edits that are required to match them, including:
- insertions: fid would match find
- deletions: foood would match food
- substitutions: frnd would match find
- transpositions: fnid would match find - Transpositions are a special case, because although two characters are affected, it is considered a single edit.
The resulting Levenshtein distance between any matched term and the search term is used to reduce the score of the match. This means that documents containing words that are closer matches will typically be surfaced higher up in the search results.
Configuration
To prevent a combinatorial explosion of potential matches, LIFTI provides two control mechanisms for fuzzy matching:
- Maximum number of edits - the total number of edits that can be used in any potential match. The default for this value is calculated as
search term length/2
which allows for a larger number of edits for longer search terms. Search terms of just a single character will not allow any edits, as the resulting value will be zero (the formula is an integer calculation). - Maximum number of sequential edits - the maximum number of edits that can be found sequentially in any potential match. The default for this value is calculated as
max(1, search term length/4)
. This default allows for a growing number of sequential edits, however this will never drop below a value of one.
When providing your own overrides for these calculations, be aware that if your configuration for either results in a value of zero, then the fuzzy match will become an exact match, as no edits will be allowed.
Example
With a max edits of 3 and max sequential edits of 1:
- feed will not match food because it requires two sequential edits
- redy will not match friendly because it requires 4 insertions
Default values can be configured at the index level, and can either be expressed as a static value, or a value calculated from the length of the search term.