Default tokenization
Example usage
var index = new FullTextIndexBuilder<int>()
.WithDefaultTokenization(o =>o
.AccentInsensitive(true) // Default
.CaseInsensitive(true) // Default
.SplitOnPunctuation(true) // Default
.SplitOnCharacters('%', '#', '@')
.IgnoreCharacters('<', '>')
.WithStemming()
)
.Build();
TokenizerBuilder methods
Text Normalization
IgnoreCharacters(char[])
Configures the tokenizer to ignore certain characters as it is parsing input.
Ignoring characters will prevent them from acting as split characters, so care needs to be taken that your source
text doesn’t words delimited only by ignored characters, otherwise you may end up unexpectedly joining search terms
into one. For example, ignoring the ' character will mean that O'Reilly will be tokenized
as OReilly, but if your source text also contains she said'hello' then she and
saidhello will treated as tokens.
AccentInsensitive(bool)
true: Default The tokenizer will normalize characters with diacritics to common form. e.g. aigües and aigues will be equivalent.
Additionally, characters that can be logically expressed as two characters are expanded, e.g. laering will be equivalent to læring.
false: The tokenizer will be accent sensitive. Searching for aigües will not match aigues.
CaseInsensitive(bool)
true: Default The tokenizer will normalize all characters to uppercase. e.g. Cat and cat will be equivalent.
false: The tokenizer will be case sensitive. Searching for Cat will match Cat but not cat.
WithStemming()
Words will be stemmed using an implementation of the Porter Stemmer algorithm. For example, ABANDON, ABANDONED and ABANDONING will all
be treated as ABANDON. Currently only English is supported.
A custom stemmer can be used by implementing an IStemmer and using WithStemming(new YourStemmerImplementation()).
Word break modifiers
A tokenizer will always break words on separator (Char.IsSeparator) or control (Char.IsControl) characters.
SplitOnPunctuation(bool)
true: Default The tokenizer will split words on punctuation characters (e.g. those that match Char.IsPunctuation(char))
false: Only characters explicitly specified using SplitOnCharacters will be treated as word breaks.
SplitOnCharacters(params char[])
Allows for additional characters to cause word breaks for a tokenizer. E.g. SplitOnCharacters('$', '£').