The FlashText algorithm can replace or find keywords in a given text in one pass over a document. The time complexity isn't dependent on the number of terms being searched or replaced. Given a document of size N (characters) and a dictionary of M keywords, the time complexity is O(N). This algorithm is much faster than regex because regex time complexity is O(M*N).
The algorithm matches only compete words, with boundary characters on both sides. If it's searching for Apple. it won't find Pineapple. The algorithm also finds the longest match first.
Regex is considerably slow when there are thousands of terms to be searched for/replaced, or when having millions of documents. As the number of terms increase, time taken by regex increases almost linearly, and FlashText is almost a constant.
FlashText is an algorithm based on Trie dictionary and inspired by the Aho Corasick Algorithm. First, it takes all relevant keywords as inputs and using them, creates a trie dictionary.
For an input string, we iterate over it character by character. When a sequence of characters in the document match the trie dictionary sequence, we consider it as a complete match. We add the matched term to a list of keywords found.
For an input string, we iterate over it character by character. We create an empty return string and when a sequence of characters in the document doesn't match in the trie dictionary, we copy the original word as it is into the return string. When we do have a match, we add the standardised term instead. The return string is a copy of input string, with only matched terms replaced.
Steps:
- Building the trie dictionary
- Searching keywords
- Replacing keywords
We start with the root node, empty dictionary. We insert a word in the dictionary by inserting the first character to the root node and pointing that to an empty dictionary. The next character from the word goes as a key in this dictionary, and that again points to an empty dictionary. We do this with all words. At the end of the word we insert a special key to signify end of term, eot.
Searching for keywords, replacing keywords code in repo.