- The Regex replaced “Javascripting” with “JavaScript”, which solved 1 problem but created another. Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. The above quote is from this stack-exchange question and it came true for me. It turns out that Regex is fast if…
- But my corpus had over 20K keywords and 3 Million documents. When I benchmarked my Regex code, I found it was going to take 5 days to complete one run. Oh, the horror! The natural solution was to run it in parallel.
- I looked for existing solutions but couldn’t find much. So I wrote my own implementation and FlashText was born. Before we get into what FlashText is and how it works, let’s have a look at how it performs for search. (Chart: the red line at the bottom is the time taken by FlashText for search.) The chart shown…
- This makes skipping missing words really fast. The FlashText algorithm only went over each character of the input string “I like Python”.
- This is the true power of the FlashText algorithm. So when should you use FlashText? Simple answer: when the number of keywords > 500. For search, FlashText starts outperforming Regex after ~500 keywords. Complicated answer: Regex can search for keywords based on special characters like ^,$,*,\d,.
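The single-pass idea mentioned in the list above can be illustrated with a character trie. This is only a minimal sketch, not FlashText's own code: the names `build_trie`, `find_keywords`, and `END` are hypothetical, and it assumes keywords must start and end on word boundaries. Words whose first character is not in the trie are skipped without consulting any keyword, so the cost of a miss does not depend on how many keywords are in the dictionary.

```python
END = "_end_"  # hypothetical marker key; never collides with a single character


def build_trie(keywords):
    """Build a nested-dict trie from a {keyword: clean_name} mapping."""
    root = {}
    for word, clean_name in keywords.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = clean_name
    return root


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


def find_keywords(text, trie):
    """Return clean names of keywords found in text, scanning left to right."""
    found = []
    i, n = 0, len(text)
    while i < n:
        if not is_word_char(text[i]):
            i += 1
            continue
        # text[i] starts a word: walk the trie as far as the text allows.
        node, j, match = trie, i, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept only matches that end on a word boundary; keep the longest.
            if END in node and (j == n or not is_word_char(text[j])):
                match = (node[END], j)
        if match:
            found.append(match[0])
            i = match[1]
        else:
            # No keyword starts here: skip the rest of this word.
            while i < n and is_word_char(text[i]):
                i += 1
    return found


trie = build_trie({"Python": "Python", "Javascripting": "JavaScript"})
print(find_keywords("I like Python", trie))
```

In this sketch, "I" and "like" are rejected on their first character because neither "I" nor "l" is a key at the trie root, which is the behaviour the list describes as skipping missing words quickly.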
FlashText is a Python library that is efficient at both extracting keywords and replacing them.
When developers work with text, they often need to clean it up first. Sometimes it’s by replacing keywords, like replacing “Javascript” with “JavaScript”. Other times, we just want to find out whether “JavaScript” was mentioned in a document.

Data cleaning tasks like these are standard for most Data Science projects dealing with text. Data Science starts with data cleaning.

I had a very similar task to work on recently. I work as a Data Scientist at Belong.co and Natural Language Processing is half of my work. When I trained a Word2Vec model on our document corpus, it started giving synonyms as similar terms. “Javascripting” was coming as a similar term to “JavaScript”.

To resolve this, I wrote a regular expression (Regex) to replace all known synonyms with standardized names. The Regex replaced “Javascripting” with “JavaScript”, which solved 1 problem but created another.

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”

The above quote is from this stack-exchange question and it came true for me.

It turns out that Regex is fast if the number of keywords to be searched and replaced is in the 100s. But my corpus had over 20K keywords and 3 Million documents. When I benchmarked my Regex code, I found it was going to take 5 days to complete one run. Oh, the horror!

The natural solution was to run it in parallel. But that won’t help when we reach 10s of millions of documents and 100s of thousands of keywords. There had to be a better way!…
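The Regex approach described above might look something like the following sketch. The author's actual code is not shown in this excerpt, so the synonym table and the names `SYNONYMS`, `build_pattern`, and `standardize` are assumptions made for illustration.

```python
import re

# Hypothetical synonym table: variant spelling -> standardized name.
SYNONYMS = {
    "Javascripting": "JavaScript",
    "Javascript": "JavaScript",
}


def build_pattern(synonyms):
    """Compile one alternation that matches any known variant as a whole word."""
    # Longest variants first, so longer spellings are tried before shorter ones.
    variants = sorted(synonyms, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in variants)
    return re.compile(r"\b(?:" + alternation + r")\b")


PATTERN = build_pattern(SYNONYMS)


def standardize(text):
    """Replace every known variant in text with its standardized name."""
    return PATTERN.sub(lambda m: SYNONYMS[m.group(0)], text)


print(standardize("I like Javascripting and Javascript."))
```

With a pattern built this way, Python's `re` module tries the alternatives one after another at each candidate position, so the work per document tends to grow with the number of keywords. That is consistent with the slowdown reported above once the keyword list reaches the tens of thousands.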
Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.