title: Falsehoods Programmers Believe About Search
url: https://opensourceconnections.com/blog/2019/05/29/falsehoods-programmers-believe-about-search/
hash_url: 626468d153ec2e83731dbbd7133af224
As much as anyone, I’m a fan of resurrecting trends and memes and pretending they’re cool. In that vein, dear friends, I’ve exhumed the venerable “Falsehoods Programmers Believe” party from 4 years ago to bring you one about, no less, Search.
Search is a deceptively complex field, where competence is hard-won through training, practice, and experience. The list stands at a total of 105 falsehoods. I couldn’t mash up the ole 99-problems meme with this to cull 6 unworthy items, because they are all worthy. I will leave you with that brief introduction and, of course, the list:
- Search engines work like databases
- Search can be considered an additional feature just like any other
- Search can be added as a well performing feature to your existing product quickly
- Search can be added as a well performing feature to your existing product with reasonable effort
- Choosing the correct search engine is easy and you will always be happy with your decision
- Once set up, search will work the same way forever
- Once set up, search will work the same way for a while
- Once set up, search will work the same way for the next week
- The default search engine settings will deliver a good search experience
- Customers know what they are looking for
- Customers who know what they are looking for will search for it in the way you expect
- Customers who don’t know what they are looking for will search accordingly
- A customer using the same query twice expects the same results for both searches
- Customers only search for a few terms
- Customers only search for fewer than some set number of terms
- Customers never copy and paste a whole document into a search bar
- Customers balance quotes and parentheses
- Customers who don’t balance quotes or parentheses don’t expect phrasing or grouping
- You can pass the customer query directly into your search engine
- You can write a query parser that will always parse the query successfully
- You will never have to return a query parse error to the customer
- When you find the boolean operator ‘OR’, you always know it doesn’t mean Oregon
- Customers notice their own misspellings
- Customers don’t expect your search to correct misspellings
- It is possible to create a list of all misspellings
- It is possible to create an algorithm to handle all misspellings
- A misspelled word is never the same as another correctly spelled word
- All customers expect spelling correction to work the same
- All customers want their misspellings corrected
- A search should always return results, no matter how absurd
- If you don’t have any results to show, customers won’t mind
- When the perfect results are shown to the customer, they will notice it
- You don’t need to monitor search queries, results, and clicks
- Customers won’t get nervous that you are logging their search activity
- Search queries are not affected by GDPR
- Looking at the data, it is always possible to tell whether a customer found what they were looking for
- Customers will click on what they are looking for when they’ve found it
- You can build a search that works like Google
- You can build a search that works like Google sometimes
- You should use Google as a target for your search
- Customers don’t mind if your search doesn’t work like Google
- Customers don’t expect your search to work like Google
- Customers won’t compare you to Google
- A bad search, no matter how minor or how rare, will never reflect poorly on your product
- Since Google doesn’t use facets, customers don’t need them
- Facet hit counts are always correct
- Facets have no impact on performance
- You can just cache queries to get performant facets
- Personalized search is easy
- Learning to rank is easy and just requires a plugin
- You have enough data for learning-to-rank
- Over time, you can curate enough data for learning-to-rank
- You don’t need to spend lots of time formatting content for it to work well in your search engine
- Text extraction engines will always produce text that doesn’t need to be post-processed
- All your markup will be stripped as you expect it to be
- Content is well formed
- Content is mostly well formed
- Content is predictably well formed
- Content sourced from a database and templates is always formed the same way
- Content teams treat search as their top priority
- Manually changing content to improve search is easy
- Improving content can be automated with reasonable effort
- Queries for ‘C programming’ and ‘C++ programming’ will produce different results
- Queries for ‘401k’ and ‘401(k)’ will produce the same results
- Tokenization as it works out of the box is right for your content and queries
- Tokenization can be changed to meet the needs of your entire corpus and all queries
- Tokenization can be changed to meet the needs of most of your corpus and most queries
- Tokenization can be conditional
- You should roll your own tokenizer
- You will never have a debate about tokenization
- Using regular expressions for tokenization is a good idea
- Regular expressions have minimal performance impact
- You will never have a debate about regular expressions
- You should remove stop words
- You should not remove stop words
- You know what the list of stop words should be
- Stop words will never change
- When you find the stop word ‘in’, you always know it doesn’t mean Indiana
- It’s easy to make certain things case sensitive
- Case sensitivity is a good idea
- Synonyms are easy
- Synonyms will improve recall in the way you want
- Synonyms have the same relevance in all documents
- Synonyms for abbreviations and acronyms always work as you expect
- Synonyms can be extracted from your corpus with natural language processing
- Using Word2Vec for synonyms is a good idea
- Stemming will solve your recall problems
- Lemmatization will solve your recall problems
- Lemmatization dictionaries are static
- Languages don’t change
- Natural language processing (NLP) tools work perfectly
- Incorporating NLP into your analysis pipeline is straightforward
- Search queries are complete sentences and can be accurately tagged with parts of speech
- Showing a list of search suggestions is easy
- Suggestions should just use the out-of-the-box search engine suggestions
- Suggestions should incorporate customer query logs
- Customers would never type anything offensive into your search bar
- Customers would never try to hack you through your search bar
- Customers don’t need highlighting to find what they’ve searched for
- Default highlighters are good enough for all your content and queries
- Making a custom highlighter isn’t too difficult. It’s just matching strings, right?
- Making a custom highlighter that is better than the default version will take less than a year
- Turning on caching will solve your performance issues
- Customers don’t expect near-real-time updates
- A 30-second commit time is short enough for everyone
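
To make the two tokenization falsehoods about ‘C programming’ vs. ‘C++ programming’ and ‘401k’ vs. ‘401(k)’ a little more concrete, here is a minimal Python sketch. The `naive_tokenize` function is a hypothetical stand-in that lowercases and splits on anything that isn’t a letter or digit; it is not the analyzer of any particular search engine, though many default analyzers behave in a broadly similar way.

```python
import re


def naive_tokenize(text):
    """Lowercase the text and split on every run of non-alphanumeric characters.

    A deliberately simple, hypothetical tokenizer used only for illustration.
    """
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


queries = ["C programming", "C++ programming", "401k", "401(k)"]
tokens = {q: naive_tokenize(q) for q in queries}

for query, toks in tokens.items():
    print(f"{query!r} -> {toks}")

# 'C programming'   -> ['c', 'programming']
# 'C++ programming' -> ['c', 'programming']
# '401k'            -> ['401k']
# '401(k)'          -> ['401', 'k']
```

Under this assumption, the two programming queries collapse into identical tokens while the two retirement-plan queries come out different, which is the opposite of what most customers would expect.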