
title: Falsehoods Programmers Believe About Search
url: https://opensourceconnections.com/blog/2019/05/29/falsehoods-programmers-believe-about-search/
hash_url: 626468d153

As much as anyone, I’m a fan of resurrecting trends and memes and pretending it’s cool. In that vein, dear friends, I’ve exhumed the venerable “Falsehoods Programmers Believe” party from 4 years ago to bring you one about, no less, Search.

Search is a deceptively complex field, where competence is hard-won through training, practice, and experience. The list stands at a total of 105 falsehoods. I couldn’t mash up the ole 99-problems meme with this to cull 6 unworthy items, because they are all worthy. I will leave you with that brief introduction and, of course, the list:

  • Search engines work like databases
  • Search can be considered an additional feature just like any other
  • Search can be added as a well performing feature to your existing product quickly
  • Search can be added as a well performing feature to your existing product with reasonable effort
  • Choosing the correct search engine is easy and you will always be happy with your decision
  • Once set up, search will work the same way forever
  • Once set up, search will work the same way for a while
  • Once set up, search will work the same way for the next week
  • The default search engine settings will deliver a good search experience
  • Customers know what they are looking for
  • Customers who know what they are looking for will search for it in the way you expect
  • Customers who don’t know what they are looking for will search accordingly
  • A customer using the same query twice expects the same results for both searches
  • Customers only search for a few terms
  • Customers only search for fewer than some set number of terms
  • Customers never copy and paste a whole document into a search bar
  • Customers balance quotes and parentheses
  • Customers that don’t balance quotes or parentheses don’t expect phrasing or grouping
  • You can pass the customer query directly into your search engine
  • You can write a query parser that will always parse the query successfully
  • You will never have to return a query parse error to the customer
  • When you find the boolean operator ‘OR’, you always know it doesn’t mean Oregon
  • Customers notice their own misspellings
  • Customers don’t expect your search to correct misspellings
  • It is possible to create a list of all misspellings
  • It is possible to create an algorithm to handle all misspellings
  • A misspelled word is never the same as another correctly spelled word
  • All customers expect spelling correction to work the same
  • All customers want their misspellings corrected
  • A search should always return results, no matter how absurd
  • If you don’t have any results to show, customers won’t mind
  • When the perfect results are shown to the customer, they will notice it
  • You don’t need to monitor search queries, results, and clicks
  • Customers won’t get nervous that you are logging their search activity
  • Search queries are not affected by GDPR
  • Looking at the data, it is always possible to tell whether a customer found what they were looking for
  • Customers will click on what they are looking for when they’ve found it
  • You can build a search that works like Google
  • You can build a search that works like Google sometimes
  • You should use Google as a target for your search
  • Customers don’t mind if your search doesn’t work like Google
  • Customers don’t expect your search to work like Google
  • Customers won’t compare you to Google
  • A bad search, no matter how minor or how rare, will never reflect poorly on your product
  • Since Google doesn’t use facets, customers don’t need them
  • Facet hit counts are always correct
  • Facets have no impact on performance
  • You can just cache queries to get performant facets
  • Personalized search is easy
  • Learning to rank is easy and just requires a plugin
  • You have enough data for learning-to-rank
  • Over time, you can curate enough data for learning-to-rank
  • You don’t need to spend lots of time formatting content for it to work well in your search engine
  • Text extraction engines will always produce text that doesn’t need to be post-processed
  • All your markup will be stripped as you expect it to be
  • Content is well formed
  • Content is mostly well formed
  • Content is predictably well formed
  • Content sourced from a database and templates is always formed the same way
  • Content teams treat search as their top priority
  • Manually changing content to improve search is easy
  • Improving content can be automated with reasonable effort
  • Queries for ‘C programming’ and ‘C++ programming’ will produce different results
  • Queries for ‘401k’ and ‘401(k)’ will produce the same results
  • Tokenization as it works out of the box is right for your content and queries
  • Tokenization can be changed to meet the needs of your entire corpus and all queries
  • Tokenization can be changed to meet the needs of most of your corpus and most queries
  • Tokenization can be conditional
  • You should roll your own tokenizer
  • You will never have a debate about tokenization
  • Using regular expressions for tokenization is a good idea
  • Regular expressions have minimal performance impact
  • You will never have a debate about regular expressions
  • You should remove stop words
  • You should not remove stop words
  • You know what the list of stop words should be
  • Stop words will never change
  • When you find the stop word ‘in’, you know it doesn’t mean Indiana
  • It’s easy to make certain things case sensitive
  • Case sensitivity is a good idea
  • Synonyms are easy
  • Synonyms will improve recall in the way you want
  • Synonyms have the same relevance in all documents
  • Synonyms for abbreviations and acronyms always work as you expect
  • Synonyms can be extracted from your corpus with natural language processing
  • Using Word2Vec for synonyms is a good idea
  • Stemming will solve your recall problems
  • Lemmatization will solve your recall problems
  • Lemmatization dictionaries are static
  • Languages don’t change
  • Natural language processing (NLP) tools work perfectly
  • Incorporating NLP into your analysis pipeline is straightforward
  • Search queries are complete sentences and can be accurately tagged with parts of speech
  • Showing a list of search suggestions is easy
  • Suggestions should just use the out-of-the-box search engine suggestions
  • Suggestions should incorporate customer query logs
  • Customers would never type anything offensive into your search bar
  • Customers would never try to hack you through your search bar
  • Customers don’t need highlighting to find what they’ve searched for
  • Default highlighters are good enough for all your content and queries
  • Making a custom highlighter isn’t too difficult. It’s just matching strings, right?
  • Making a custom highlighter that is better than the default version will take less than a year
  • Turning on caching will solve your performance issues
  • Customers don’t expect near-real-time updates
  • A 30-second commit time is short enough for everyone
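
To make the tokenization items concrete, here is a small sketch of how the ‘C programming’ / ‘C++ programming’ and ‘401k’ / ‘401(k)’ falsehoods can bite. The `naive_tokenize` function is a hypothetical toy of my own, not the analyzer of any particular search engine; it simply lowercases the text and keeps runs of letters and digits, which is roughly the kind of behavior people assume is harmless.

```python
import re


def naive_tokenize(text: str) -> list[str]:
    """Toy tokenizer (an assumption for illustration, not a real engine's analyzer).

    Lowercases the input and keeps only runs of ASCII letters and digits,
    so punctuation such as '+', '(' and ')' acts as a separator and is dropped.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


if __name__ == "__main__":
    for query in ["C programming", "C++ programming", "401k", "401(k)"]:
        print(f"{query!r:20} -> {naive_tokenize(query)}")

    # 'C' and 'C++' collapse to the same tokens, so the two queries
    # cannot produce different results with this tokenizer.
    print(naive_tokenize("C programming") == naive_tokenize("C++ programming"))

    # '401k' stays one token while '401(k)' is split in two, so the two
    # queries do not match the same terms with this tokenizer.
    print(naive_tokenize("401k") == naive_tokenize("401(k)"))
```

With this toy, the two ‘C’ queries become indistinguishable while the two ‘401k’ spellings diverge, which is the opposite of what most customers would expect. A real analysis chain may behave differently, so it is worth running your own queries through whatever analyzer you actually use.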