title: Falsehoods Programmers Believe About Search
url: https://opensourceconnections.com/blog/2019/05/29/falsehoods-programmers-believe-about-search/
hash_url: 626468d153ec2e83731dbbd7133af224
<p>As much as anyone, I’m a fan of resurrecting trends and memes and pretending it’s cool. In that vein, dear friends, I’ve exhumed the venerable “Falsehoods Programmers Believe” party from 4 years ago to bring you one about, no less, Search.</p>
<p>Search is a deceptively complex field, where competence is hard-won through training, practice, and experience. The list stands at a total of 105 falsehoods. I couldn’t mash up the ole 99-problems meme with this to cull 6 unworthy items, because they are all worthy. I will leave you with that brief introduction and, of course, the list:</p>
<ul>
<li>Search engines work like databases</li>
<li>Search can be considered an additional feature just like any other</li>
<li>Search can be added as a well-performing feature to your existing product quickly</li>
<li>Search can be added as a well-performing feature to your existing product with reasonable effort</li>
<li>Choosing the correct search engine is easy and you will always be happy with your decision</li>
<li>Once set up, search will work the same way forever</li>
<li>Once set up, search will work the same way for a while</li>
<li>Once set up, search will work the same way for the next week</li>
<li>The default search engine settings will deliver a good search experience</li>
<li>Customers know what they are looking for</li>
<li>Customers who know what they are looking for will search for it in the way you expect</li>
<li>Customers who don’t know what they are looking for will search accordingly</li>
<li>A customer using the same query twice expects the same results for both searches</li>
<li>Customers only search for a few terms</li>
<li>Customers only search for fewer than some set number of terms</li>
<li>Customers never copy and paste a whole document into a search bar</li>
<li>Customers balance quotes and parentheses</li>
<li>Customers who don’t balance quotes or parentheses don’t expect phrasing or grouping</li>
<li>You can pass the customer query directly into your search engine</li>
<li>You can write a query parser that will always parse the query successfully</li>
<li>You will never have to return a query parse error to the customer</li>
<li>When you find the boolean operator ‘OR’, you always know it doesn’t mean Oregon</li>
<li>Customers notice their own misspellings</li>
<li>Customers don’t expect your search to correct misspellings</li>
<li>It is possible to create a list of all misspellings</li>
<li>It is possible to create an algorithm to handle all misspellings</li>
<li>A misspelled word is never the same as another correctly spelled word</li>
<li>All customers expect spelling correction to work the same</li>
<li>All customers want their misspellings corrected</li>
<li>A search should always return results, no matter how absurd</li>
<li>If you don’t have any results to show, customers won’t mind</li>
<li>When the perfect results are shown to the customer, they will notice it</li>
<li>You don’t need to monitor search queries, results, and clicks</li>
<li>Customers won’t get nervous that you are logging their search activity</li>
<li>Search queries are not affected by GDPR</li>
<li>Looking at the data, it is always possible to tell whether a customer found what they were looking for</li>
<li>Customers will click on what they are looking for when they’ve found it</li>
<li>You can build a search that works like Google</li>
<li>You can build a search that works like Google sometimes</li>
<li>You should use Google as a target for your search</li>
<li>Customers don’t mind if your search doesn’t work like Google</li>
<li>Customers don’t expect your search to work like Google</li>
<li>Customers won’t compare you to Google</li>
<li>A bad search, no matter how minor or how rare, will never reflect poorly on your product</li>
<li>Since Google doesn’t use facets, customers don’t need them</li>
<li>Facet hit counts are always correct</li>
<li>Facets have no impact on performance</li>
<li>You can just cache queries to get performant facets</li>
<li>Personalized search is easy</li>
<li>Learning to rank is easy and just requires a plugin</li>
<li>You have enough data for learning-to-rank</li>
<li>Over time, you can curate enough data for learning-to-rank</li>
<li>You don’t need to spend lots of time formatting content for it to work well in your search engine</li>
<li>Text extraction engines will always produce text that doesn’t need to be post-processed</li>
<li>All your markup will be stripped as you expect it to be</li>
<li>Content is well formed</li>
<li>Content is mostly well formed</li>
<li>Content is predictably well formed</li>
<li>Content sourced from a database and templates is formed the same way</li>
<li>Content teams treat search as their top priority</li>
<li>Manually changing content to improve search is easy</li>
<li>Improving content can be automated with reasonable effort</li>
<li>Queries for ‘C programming’ and ‘C++ programming’ will produce different results</li>
<li>Queries for ‘401k’ and ‘401(k)’ will produce the same results</li>
<li>Tokenization as it works out of the box is right for your content and queries</li>
<li>Tokenization can be changed to meet the needs of your entire corpus and all queries</li>
<li>Tokenization can be changed to meet the needs of most of your corpus and most queries</li>
<li>Tokenization can be conditional</li>
<li>You should roll your own tokenizer</li>
<li>You will never have a debate about tokenization</li>
<li>Using regular expressions for tokenization is a good idea</li>
<li>Regular expressions have minimal performance impact</li>
<li>You will never have a debate about regular expressions</li>
<li>You should remove stop words</li>
<li>You should not remove stop words</li>
<li>You know what the list of stop words should be</li>
<li>Stop words will never change</li>
<li>When you find the stop word ‘in’, you know it doesn’t mean Indiana</li>
<li>It’s easy to make certain things case sensitive</li>
<li>Case sensitivity is a good idea</li>
<li>Synonyms are easy</li>
<li>Synonyms will improve recall in the way you want</li>
<li>Synonyms have the same relevance in all documents</li>
<li>Synonyms for abbreviations and acronyms always work as you expect</li>
<li>Synonyms can be extracted from your corpus with natural language processing</li>
<li>Using Word2Vec for synonyms is a good idea</li>
<li>Stemming will solve your recall problems</li>
<li>Lemmatization will solve your recall problems</li>
<li>Lemmatization dictionaries are static</li>
<li>Languages don’t change</li>
<li>Natural language processing (NLP) tools work perfectly</li>
<li>Incorporating NLP into your analysis pipeline is straightforward</li>
<li>Search queries are complete sentences and can be accurately tagged with parts of speech</li>
<li>Showing a list of search suggestions is easy</li>
<li>Suggestions should just use the out-of-the-box search engine suggestions</li>
<li>Suggestions should incorporate customer query logs</li>
<li>Customers would never type anything offensive into your search bar</li>
<li>Customers would never try to hack you through your search bar</li>
<li>Customers don’t need highlighting to find what they’ve searched for</li>
<li>Default highlighters are good enough for all your content and queries</li>
<li>Making a custom highlighter isn’t too difficult. It’s just matching strings, right?</li>
<li>Making a custom highlighter that is better than the default version will take less than a year</li>
<li>Turning on caching will solve your performance issues</li>
<li>Customers don’t expect near-real-time updates</li>
<li>A 30-second commit time is short enough for everyone</li>
</ul>
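<p>To make the two query falsehoods about ‘C programming’ versus ‘C++ programming’ and ‘401k’ versus ‘401(k)’ concrete, here is a minimal sketch in Python. The <code>naive_tokenize</code> function is hypothetical and written only for illustration; it is not the tokenizer of any particular search engine. It assumes a simple analysis chain that lowercases the text and keeps runs of ASCII letters and digits, which is roughly what a regex-based tokenizer might do.</p>

```python
import re


def naive_tokenize(text):
    """Hypothetical tokenizer: lowercase, then keep runs of ASCII letters and digits.

    Punctuation such as '+', '(' and ')' is treated as a separator and dropped.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


# The '++' is discarded, so both queries yield the same tokens.
print(naive_tokenize("C programming"))    # ['c', 'programming']
print(naive_tokenize("C++ programming"))  # ['c', 'programming']

# The parentheses split the second query into two tokens.
print(naive_tokenize("401k"))             # ['401k']
print(naive_tokenize("401(k)"))           # ['401', 'k']
```

<p>Under these assumptions the two ‘C’ queries become indistinguishable while the two ‘401k’ spellings diverge, which is the opposite of what a customer would expect. A real analyzer will behave differently in the details, so it is worth running your own queries through whatever analysis chain you actually use.</p>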