A place to cache linked articles (think custom and personal wayback machine)

index.md 8.8KB

  1. title: Why wordfreq will not be updated
  2. url: https://github.com/rspeer/wordfreq/blob/master/SUNSET.md#i-dont-want-to-be-part-of-this-scene-anymore
  3. hash_url: 1cd6127ccec88387f4804f0c3cf1011a
  4. archive_date: 2024-09-30
  5. og_image: https://opengraph.githubassets.com/23640beea69cc6e481c29b906530677a890056e4755f495489cf39654858b963/rspeer/wordfreq
  6. description: Access a database of word frequencies, in various natural languages. - rspeer/wordfreq
  7. favicon: https://github.githubassets.com/favicons/favicon.png
  8. language: en_US
  9. <p dir="auto">This documentation page has gotten a lot of attention recently! I
  10. think most of the people who find it understand where I'm coming from. I'd
  11. like to highlight a couple of things, now that people are linking to this
  12. page from all sorts of contexts.</p>
  13. <ul dir="auto">
  14. <li>
  15. <p dir="auto">I still work on open-source libraries. Here's <a href="https://github.com/rspeer/python-ftfy">ftfy</a>,
  16. the popular multi-purpose Unicode fixer.</p>
  17. </li>
  18. <li>
  19. <p dir="auto">You could see this freezing of wordfreq data as a good thing. Many people
  20. have found wordfreq useful, and the latest version isn't going away. The
  21. conclusion that I'm documenting here is that <em>updating it would make it
  22. worse</em>, so instead, I'm not updating it. It'll become outdated over time,
  23. but it won't get actively worse. That's a pretty okay fate for something
  24. on the Internet!</p>
  25. </li>
  26. </ul>
  27. <div class="markdown-heading" dir="auto"><h1 tabindex="-1" class="heading-element" dir="auto">Why wordfreq will not be updated</h1><a id="user-content-why-wordfreq-will-not-be-updated" class="anchor" aria-label="Permalink: Why wordfreq will not be updated" href="#why-wordfreq-will-not-be-updated"><svg class="octicon octicon-link" viewbox="0 0 16 16" version="1.1" aria-hidden="true"><path d="m7.775 3.275 1.25-1.25a3.5 3.5 0 1 1 4.95 4.95l-2.5 2.5a3.5 3.5 0 0 1-4.95 0 .751.751 0 0 1 .018-1.042.751.751 0 0 1 1.042-.018 1.998 1.998 0 0 0 2.83 0l2.5-2.5a2.002 2.002 0 0 0-2.83-2.83l-1.25 1.25a.751.751 0 0 1-1.042-.018.751.751 0 0 1-.018-1.042Zm-4.69 9.64a1.998 1.998 0 0 0 2.83 0l1.25-1.25a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042l-1.25 1.25a3.5 3.5 0 1 1-4.95-4.95l2.5-2.5a3.5 3.5 0 0 1 4.95 0 .751.751 0 0 1-.018 1.042.751.751 0 0 1-1.042.018 1.998 1.998 0 0 0-2.83 0l-2.5 2.5a1.998 1.998 0 0 0 0 2.83Z"></path></svg></a></div>
  28. <p dir="auto">The wordfreq data is a snapshot of language that could be found in various
  29. online sources up through 2021. There are several reasons why it will not be
  30. updated anymore.</p>
  31. <div class="markdown-heading" dir="auto"><h2 tabindex="-1" class="heading-element" dir="auto">Generative AI has polluted the data</h2><a id="user-content-generative-ai-has-polluted-the-data" class="anchor" aria-label="Permalink: Generative AI has polluted the data" href="#generative-ai-has-polluted-the-data"><svg class="octicon octicon-link" viewbox="0 0 16 16" version="1.1" aria-hidden="true"><path d="m7.775 3.275 1.25-1.25a3.5 3.5 0 1 1 4.95 4.95l-2.5 2.5a3.5 3.5 0 0 1-4.95 0 .751.751 0 0 1 .018-1.042.751.751 0 0 1 1.042-.018 1.998 1.998 0 0 0 2.83 0l2.5-2.5a2.002 2.002 0 0 0-2.83-2.83l-1.25 1.25a.751.751 0 0 1-1.042-.018.751.751 0 0 1-.018-1.042Zm-4.69 9.64a1.998 1.998 0 0 0 2.83 0l1.25-1.25a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042l-1.25 1.25a3.5 3.5 0 1 1-4.95-4.95l2.5-2.5a3.5 3.5 0 0 1 4.95 0 .751.751 0 0 1-.018 1.042.751.751 0 0 1-1.042.018 1.998 1.998 0 0 0-2.83 0l-2.5 2.5a1.998 1.998 0 0 0 0 2.83Z"></path></svg></a></div>
  32. <p dir="auto">I don't think anyone has reliable information about post-2021 language usage by
  33. humans.</p>
  34. <p dir="auto">The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
  35. large is full of slop generated by large language models, written by no one to
  36. communicate nothing. Including this slop in the data skews the word
  37. frequencies.</p>
  38. <p dir="auto">Sure, there was spam in the wordfreq data sources, but it was manageable and
  39. often identifiable. Large language models generate text that masquerades as
  40. real language with intention behind it, even though there is none, and their
  41. output crops up everywhere.</p>
  42. <p dir="auto">As one example, <a href="https://pshapira.net/2024/03/31/delving-into-delve/" rel="nofollow">Philip Shapira
  43. reports</a> that ChatGPT
  44. (OpenAI's popular brand of generative language model circa 2024) is obsessed
  45. with the word "delve" in a way that people never have been, and has caused its
  46. overall frequency to increase by an order of magnitude.</p>
  47. <div class="markdown-heading" dir="auto"><h2 tabindex="-1" class="heading-element" dir="auto">Information that used to be free became expensive</h2><a id="user-content-information-that-used-to-be-free-became-expensive" class="anchor" aria-label="Permalink: Information that used to be free became expensive" href="#information-that-used-to-be-free-became-expensive"><svg class="octicon octicon-link" viewbox="0 0 16 16" version="1.1" aria-hidden="true"><path d="m7.775 3.275 1.25-1.25a3.5 3.5 0 1 1 4.95 4.95l-2.5 2.5a3.5 3.5 0 0 1-4.95 0 .751.751 0 0 1 .018-1.042.751.751 0 0 1 1.042-.018 1.998 1.998 0 0 0 2.83 0l2.5-2.5a2.002 2.002 0 0 0-2.83-2.83l-1.25 1.25a.751.751 0 0 1-1.042-.018.751.751 0 0 1-.018-1.042Zm-4.69 9.64a1.998 1.998 0 0 0 2.83 0l1.25-1.25a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042l-1.25 1.25a3.5 3.5 0 1 1-4.95-4.95l2.5-2.5a3.5 3.5 0 0 1 4.95 0 .751.751 0 0 1-.018 1.042.751.751 0 0 1-1.042.018 1.998 1.998 0 0 0-2.83 0l-2.5 2.5a1.998 1.998 0 0 0 0 2.83Z"></path></svg></a></div>
  48. <p dir="auto">Before I wrote this page, I'd been looking at how I would run the tool that
  49. updates wordfreq's data sources.</p>
  50. <p dir="auto">wordfreq is not just concerned with formal printed words. It collected more
  51. conversational language usage from two sources in particular: Twitter and
  52. Reddit.</p>
  53. <p dir="auto">The Twitter data was always built on sand. Even when Twitter allowed free
  54. access to a portion of their "firehose", the terms of use did not allow me to
  55. distribute that data outside of the company where I collected it (Luminoso).
  56. wordfreq has the frequencies that were built with that data as input, but the
  57. collected data didn't belong to me and I don't have it anymore.</p>
  58. <p dir="auto">Now Twitter is gone anyway, its public APIs have shut down, and the site has
  59. been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
  60. called X. Even if X made its raw data feed available (which it doesn't), there
  61. would be no valuable information to be found there.</p>
  62. <p dir="auto">Reddit also stopped providing public data archives, and now they sell their
  63. archives at a price that only OpenAI will pay.</p>
  64. <div class="markdown-heading" dir="auto"><h2 tabindex="-1" class="heading-element" dir="auto">I don't want to be part of this scene anymore</h2><a id="user-content-i-dont-want-to-be-part-of-this-scene-anymore" class="anchor" aria-label="Permalink: I don't want to be part of this scene anymore" href="#i-dont-want-to-be-part-of-this-scene-anymore"><svg class="octicon octicon-link" viewbox="0 0 16 16" version="1.1" aria-hidden="true"><path d="m7.775 3.275 1.25-1.25a3.5 3.5 0 1 1 4.95 4.95l-2.5 2.5a3.5 3.5 0 0 1-4.95 0 .751.751 0 0 1 .018-1.042.751.751 0 0 1 1.042-.018 1.998 1.998 0 0 0 2.83 0l2.5-2.5a2.002 2.002 0 0 0-2.83-2.83l-1.25 1.25a.751.751 0 0 1-1.042-.018.751.751 0 0 1-.018-1.042Zm-4.69 9.64a1.998 1.998 0 0 0 2.83 0l1.25-1.25a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042l-1.25 1.25a3.5 3.5 0 1 1-4.95-4.95l2.5-2.5a3.5 3.5 0 0 1 4.95 0 .751.751 0 0 1-.018 1.042.751.751 0 0 1-1.042.018 1.998 1.998 0 0 0-2.83 0l-2.5 2.5a1.998 1.998 0 0 0 0 2.83Z"></path></svg></a></div>
  65. <p dir="auto">wordfreq used to be at the intersection of my interests. I was doing corpus
  66. linguistics in a way that could also benefit natural language processing tools.</p>
  67. <p dir="auto">The field I know as "natural language processing" is hard to find these days.
  68. It's all being devoured by generative AI. Other techniques still exist, but
  69. generative AI sucks up all the air in the room and gets all the money. It's
  70. rare to see NLP research that doesn't have a dependency on closed data
  71. controlled by OpenAI and Google, two companies that I already despise.</p>
  72. <p dir="auto">wordfreq was built by collecting a whole lot of text in a lot of languages.
  73. That used to be a pretty reasonable thing to do, and not the kind of thing
  74. someone would be likely to object to. Now, the text-slurping tools are mostly
  75. used for training generative AI, and people are quite rightly on the defensive.
  76. If someone is collecting all the text from your books, articles, Web site, or
  77. public posts, it's very likely because they are creating a plagiarism machine
  78. that will claim your words as its own.</p>
  79. <p dir="auto">So I don't want to work on anything that could be confused with generative AI,
  80. or that could benefit generative AI.</p>
  81. <p dir="auto">OpenAI and Google can collect their own damn data, and I hope they have to pay a
  82. very high price for it. They made this mess themselves.</p>
  83. <p dir="auto">— Robyn Speer</p>