- title: Why wordfreq will not be updated
- url: https://github.com/rspeer/wordfreq/blob/master/SUNSET.md#i-dont-want-to-be-part-of-this-scene-anymore
- hash_url: 1cd6127ccec88387f4804f0c3cf1011a
- archive_date: 2024-09-30
- og_image: https://opengraph.githubassets.com/23640beea69cc6e481c29b906530677a890056e4755f495489cf39654858b963/rspeer/wordfreq
- description: Access a database of word frequencies, in various natural languages. - rspeer/wordfreq
- favicon: https://github.githubassets.com/favicons/favicon.png
- language: en_US
-
- <p dir="auto">This documentation page has gotten a lot of attention recently! I
- think most of the people who find it understand where I'm coming from. I'd
- like to highlight a couple of things, now that people are linking to this
- page from all sorts of contexts.</p>
- <ul dir="auto">
- <li>
- <p dir="auto">I still work on open-source libraries. Here's <a href="https://github.com/rspeer/python-ftfy">ftfy</a>,
- the popular multi-purpose Unicode fixer.</p>
- </li>
- <li>
- <p dir="auto">You could see this freezing of wordfreq data as a good thing. Many people
- have found wordfreq useful, and the latest version isn't going away. The
- conclusion that I'm documenting here is that <em>updating it would make it
- worse</em>, so instead, I'm not updating it. It'll become outdated over time,
- but it won't get actively worse. That's a pretty okay fate for something
- on the Internet!</p>
- </li>
- </ul>
- <div class="markdown-heading" dir="auto"><h1 tabindex="-1" class="heading-element" dir="auto">Why wordfreq will not be updated</h1><a id="user-content-why-wordfreq-will-not-be-updated" class="anchor" aria-label="Permalink: Why wordfreq will not be updated" href="#why-wordfreq-will-not-be-updated"></a></div>
- <p dir="auto">The wordfreq data is a snapshot of language that could be found in various
- online sources up through 2021. There are several reasons why it will not be
- updated anymore.</p>
- <div class="markdown-heading" dir="auto"><h2 tabindex="-1" class="heading-element" dir="auto">Generative AI has polluted the data</h2><a id="user-content-generative-ai-has-polluted-the-data" class="anchor" aria-label="Permalink: Generative AI has polluted the data" href="#generative-ai-has-polluted-the-data"></a></div>
- <p dir="auto">I don't think anyone has reliable information about post-2021 language usage by
- humans.</p>
- <p dir="auto">The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at
- large is full of slop generated by large language models, written by no one to
- communicate nothing. Including this slop in the data skews the word
- frequencies.</p>
- <p dir="auto">Sure, there was spam in the wordfreq data sources, but it was manageable and
- often identifiable. Large language models generate text that masquerades as
- real language with intention behind it, even though there is none, and their
- output crops up everywhere.</p>
- <p dir="auto">As one example, <a href="https://pshapira.net/2024/03/31/delving-into-delve/" rel="nofollow">Philip Shapira
- reports</a> that ChatGPT
- (OpenAI's popular brand of generative language model circa 2024) is obsessed
- with the word "delve" in a way that people never have been, and has caused its
- overall frequency to increase by an order of magnitude.</p>
- <div class="markdown-heading" dir="auto"><h2 tabindex="-1" class="heading-element" dir="auto">Information that used to be free became expensive</h2><a id="user-content-information-that-used-to-be-free-became-expensive" class="anchor" aria-label="Permalink: Information that used to be free became expensive" href="#information-that-used-to-be-free-became-expensive"></a></div>
- <p dir="auto">Before I wrote this page, I'd been looking at how I would run the tool that
- updates wordfreq's data sources.</p>
- <p dir="auto">wordfreq is not just concerned with formal printed words. It collected more
- conversational language usage from two sources in particular: Twitter and
- Reddit.</p>
- <p dir="auto">The Twitter data was always built on sand. Even when Twitter allowed free
- access to a portion of their "firehose", the terms of use did not allow me to
- distribute that data outside of the company where I collected it (Luminoso).
- wordfreq has the frequencies that were built with that data as input, but the
- collected data didn't belong to me and I don't have it anymore.</p>
- <p dir="auto">Now Twitter is gone anyway, its public APIs have shut down, and the site has
- been replaced with an oligarch's plaything, a spam-infested right-wing cesspool
- called X. Even if X made its raw data feed available (which it doesn't), there
- would be no valuable information to be found there.</p>
- <p dir="auto">Reddit also stopped providing public data archives, and now they sell their
- archives at a price that only OpenAI will pay.</p>
- <div class="markdown-heading" dir="auto"><h2 tabindex="-1" class="heading-element" dir="auto">I don't want to be part of this scene anymore</h2><a id="user-content-i-dont-want-to-be-part-of-this-scene-anymore" class="anchor" aria-label="Permalink: I don't want to be part of this scene anymore" href="#i-dont-want-to-be-part-of-this-scene-anymore"></a></div>
- <p dir="auto">wordfreq used to be at the intersection of my interests. I was doing corpus
- linguistics in a way that could also benefit natural language processing tools.</p>
- <p dir="auto">The field I know as "natural language processing" is hard to find these days.
- It's all being devoured by generative AI. Other techniques still exist, but
- generative AI sucks up all the air in the room and gets all the money. It's
- rare to see NLP research that doesn't have a dependency on closed data
- controlled by OpenAI and Google, two companies that I already despise.</p>
- <p dir="auto">wordfreq was built by collecting a whole lot of text in a lot of languages.
- That used to be a pretty reasonable thing to do, and not the kind of thing
- someone would be likely to object to. Now, the text-slurping tools are mostly
- used for training generative AI, and people are quite rightly on the defensive.
- If someone is collecting all the text from your books, articles, Web site, or
- public posts, it's very likely because they are creating a plagiarism machine
- that will claim your words as its own.</p>
- <p dir="auto">So I don't want to work on anything that could be confused with generative AI,
- or that could benefit generative AI.</p>
- <p dir="auto">OpenAI and Google can collect their own damn data, and I hope they have to pay a
- very high price for it. They made this mess themselves.</p>
- <p dir="auto">— Robyn Speer</p>