
title: Block the Bots that Feed “AI” Models by Scraping Your Website
url: https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
hash_url: af6aeab9b8

“AI” companies think that we should have to opt-out of data-scraping bots that take our work to train their products. There isn’t even a required no-scraping period between the announcement and when they start. Too late? Tough. Once they have your data, they don’t provide you with a way to have it deleted, even before they’ve processed it for training.

These companies should be prevented from using data that they haven’t been given explicit consent for. Opt-out is problematic as it counts on concerned parties hearing about new or modified bots BEFORE their sites are targeted by them. That is simply not practical.

It should be strictly opt-in. No one should be required to provide their work for free to any person or organization. The online community is under no responsibility to help them create their products. Some will declare that I am “Anti-AI” for saying such things, but that would be a misrepresentation. I am not declaring that these systems should be torn down, simply that their developers aren’t entitled to our work. They can still build those systems with purchased or donated data.

There are ongoing court cases and debates in political circles around the world. Decisions and policies will move more slowly than either side on this issue would like, but in the meantime, SOME of the bots involved in scraping data for training have been identified and can be blocked. (Others may still be secret or operate without respect for the wishes of a website’s owner.) Here’s how:

(If you are not technically inclined, please talk to your webmaster, whatever support options are at your disposal, or a tech-savvy friend.)

robots.txt

This is a file placed in the home directory of your website that is used to tell web crawlers and bots which portions of your website they are allowed to visit. Well-behaved bots honor these directives. (Not all scraping bots are well-behaved and there are no consequences, short of negative public opinion, for ignoring them. At this point, there have been no claims that bots being named in this post have ignored these directives.)

This is what our robots.txt looks like:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /

The first line identifies CCBot, the bot used by the Common Crawl. Common Crawl data has been used to train a number of models, including those behind ChatGPT and Bard. The second line, Disallow: /, states that this user agent is not allowed to access any part of our website. Some image-scraping bots also use Common Crawl data to find images.

The next two user-agents identify ChatGPT-specific bots.

ChatGPT-User is the bot used when a ChatGPT user instructs it to reference your website. It’s not automatically going to your site on its own, but it is still accessing and using data from your site.

GPTBot is a bot that OpenAI specifically uses to collect bulk training data from your website for ChatGPT.

Google-Extended is the recently announced product token that allows you to block Google from scraping your site for Bard and VertexAI. This will not have an impact on Google Search indexing. The only way this works is if it is in your robots.txt. According to their documentation: “Google-Extended doesn’t have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity.”

Omgilibot and Omgili are from webz.io. I noticed The New York Times was blocking them and discovered that they sell data for training LLMs.

FacebookBot is Meta’s bot that crawls public web pages to improve language models for their speech recognition technology. This is not what Facebook uses to get the image and snippet for when you post a link there.

ChatGPT has been previously reported to use another unnamed bot that had been referencing Reddit posts to find “quality data.” That bot’s user agent has never been officially identified and its current status is unknown.

Updating or Installing robots.txt

You can check if your website has a robots.txt by going to yourwebsite.com/robots.txt. If that page doesn’t come up (your browser reports it can’t be found), then you don’t have one.
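If you’d rather check from a script, here is a minimal sketch using only Python’s standard library; the domain is a placeholder you would swap for your own site. It simply reports which of the user agents discussed in this post are currently disallowed by your robots.txt.

from urllib.robotparser import RobotFileParser

# Placeholder domain; replace with your own site.
SITE = "https://yourwebsite.com"
BOTS = ["CCBot", "ChatGPT-User", "GPTBot", "Google-Extended",
        "Omgilibot", "Omgili", "FacebookBot"]

parser = RobotFileParser(SITE + "/robots.txt")
parser.read()  # a missing robots.txt is treated as "allow everything"

for bot in BOTS:
    allowed = parser.can_fetch(bot, SITE + "/")
    print(f"{bot}: {'allowed (not blocked)' if allowed else 'disallowed'}")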

If your site is hosted by Squarespace, or another simple website-building site, you could have a problem. At present, many of those companies don’t allow you to update or add your own robots.txt. They may not even have the ability to do it for you. I recommend contacting support so you can get specific information regarding their current abilities and plans to offer such functionality. Remind them that once slurped up, you have no ability to remove your work from their hold, so this is an urgent priority. (It also demonstrates once again why “opt-out” is a bad model.)

If you are using Wix, they provide directions for modifying your robots.txt here.

If you are using WordPress, there are a few plugins that allow you to modify your robots.txt. Many SEO (Search Engine Optimization) plugins include robots.txt editing features. (Use those instead of making your own.) Here are a few we’ve run into:

  • Yoast: directions
  • AIOSEO: directions (there’s a report in the comments that user agent blocking may not be working at the moment)
  • SEOPress: directions

If your WordPress site doesn’t have a robots.txt or something else that modifies robots.txt, these two plugins can block GPTBot and CCBot for you. (Disclaimer: I don’t use these plugins, but know people who do.)

For more experienced users: If you don’t have a robots.txt, you can create a text file by that name and upload it via FTP to your website’s home directory. If you have one, it can be downloaded, altered and reuploaded. If your hosting company provides you with cPanel or some other control panel, you can use its file manager to view, modify, or create the file as well.
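For those who would rather script the upload than use an FTP client, the following is a minimal sketch using Python’s standard ftplib. The host name, credentials, and the public_html directory name are placeholders; many hosts only offer SFTP, in which case use whatever upload tool your host recommends instead.

from ftplib import FTP

# Placeholder credentials; use the FTP details from your hosting provider.
HOST = "ftp.yourwebsite.com"
USER = "your-username"
PASSWORD = "your-password"

with FTP(HOST) as ftp:
    ftp.login(USER, PASSWORD)
    ftp.cwd("public_html")  # a common web-root directory name; yours may differ
    with open("robots.txt", "rb") as local_file:
        ftp.storbinary("STOR robots.txt", local_file)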

If your site already has a robots.txt, it’s important to know where it came from as something else may be updating it. You don’t want to accidentally break something, so talk to whoever set up your website or your hosting provider’s support team.

Firewalls and CDNs (less common, but better option)

Your website may have a firewall or CDN in front of your actual server. Many of these products have the ability to block bots and specific user agents. Blocking the six user agents listed above (CCBot, GPTBot, ChatGPT-User, Omgilibot, Omgili, and FacebookBot) there is even more effective than using a robots.txt directive. (As I mentioned, directives can be ignored. Blocks at the firewall level prevent those bots from accessing your site at all.) Some of these products include Sucuri, Cloudflare, QUIC.cloud, and Wordfence. (Happy to add more if people let me know about them. Please include a link to their user agent blocking documentation as well.) Contact their support if you need further assistance.
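As one illustration: in Cloudflare, a WAF custom rule with its action set to Block can match these user agents with an expression along the lines of the sketch below. This is an assumption about how you might phrase it, not an official recipe; check Cloudflare’s documentation (or your own product’s equivalent) for the exact fields and current interface.

(http.user_agent contains "CCBot") or
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ChatGPT-User") or
(http.user_agent contains "Omgilibot") or
(http.user_agent contains "Omgili") or
(http.user_agent contains "FacebookBot")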

NOTE: Google-Extended isn’t a bot. You need to have this in your robots.txt file if you want to prevent them from using your site content for training.

.htaccess (another option)

In the comments, DJ Mary pointed out that you can also block user agents with your website’s .htaccess file by adding these lines:

# Return 403 Forbidden to any request whose user agent matches one of these bots
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|Omgilibot|Omgili|FacebookBot) [NC]
RewriteRule ^ - [F]

I’d rate this one as something for more experienced people to do. This has a similar effect to that of the firewall and CDN blocks above.

NOTE: Google-Extended isn’t a bot. You need to have this in your robots.txt file if you want to prevent them from using your site content for training.

Additional Protection for Images

There are some image-scraping tools that honor the following directive:

<meta name="robots" content="noai, noimageai">

when placed in the header section of your webpages. Unfortunately, many image-scraping tools allow their users to ignore this directive.
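If you are unsure where that tag belongs, here is a minimal sketch of a page’s head element with the directive in place; the surrounding markup is illustrative only and not specific to any platform.

<head>
  <title>Your page title</title>
  <meta name="robots" content="noai, noimageai">
</head>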

Tools like Glaze and Mist can make it more difficult for models to perform style mimicry based on altered images (assuming the scrapers don’t get, or already have, an unaltered copy from another source).

There are other techniques that you can apply for further protection (blocking direct access to images, watermarking, etc.) but I’m probably not the best person to talk to for this one. If you know a good source, recommend them in the comments.

Podcasts

The standard lack of transparency from the “AI” industry makes it difficult to know what is being done with regards to audio. It is clear, however, that the Common Crawl has audio listed among the types of data it has acquired. Blocks to the bots mentioned should protect an RSS feed (the part of your site that shares information about episodes), but if your audio files (or RSS feed) are hosted on a third party website (like Libsyn, PodBean, Blubrry, etc.), it may be open from their end if they aren’t blocking. I am presently unaware of any that are blocking those bots, but I have started asking. The very nature of how podcasts are distributed makes it very difficult to close up the holes that would allow access. This is yet another reason why Opt-In needs to be the standard.

ai.txt

I just came across this one recently and I don’t know which “AI” companies are respecting Spawning’s ai.txt settings, but if anyone is, it’s worth having. They provide a tool to generate the file and an assortment of installation directions for different websites.

https://site.spawning.ai/spawning-ai-txt

Closing

None of these options are guarantees. They are based on an honor system and there’s no shortage of dishonorable people who want to acquire your data for the “AI” gold rush or other purposes. Sadly, the most effective means of protecting your work from scraping is to not put it online at all. Even paywall models can be compromised by someone determined to do so.

Writers and artists should also start advocating for “AI”-specific clauses in their contracts to restrict publishers from using, selling, donating, or licensing your work for the purposes of training these systems. Online works might be the most vulnerable to being fed to training algorithms, but print, audio, and ebook editions developed by publishers can be used too. It is not safe to assume that anyone will take the necessary steps to protect your work from these uses, so get it in writing.

 

[This post will be updated with additional information as it becomes available.]

9/28/2023 – Added the recently announced Google-Extended robots.txt product token. This must be in robots.txt. There are no alternatives.

9/28/2023 – Added Omgilibot/Omgili, bots apparently used by a company that sells data for LLM training.

9/29/2023 – Adam Johnson on Mastodon pointed us at FacebookBot, which is used by Meta to help improve their language models.