
Claude team sparks public anger by crawling data by any means necessary, renaming its crawler and ignoring no-crawl rules

2024-07-31


Hengyu from Aofei Temple
Quantum Bit | Public Account QbitAI

Claude's team has angered the public this time!

The reason: hitting one company's servers a million times in 24 hours, crawling and scraping the site's content for free.

Not only did it blatantly ignore the site's "no crawling" notice, it also hogged server resources.

The "victim" company actually tried its best to defend itself, but failed, and the content data was still taken away by Claude.



The head of the company was fuming, and he took to the mic to sound off:

Hey Anthropic, I know you're hungry for data. Claude is really smart!
But you know what? Not! Cool!



Many netizens were indignant. One, a copywriter by trade, commented:

I suggest using the word "stealing" rather than "not paying" to describe Anthropic's behavior.



For a while there, everyone was up in arms!

Some backed the condemnation, others demanded that Claude pay up; the comment section was a mess.



What happened

The company that called out Anthropic is iFixit, an American e-commerce and how-to website.

Part of iFixit's business is providing free online, Wikipedia-like repair guides for consumer electronics and gadgets.

The site contains millions of pages, including repair guides, guide revision histories, blogs, news posts and research, forums, community-contributed repair guides, and a Q&A section.

Then iFixit suddenly discovered that Claude's crawler, ClaudeBot, was sending thousands of requests per minute, hour after hour.

This equates to nearly a million visits to its website in one day.

By one count, the bot pulled 10 TB of files in a single day, and 73 TB in total over the month of May.



For this reason, iFixit CEO Kyle Wiens said:

ClaudeBot took all our data without permission and maxed out our servers... fine, let's call that no big deal.
But I do wonder: did it happen to crawl our licensing terms while it was at it?

Yes, you read that right, “without permission”.

iFixit had, in fact, posted a notice stating:

The reproduction, copying, or distribution of any content, materials, or design elements on this site for any other purpose (including training machine learning or artificial intelligence models) is strictly prohibited without the express prior written permission of iFixit.



It made no difference.

Not only did Claude carry on its scraping spree, it also slipped past iFixit's defenses.

iFixit had in fact managed to block two of Anthropic's crawler bots, named "ANTHROPIC-AI" and "CLAUDE-WEB."

But those two crawlers appear to be relics; the currently active one is "ClaudeBot," which had not been blocked.

As a last resort, Wiens said, iFixit modified its robots.txt file this week specifically to shut out Anthropic's crawlers.
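
For reference, a minimal sketch of the kind of robots.txt rule involved, using the bot names that appear in this story (the exact file iFixit deployed has not been published):

```
# Tell Anthropic's crawlers, matched by user-agent string, to stay out entirely
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: claude-web
Disallow: /
```

Of course, robots.txt is purely an honor-system convention; nothing forces a crawler to read it, which is the whole problem described here.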



So, any response from Anthropic?

They did not go quiet; they responded to the media:

ANTHROPIC-AI and CLAUDE-WEB are indeed old crawlers used by the company, but they are no longer in use.

Anthropic, of course, sidestepped the question of whether the currently active ClaudeBot respects robots.txt blocks.

This is not the first time AI companies have done this

Browse Anthropic's official site and you will find a long-standing support article titled "Does Anthropic crawl data from the web, and how can site owners block the crawler?"

It mentions:

In keeping with industry standards, Anthropic uses a variety of data sources for model development, such as publicly available data from the internet gathered via a web crawler.
Our crawling should not be intrusive or disruptive.
We aim to minimize disruption by being considerate about how quickly we crawl the same domains and by respecting Crawl-delay where appropriate.



But judging by the public outcry, Anthropic clearly has not lived up to that.

When it comes to crawling other people's data without permission, it is a repeat offender.

For example, in April this year, the Linux Mint forum was crawled.

Over the course of several hours, ClaudeBot hit the forum repeatedly to scrape data, leaving it extremely slow or unresponsive, until it finally went down completely.

By one account, over the same period ClaudeBot consumed the most traffic of any bot: 20 times as much as second place and 40 times as much as third.



In the discussion threads for both the April incident and this latest one, some people suggested:

Since posting a no-crawling notice is useless, why not seed the site with fake but uniquely traceable information, so you can detect who took the data?

iFixit did exactly that.

And it really worked; as Wiens put it, the site's seeded information turned up not just in Claude's crawl but in OpenAI's as well...
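
For the curious, here is a minimal sketch of how such a trap can work, assuming a simple one-token-per-page scheme; everything in it is illustrative, since iFixit has not published its actual method:

```python
import uuid

def make_canary(page_id: str) -> str:
    """Build a unique fake 'fact' to plant in one page of the site."""
    token = uuid.uuid4().hex[:12]
    return f"Internal service code for {page_id}: {token}"

# Remember which canary was planted on which page, so a hit can be traced.
canaries = {page: make_canary(page) for page in ["guide-101", "guide-102"]}

def find_leaks(model_output: str) -> list[str]:
    """Return the pages whose planted canary shows up verbatim in a model's output."""
    return [page for page, text in canaries.items() if text in model_output]
```

Because each token is unique and appears nowhere else on the web, any model that reproduces one must have ingested the page it was planted on.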



So, realistically, what can anyone do about it? For now, there is no real solution.

Because beyond Claude and GPT, plenty of other AI crawlers barge in and help themselves.

A few days ago, a bot-detection startup called TollBit claimed that Perplexity, Claude, and OpenAI ignore robots.txt settings on the websites they crawl. When asked about this at the time, OpenAI declined to comment.



And this was hardly an isolated case: just last month, the same story played out again.

Forbes condemned the AI search product Perplexity for allegedly plagiarizing its news articles; the ensuing uproar prompted more media outlets to come forward and accuse Perplexity's crawler, PerplexityBot, of scraping their sites without authorization.

Perplexity’s attitude has always been:

Respect publishers’ requests not to scrape content, and operate within the boundaries of fair use copyright law.

In theory, when a crawler like ClaudeBot or PerplexityBot encounters a robots.txt file that disallows crawling, it should honor the protocol and keep away from that site's content.
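
Complying is not technically hard, either; Python's standard library even ships a robots.txt parser. A minimal sketch of what a protocol-respecting crawler is supposed to do before fetching a page (the domain and path below are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

url = "https://www.example.com/Guide/some-repair-guide"
if rp.can_fetch("ClaudeBot", url):
    delay = rp.crawl_delay("ClaudeBot")  # None if the site sets no Crawl-delay
    print(f"Allowed; wait {delay or 0}s between requests to be polite")
else:
    print("robots.txt disallows this user agent; skip the page")
```

The catch, as this whole saga shows, is that nothing compels a crawler to run this check.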

Since such declarations are ignored, some have urged creators to move as much content as possible behind paywalls to prevent unrestricted scraping.

Do you think this approach will be effective?

Reference Links:
[1]https://www.404media.co/websites-are-blocking-the-wrong-ai-scrapers-because-ai-companies-keep-making-new-ones/
[2]https://www.404media.co/anthropic-ai-scraper-hits-ifixits-website-a-million-times-in-a-day/
[3]https://twitter.com/kwiens/status/1816128302542905620
[4]https://x.com/Carnage4Life/status/1804316030665396356
[5]https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler?ref=404media.co