The ever-growing AI industry brings new legal and ethical challenges regarding how AI uses website content to learn, as well as how organizations use AI in the production and distribution of their products.
With the rise of generative AI, the demand for content used to train models has skyrocketed. The web is AI systems’ main source of data for learning.
One way to obtain data is to scrape the web with AI crawlers. These bots download and index content from across the internet, aiming to understand what each webpage is about so the information can be retrieved when needed.
The Good (Bot), the Bad (Bot), and the Restrictions
It is common to divide AI bots into “good” and “bad” ones:
- “Good” bots perform useful or helpful tasks, like scanning content or interacting with webpages. Good bots do not harm a user’s experience on the internet.
- “Bad” bots are designed to violate the rights of website owners and users. They assist in copyright infringements, privacy violations, and unfair trading.
One sign that a bot isn’t “good” is that it bypasses a website owner’s restrictions. One such restriction is the robots.txt file, which lets website owners tell web crawlers which parts of a site they may visit and keep unwanted attention away from specific pages or the entire site.
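For illustration, here is what such a restriction can look like in practice. The user-agent tokens below (GPTBot and Google-Extended) are publicly documented by OpenAI and Google; the paths are placeholders:

```
# robots.txt: directives are voluntary, and compliant crawlers honor them
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /private/
```

The first two rules turn the named AI crawlers away from the entire site, while every other bot is only asked to stay out of one section.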
However, even “good” bots that serve the highest purpose of humanity — technological development and human prosperity — are under attack.
Web owners are leveraging every tool at their disposal, such as the robots.txt file and specific clauses in Terms of Service, to restrict AI access to their content.
According to the article “Consent in Crisis: The Rapid Decline of the AI Data Commons”, between April 2023 and April 2024 approximately 25% of the most critical online data sources became inaccessible to AI crawlers. That share is significant, particularly given how often one hears about the shortage of quality data for further improving models, the limitations of synthetic data, and predictions of the AI sector’s eventual decline.
According to the same research, the share of owners who genuinely oppose AI bots may be lower than the blocking numbers suggest. The problem is that, in many cases, website owners fail to articulate effectively how AI systems may use their data.
Recent studies reveal a paradox:
The AI web crawlers that are blocked the most are not necessarily the most active ones.
For example, while the most aggressive AI bot is Bytespider (a web crawler operated by ByteDance, the Chinese owner of TikTok), reportedly achieving scraping volumes 25 times higher than OpenAI’s GPTBot, websites tend to block OpenAI’s web crawler more frequently.
Many users may not even be aware of the more aggressive AI bots actively crawling their sites.
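One way to check this for yourself is to read a site’s robots.txt programmatically. The minimal Python sketch below (standard library only; the bot names are publicly documented user-agent tokens, and the target site is a placeholder) reports which well-known AI crawlers a robots.txt turns away from the site root:

```python
# Minimal sketch: which known AI crawlers does a site's robots.txt block?
from urllib.robotparser import RobotFileParser

# Publicly documented AI crawler user-agent tokens
AI_BOTS = ["GPTBot", "Bytespider", "CCBot", "ClaudeBot", "Google-Extended"]

def blocked_ai_bots(site: str) -> list[str]:
    parser = RobotFileParser(f"{site}/robots.txt")
    parser.read()  # fetch and parse the site's robots.txt
    # Treat a bot as "blocked" if it may not fetch the site root
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, f"{site}/")]

if __name__ == "__main__":
    print(blocked_ai_bots("https://example.com"))  # placeholder URL
```

Running this against one’s own site is a quick way to discover which of the more aggressive crawlers are not, in fact, being blocked.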
AI, News, and Courts
AI companies, the leaders of modern progress, raise concerns among website owners and provoke intense disputes. When data holders do not want their web data to be used by AI companies, even for the legitimate and socially significant purpose of improving AI models, they have the right to demand that their rights be respected. Don’t they?
In December 2023, The New York Times filed a lawsuit against OpenAI and Microsoft for copyright infringement. In November 2024, a group of Canadian news outlets — including CBC/Radio-Canada, Postmedia, Metroland, The Toronto Star, The Globe and Mail, and The Canadian Press — followed suit, launching a lawsuit against OpenAI on similar grounds.
The main complaints in both cases are:
1. AI companies are violating copyright when scraping data from websites.
2. Extracting key information from news articles shifts web traffic away from news sites to AI chatbots. As a result, news agencies see a decrease in advertising and subscription revenue, while AI developers benefit.
3. AI “hallucinations” undermine trust in the news sources AI cites.
OpenAI is not the only target. Similar lawsuits have been filed against Microsoft, Perplexity, and Anthropic. While most website owners share these concerns, why have legal challenges been rare until now?
A possible answer is that companies feared being the first to sue and losing their case. It is likely that once the first ruling goes against an AI company, lawsuits will become more frequent.
Fortunately, we recently got the first winning rulings: one in Delaware and one in Germany. But who’s winning?
In the U.S., the Thomson Reuters v. ROSS Intelligence case set a decisive precedent by ruling that using Westlaw’s proprietary headnotes to train an AI model constitutes copyright infringement, especially when the AI product competes directly with Westlaw, leaving the fair use defense untenable.
Conversely, in September 2024, the Regional Court of Hamburg in Germany offered a different perspective.
The court determined that an AI research association’s compilation of a vast image-text dataset, including a watermarked photograph, fell within the scope of text and data mining exceptions for scientific research. This decision rested on the fact that the image was publicly accessible, and its copyright restrictions were not embedded in a machine-readable format.
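What counts as “machine-readable” is still being standardized. One emerging convention, the TDM Reservation Protocol drafted by a W3C Community Group, lets a server signal a text-and-data-mining opt-out in its HTTP response headers. The snippet below is an illustration of that idea; the policy URL is hypothetical:

```
HTTP/1.1 200 OK
Content-Type: image/jpeg
tdm-reservation: 1
tdm-policy: https://example.com/tdm-policy.json
```

Here `tdm-reservation: 1` declares that mining rights are reserved, and the optional `tdm-policy` URL points to the terms under which mining may still be licensed. A signal of this kind, rather than a visible watermark, is one way an opt-out could be expressed so that crawlers detect it automatically.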
Together, these rulings illustrate that uncertainty still lingers in the legal landscape for AI training on copyrighted material. Each decision was based on a specific set of circumstances, meaning that while these cases provide important insights, they do not necessarily resolve the broader disputes, such as the ongoing lawsuits between news media outlets and OpenAI, which involve different factual contexts.
Maybe a deal?
While most news publishers fear that AI bots could threaten their industry, not everyone shares these concerns. If one strategy for website owners is to fight against AI companies, another is to opt for cooperation.
Axel Springer (POLITICO, Business Insider, BILD, and WELT) is bucking the trend among news publishers by announcing a global partnership with OpenAI. Through this deal, ChatGPT users receive summaries of selected global news content from Axel Springer’s media, with attribution and links to the full articles for transparency and further information. France’s Le Monde and Spain’s Prisa Media have also partnered with OpenAI.
AI Dolce Vita
The Italian media conglomerate GEDI Gruppo Editoriale sought to collaborate with OpenAI by providing editorial content for AI training. However, given OpenAI’s history of regulatory scrutiny in Italy — including a temporary ban in 2023 and a €15 million fine in 2024 — such initiatives are under close watch by the Italian Data Protection Authority (Garante).
In a recent decision, the Garante warned GEDI that its agreement with OpenAI raises compliance concerns under the GDPR. Whether this rigorous oversight strengthens data protection for Italian residents or stifles technological progress remains open to debate. However, one thing is clear: it creates significant compliance hurdles for AI companies operating in the Italian market.
Another deal?
Not only do some websites allow AI to access their content, but they also put AI to their own use. For some, creating and developing their own AI tools is becoming a strategic option.
News agencies such as The Associated Press and Bloomberg are already using AI to eliminate repetitive tasks and allow journalists to focus on higher-impact reporting. Bergens Tidende in Norway conducts customer satisfaction surveys on its website using AI.
India Today has created an AI news anchor, Sana, who delivers news headlines alongside the channel’s main TV anchor. These are just a few examples of websites deploying their own AI tools.
On one side, there is significant creative, human, and financial investment made by creators and website owners. If AI developers do not seek permission or provide compensation for scraped data, human creativity, truthful content, and the rights of creators and website owners will inevitably suffer.
On the other side, web data scraping fuels AI with diverse texts, images, and videos — improving efficiency and increasing data accessibility. Scale and data matter, and very few sources provide public-scale data like the web does.
AI models could be trained on alternative data sources, and big companies are already turning to synthetic data. However, some studies suggest that training on poor-quality synthetic data can degrade AI model performance.
The most recent regulation addressing this issue is the EU AI Act. While the AI Act does not directly regulate the relationship between web data holders and AI companies, it does attempt to find a balance — one that could be extended to other jurisdictions in the future.
It states that if rightsholders decide to reserve their rights to prevent text and data mining (i.e., opt-out), model providers must obtain their authorization to use such protected content. This presumption of consent could be a great solution for both data holders and AI developers. However, it remains unclear how web data owners should articulate their rights, which could lead to potential abuses by AI companies.
Since the robots.txt file is not legally binding, the most effective solution would be to establish a legal mechanism for protection. Possible approaches include:
- DMCA claims based on the circumvention of copyright protection systems.
- Explicitly defining access restrictions for AI bots in website Terms of Use and raising fines for violations.
- Addressing the issue through fair competition law.
What’s next?
As we can see, an effective way to oppose AI bots hasn’t been discovered yet. Except, perhaps, one: negotiation and finding mutually beneficial arrangements.
#VeilMagazineTimeDrop
Step Into the Future with Veil Magazine
This article is just 1 of 6 visionary pieces from Veil Magazine — our special edition exploring AI, Data, and Innovation through the dual lenses of law and futurism.
Each page unlocks a new layer of tomorrow’s world, where technology meets ethics, and the future of regulation is already being written.
Read the full version of Veil Magazine via the link:
https://claimsip.com/promo/veil-magazine