San Francisco-based cloud services company Cloudflare launched a new set of AI tools on Monday aimed at giving websites the ability to stop unauthorized scraping by AI crawlers, or to charge the crawlers for access to their data.
“What we’ve previewed today is the ability for site owners and internet publications to say, ‘this is the value I expect to receive from my site,’” Sam Rhea, a Cloudflare vice president, told Decrypt. “If you’re an AI LLM and you want to scan this content or train against it, or make it part of your search result, this is the value I expect to receive for that.”
The free Cloudflare Bot Management platform allows websites not only to block AI bots but also to charge a fee to any bots they approve, generating revenue from the platforms that have been feasting on their content for free.
The AI audit tool also gives site owners the ability to see how their content is being accessed.
As Rhea explained, unlike malicious bots that try to crash websites or cut in line ahead of human customers, AI crawlers don’t aim to harm or steal; they scan public content to train large language models.
Sometimes those bots attribute the information back to the source, plausibly sending valuable traffic, Rhea said. “But other times, they take material, put it in a blender, and share it as if it were just part of a generic source, without any citation. That seems dangerous to me.”
Cloudflare, which provides security and performance optimization for websites, has not seen any single platform dominate scraping activity, Rhea said, adding that the mix varies by the type of content being scraped at any given time.
Generative AI models require large amounts of data to deliver fast, accurate answers and to generate images, videos, and music. AI scraping is a growing industry that includes companies like LAION, Defined.AI, Aleph Alpha, and Replicate, which provide AI developers with pre-collected text, voice, and image datasets. According to market research firm Research Nester, the web scraping software industry is projected to reach $2.45 billion by 2036.
Last year, Ed Newton-Rex, the former head of audio at Stability AI, resigned over AI platforms’ claims that ingesting website data amounted to “fair use.”
“‘Fair use’ wasn’t designed with generative AI in mind — training generative AI models in this way is, to me, wrong,” he said. “Companies worth billions of dollars are, without permission, training generative AI models on creators’ works, which are then being used to create new content that in many cases can compete with the original works.”
Newton-Rex added: “I don’t see how this can be acceptable in a society that has set up the economics of the creative arts such that creators rely on copyright.”
Rhea said smaller AI developers seemed willing to pay to receive selected website content.
“From the conversations we’ve had with foundational model providers and new entrants in the space, the ocean of high-quality data is becoming difficult to find,” he said, noting that scientific and mathematical content was especially in demand.
Edited by Josh Quittner and Sebastian Sinclair