My digging into the data sources used for LLM training, to understand their use of copyrighted content, used to be a lonely enterprise… well, now it’s mainstream, as it should be. The frequency of this newsletter has gone way down, partly because so much great work is being done elsewhere.
The latest is the blockbuster New York Times story digging into the shortcuts that big tech has taken in its hunger for data. Google Docs, anyone?
The New York Times is of course running a high-profile legal case against Microsoft and OpenAI over their LLM training and, in common with many publishers, is using its robots.txt file to try to block the webscraping companies that feed generative AI, both future foundation-model training efforts and efforts to use material in other ways (adding to RAG datastores, for example) to keep models current and updated. (I just learned that the Internet Archive does not respect robots.txt instructions from news publishers… see the link at the end of this newsletter.)
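If you want to see what that blocking looks like in practice, here is a minimal sketch, using only the Python standard library, of how you might check whether a site’s robots.txt permits a given crawler. CCBot (Common Crawl) and GPTBot (OpenAI) are the publicly documented crawler names; which sites block them, and for which paths, changes over time, so treat the output as a snapshot rather than a verdict.

```python
# A minimal sketch: check whether a site's robots.txt allows a given crawler.
# Uses only the Python standard library. The user-agent names below (CCBot for
# Common Crawl, GPTBot for OpenAI) are the documented crawler names, but which
# sites block them varies over time.
from urllib.robotparser import RobotFileParser

def crawler_allowed(site: str, user_agent: str, path: str = "/") -> bool:
    """Fetch https://<site>/robots.txt and ask whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.set_url(f"https://{site}/robots.txt")
    parser.read()  # downloads and parses the robots.txt file
    return parser.can_fetch(user_agent, f"https://{site}{path}")

if __name__ == "__main__":
    for bot in ("CCBot", "GPTBot", "Googlebot"):
        print(bot, "allowed on www.nytimes.com:", crawler_allowed("www.nytimes.com", bot))
```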
This led to the surprising claim that the New York Times is burning books. The words came from the Common Crawl Foundation’s new CEO, Rich Skrenta, in an excruciating interview with Jeff Jarvis and Jason Howell on their Inside AI podcast. “Burning books” is how Skrenta characterizes the NYT’s request that its content be removed from the Common Crawl database being used to train large language models, the sort of request that Common Crawl has long said it would honour.
Egging each other on, Jarvis, Howell and Skrenta piled on the hyperbole: it’s “burning books,” it’s “bad for civil society,” it’s “bad for democracy,” even “it’s possible this could be the end of the Internet…” My favourite reason Jarvis gives for why the New York Times should not pull its data from Common Crawl is that if it does, right-wing news sources will have much greater comparative influence in LLM training data.
Who is Common Crawl and why are they so upset?
OK, stepping back a bit. Who is Common Crawl and why are they so upset by this request? I’ve been wanting to write about Common Crawl for some time, because it is now the central source for text used to train AIs, and it has become even more central over the last year or so. This is the organization whose webcrawling provided the raw material for the bulk of the training sets assembled by AI companies.
Common Crawl is the source for the C4 dataset, widely used by Meta, Google and others to train their models. C4 is the dataset that the Washington Post built a search tool around for the article “See the websites that make AI bots like ChatGPT sound so smart”. If you think that “crawling the internet” somehow means only public content from forums and the like, revisit that WaPo story and see how many publishers are among the most important sources in the dataset (along with websites hosting pirated books).
A filtered version of the Common Crawl webscrape makes up some 60% of the text used to train GPT-3 (with Books1 and Books2 at about 16%, and the balance coming from WebText2 and Wikipedia). Common Crawl webscrapes are common to the training of nearly every model that has revealed its training data, from LLaMA-1 (20%), to France’s Mistral and Singapore’s SEA-LION (100%).
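To see how low the barrier to using this material is, here is a hedged sketch of how C4 is commonly pulled down today, via the Hugging Face datasets library and its allenai/c4 mirror. The dataset name, configuration and field names are assumptions on my part and may change, but the point stands: a filtered copy of the crawled web is a few lines of code away.

```python
# A sketch of how C4 (the filtered Common Crawl corpus) is commonly accessed.
# Assumes the Hugging Face `datasets` library and the `allenai/c4` mirror;
# dataset names and configs may change, so treat this as illustrative.
from datasets import load_dataset

# Streaming avoids downloading the full corpus (hundreds of GB) up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    # Each record carries the scraped text plus the URL it was taken from,
    # which is how tools like the Washington Post's can attribute sources.
    print(example["url"])
    print(example["text"][:200], "...\n")
    if i >= 2:
        break
```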
So Common Crawl has created billions of dollars of value for Big Tech.
Who is Common Crawl?
In his interview, Skrenta portrays Common Crawl as a library or archive first and foremost, preserving our memory of the great human endeavour of the internet. But if you dig into the roots and origins of Common Crawl, and follow the money, it’s hard to stay with that image.
Common Crawl was started as a 501(c)(3) non-profit by an entrepreneur named Gil Elbaz. Elbaz hit the big time way back in 2003 when he sold his company to Google for US$ 103m. He joined Google to shepherd the integration of his product, AdSense, into Google’s online advertising offering. These were early days in the shift of the ad market from publishers to platforms, and by better matching ads to content, AdSense helped Google establish its dominance in online advertising. In 2011 Google reported that some US$ 9 billion, or 28% of its total revenue, was coming via AdSense.
But in 2007/8 Elbaz left Google and started a new business, Factual. He established the Common Crawl Foundation around the same time. Data from before 2013 is not readily available from the Common Crawl repository, but it has accumulated petabytes of copied webpages since that time.
In 2013 Gil Elbaz gave a TV interview about Common Crawl and his “in parallel” data aggregation business Factual (later merged with Foursquare). He discussed the two together. You can watch the 2013 interview here; it was recorded long before GenAI appeared on the scene. Here is Elbaz from that interview:
So Common Crawl is quite simply, it's a copy of the Web. The Web, it's the most fantastic collection of knowledge that humanity has assembled. So why not make it available? We want to make this available to as many people as possible. Make it available to everyone.
In order to build upon what the Web is, in order to extract deep insight from it, you need to access it. And while most people think that the Web is open, because as a human, you can browse and look at whatever you want, in order to do deep research on top of it, you need to access all of it, all at once. Very quickly.
Now, we want this to be available for educational purposes, for research, but also for startups and for business.
Indeed, for business. Elbaz’s Factual, created alongside Common Crawl, was based on gathering, fusing and aggregating information on locations. It was used to power location-based advertising. Factual later merged with Foursquare, which remains a big player in the data-fusion business, helping to understand location data and tracking the locations of users with “machine learning models that combine first-party GPS, Cell, Wifi, Bluetooth, Accelerometer, and time-of-day data.”
Foursquare is private, but has around 400 employees, and according to Wikipedia as of April 10, Elbaz remains on the Board.
The Board of Directors of Common Crawl comprises Elbaz and two former executives from Factual.
It put me in mind of one of my favourite books of last year, The Hank Show by McKenzie Funk, which describes the origins of the data-fusion business: the merging of data from different sources to create detailed information on individuals. A must-read to really understand what companies can know about you (a lot!) and the dynamics of the way data has been aggregated over the last forty years. While the anti-hero of Funk’s book started by aggregating records on individuals from DMVs around America, he was quick to build the business on the information that was coming from online services and the internet. Webscraping to collect data on people and places is not a new thing.
Follow the money
Still, I wondered about the business model that allowed such a large-scale (but non-profit!) webscraping operation to sit at the very foundation of the LLM revolution. As a 501(c)(3) non-profit, the Common Crawl Foundation has to make its tax returns public via the IRS. Only 2016–2018 were on file when I looked last month, but they show a very interesting, if outdated, picture. What was the cost of the computer infrastructure needed to support the kind of webcrawling whose output OpenAI used to train GPT-3 in the late 20-teens? Between 2012 and 2018 Common Crawl paid just US$ 65,012 for computer equipment.
Over the three years reported, it spent an average of US$ 129,677 per year on software development, US$ 36,971 on webhosting and maintenance, and US$ 129,677 on admin expenses.
A frugal operation! Its April 2018 webcrawl alone comprised 230 TiB of uncompressed content from 3.1 billion web pages, all made available for anyone to download. The infrastructure to do that cost only around 40,000 bucks?
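For a rough sense of scale, here is a back-of-envelope sketch. The figures below are my own assumptions (S3 list prices, uncompressed sizes), not numbers from Common Crawl’s filings, and the archives are actually stored compressed, so the real bill would be smaller; the order of magnitude is the point.

```python
# Back-of-envelope arithmetic using my own assumptions, not Common Crawl's
# filings: what does one crawl imply about scale and hosting cost?
crawl_bytes = 230 * 2**40          # 230 TiB of uncompressed content
pages = 3.1e9                      # 3.1 billion web pages

avg_page_kb = crawl_bytes / pages / 1024
print(f"Average page size: ~{avg_page_kb:.0f} KB")   # roughly 80 KB per page

# Assumed S3 Standard list price of ~US$0.023 per GB-month (it varies by tier
# and region, and sponsored storage costs Common Crawl nothing, which is the point).
gb = crawl_bytes / 1e9
monthly_storage_cost = gb * 0.023
print(f"List-price storage for one crawl: ~US${monthly_storage_cost:,.0f}/month")
# Roughly US$5,800/month for a single crawl, before bandwidth, and Common Crawl
# publishes many crawls a year and hosts petabytes in total.
```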
I imagine Common Crawl has geared up a bit in the few years since their last filing available online. They’ve hired a CEO, spruced up their website, and gotten a lot less casual about their online terms of use, which until 2023 said, “We didn’t produce the crawled content, we just found it on the web. So we are not vouching for the content or liable if there is something wrong with it.”
But how they afford a massive webscraping operation and the hosting of huge amounts of data is no mystery: it’s paid for by Amazon.
Amazon has long sponsored Common Crawl’s crawling and the hosting of huge amounts of data through its Open Data Sponsorship Program. Check it out here: https://registry.opendata.aws/commoncrawl/
None of this accumulation of data to train LLMs would have happened without Amazon. They probably got a tax write-off for donating all of that storage space.
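The crawl archives sit in a public S3 bucket under that program, and anyone can browse them without an AWS account. Here is a minimal sketch using boto3 with anonymous (unsigned) requests; the bucket name commoncrawl and the crawl-data/ prefix come from the registry page linked above.

```python
# A minimal sketch of browsing Common Crawl's archives in the public S3 bucket
# that Amazon hosts under its Open Data Sponsorship Program. Uses anonymous
# (unsigned) requests, so no AWS account is needed.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the top-level crawl snapshots (one prefix per crawl, named CC-MAIN-<year>-<week>).
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", Delimiter="/")
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```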
Webscraping: the good and the bad
Cory Doctorow wrote a blogpost last year that looks at webscraping. He makes the very useful point that there are some real benefits to webscraping: having a public record of the internet, of how and when it has changed, of when tech companies have changed their terms of service, or of how companies manipulate pricing, and so on. Heck, I consulted the Wayback Machine just now to check on Common Crawl’s terms of service. (It’s too bad it’s private capital that has paid for these webscrapes and archives of the internet… this sounds like something better done by the public sector…)
Doctorow has been angry for decades at the ways capital—big companies—use copyright to protect their interests. I don’t think anyone is going to change his mind on that and it means his perspective is often at odds with those who are promoting the benefits and importance of copyright. Like me.
But he’s even angrier at big tech, and it makes for some very good reading in the piece linked above. Here’s a long excerpt as he talks about the usefulness of copyright in constraining the worst of AI practices.
Copyright has some uses in creative labor markets, but it’s no substitute for labor law. Likewise, copyright might be useful at the margins when it comes to protecting your biometric privacy, but it’s no substitute for privacy law.
When the AI companies say, “There’s no way to use copyright to fix AI’s facial recognition or labor abuses without causing a lot of collateral damage,” they’re not lying — but they’re also not being entirely truthful.
If they were being truthful, they’d say, “There’s no way to use copyright to fix AI’s facial recognition problems, that’s something we need a privacy law to fix.”
If they were being truthful, they’d say, “There’s no way to use copyright to fix AI’s labor abuse problems, that’s something we need labor laws to fix.”
This lie of omission is great tactics. It demoralizes many AI critics at the outset, who’ll say, “Well, I like all these benefits the world gets from scraping, so I guess I have to put up with all the bad stuff these giant companies are doing.”
And for critics who are too angry for that, this lie turns them into people who explicitly align themselves against all the benefits scraping delivers. These critics end up helping the AI companies.
When critics get suckered into saying, “You can’t have the benefits of AI without the costs,” they’re doing PR for the AI companies. There are plenty of people who’ll hear that statement and say, “OK, well, I guess privacy and labor rights are the price we’ll have to pay.”
We don’t have to be patsies for the AI companies. We can call them on their bullshit.
Cory, I hope you forgive me for citing such a big block of your terrific essay. It’s not clear whether you’ve published the website under a CC license, but I’m kind of assuming you have.
I agree with Doctorow that better privacy laws and better labour laws are going to be essential in an AI-enshittified world. But I do think copyright law may have more of a role to play in managing this situation, as I developed in “Good AIs copy, great AIs steal”. There are lots of good reasons to webscrape, i.e. to copy things that might later change, but the line can be drawn when the copying is used to create tools that harm the interests of those whose works have been copied. It’s simple enough.