This inapt analogy...
...is not a basis for blowing off the most important factor in the fair use analysis
A busy time for the AI and Copyright story! The two recent summary judgements in high-profile lawsuits against Anthropic and Meta have generated plenty of headlines that read like victory for big tech. We will talk about this below, don't worry! Disney is now in the game, together with other Hollywood heavies, in a lawsuit against Midjourney, with many images of infringing output in their complaint... And we still don't know who is in charge at the US Copyright Office…1

I started writing this from Taiwan, land of the AI chip boom and wonderful bookshops and publishers (and amazing food too!) But our first stop is in Europe, where the process of drawing up the AI Act code of practice is seemingly stalled, and in danger of being pushed back further. Rightsholders have objected strongly to the 3rd draft of the code.2 One troubling element of the draft was the idea that models already on the market would not have to reveal anything about their use of copyrighted content. It's ironic, because discovery in the most advanced US legal cases is giving us quite a bit of transparency around the copyright material used for training existing LLMs, and the different practices the model builders followed here. Important records are now public, at least for Meta and Anthropic, as revealed in the two recent judgements.
The headline: Meta and Anthropic used as many books as they could get their hands on, from pirate libraries, knowing they were pirated. Not much that is proprietary here, since everyone seems to have done it!
OK, I don't want to be a broken record on that subject. The interesting part is the details. Anthropic collected metadata on millions of pirated books, including author names and ISBNs (and therefore publisher names). As per the account in Judge Alsup's summary judgement, Anthropic "'ha[d] many places' from which it could have purchased books, but it preferred to steal them to avoid 'legal/practice/business slog,'" (OK, the emphasis is mine... and the quotes within the quote are the words of CEO Dario Amodei.)
Anthropic cofounder Ben Mann personally downloaded Books3 in February 2021. In June 2021 he downloaded five million books from LibGen, and in July 2022 the company downloaded around five million books from Pirate Library Mirror, or PiLiMi, "which Anthropic knew had been pirated." Mann was one of those who jumped ship from OpenAI, and is one of the lead authors on the OpenAI GPT-3 paper, which mentions a mysterious dataset called Books2.
And this wasn't just a download of a mass of undifferentiated text. Anthropic invested time and money to create a kind of universal library that they could analyse and draw on indefinitely, even after pre-training various models. It makes sense — the country of geniuses will need a good library. In the words of Judge Alsup, Anthropic wished to collect "all the works in the world and keep them forever with no further accounting". In a further twist, they hired Tom Turvey, whom some of us will know from his time running the Google Books programme, to help them build that library.
According to the account in Judge Chhabria's ruling, Facebook first downloaded LibGen in 2022, just to, you know, have a look, and in spring 2023, after escalating the matter to Mark Zuckerberg, "decided to just use the works acquired from LibGen as training data". After analysing the relevant metadata, they realised that LibGen contained most of the works they might want to license, and so gave up on the idea of licensing. In early 2024 Facebook downloaded Anna's Archive as well, which compiles all the main shadow libraries.
But, and you knew this was coming, all the evidence for what Judge Alsup called the model builders' "preferring to steal" does not mean that rightsholders won their cases. Said Chhabria: "The plaintiffs are wrong that the fact that Meta downloaded the books from shadow libraries and did not start with an 'authorized copy' of each book gives them an automatic win."
Battle lost. Prospects for the war?
So no automatic win. In fact, no win at all, not yet, although Anthropic's piracy is still a matter before Judge Alsup. But at least we are now beginning the judicial process of thinking this all through, and much of this thinking—as reflected in these decisions—is favourable to rightsholders. Still a long way to go, I reckon.
Not so favourable, though, this first bit: both judges see AI training as a transformative use. Judge Alsup says it is not just transformative, it is "spectacularly so".
Yes, the concept of a transformative use has come a long way from Judge Leval's 1990 article about how parody and criticism are transformative of the literary works they modestly quote. Transformation is now understood widely—and by these two judges—to mean a new, technologically mediated use. I think it could be productive to pick apart the history of how this concept changed, but we'll save that for later!
So it is transformative, but is it fair?
But victory on the first factor is not determinative, as the Supreme Court's recent Warhol judgement reminds us. The real action is in the fourth factor, the part where courts consider the effect of the copying on the market.
They took contrasting approaches, Judge Chhabria seeing real market harm, and Judge Alsup dismissing it with a heavy—a super-heavy—lean into the human metaphor.3 Alsup served up the tired metaphor in a few different variations, one of which:
“if someone were to read all the modern-day classics because of their exceptional expression, memorize them, and then emulate a blend of their best writing, would that violate the Copyright Act? Of course not.”
No indeed! But this straight-out "computers learn like humans" framing really misses the point, and it was too much even for Judge Chhabria, who went out of his way to write against Alsup's decision (published just a few days before):
“But when it comes to market effects, using books to teach children to write is not remotely like using books to create a product that a single individual could employ to generate countless competing works with a miniscule fraction of the time and creativity it would otherwise take. This inapt analogy is not a basis for blowing off the most important factor in the fair use analysis.”
Some of the press coverage said Judge Chhabria "respectfully disagreed" with Judge Alsup (and the defendant's amici too). Maybe just "disagreed" would have been a more apt description.
Round and round in the circle game
Now, Chhabria completely rejected one of the plaintiffs' two arguments for market harm, and I think he really got this part wrong.4 He kept saying the plaintiffs could not argue that the piracy of their books deprived them of potential revenue from licensing their books for LLM training. This is asserted a few times in the judgement, and my curiosity as to why kept me reading...
It turns out Chhabria dismisses the claims about lost licensing revenue by invoking the circularity problem, stating that "harm from the loss of fees paid to license a work for a transformative purpose is not cognizable." This reasoning suggests that authors aren't entitled to create a market for AI training licensing at all, ever, because the use is transformative.
It’s worth quoting his reasoning at length:
"In every fair use case, the 'plaintiff suffers a loss of a potential market if that potential [market] is defined as the theoretical market for licensing' the use at issue in the case. Therefore, to prevent the fourth factor analysis from becoming circular and favoring the rightsholder in every case, harm from the loss of fees paid to license a work for a transformative purpose is not cognizable."
I get that there's a danger of circularity. But to deny even the possibility of a loss of revenue here is also circular—as in short-circuiting a holistic consideration of the four factors. Here Chhabria seems to ignore the advice he is giving to fellow judges (I'm looking at you, Judge Alsup)—citing the Supremes—that they shouldn't be "robotically applying concepts from previous cases without stepping back to consider context", especially in cases where technology changes. This is especially true when his own judgement (and Alsup's) gives evidence that the companies initially saw the need for licensing, and entered into negotiations with rightsholders, before deciding that, hey, piracy was faster and they'd probably get away with it! It is truer still now that the market for licensing is a real commercial factor: Wiley just reported US$40 million in AI licensing revenue in the year ended April 30, 2025... its stock jumped nearly ten percent on that little AI bump!5
Dilution is harm too…
Still, it seems a bit churlish to go after Judge Chhabria when he made a spirited case for the possibility of market harm to rightsholders.
“…by training generative AI models with copyrighted works, companies are creating something that often will dramatically undermine the market for those works, and thus dramatically undermine the incentive for human beings to create things the old-fashioned way.”
This is the same reason that any copyright exception allowing AI training would violate the Berne Convention three-step test.
The main effect that Chhabria sees is market dilution, not one-for-one substitution.
“...less similar outputs, such as books on the same topics or in the same genres, can still compete for sales with the books in the training data. And by taking sales from those books, or by flooding stores and online marketplaces so that some of those books don’t get noticed and purchased, those outputs would reduce the incentive for authors to create—the harm that copyright aims to prevent.”
In fact, he develops a whole theory of market harm: why some books are more likely to be crowded out than others. He defends his theory against arguments from the Google Books case (looks solid to me!) and sets out how an advocate making a market dilution case might argue it... it's almost like an instruction book for future cases, I'm not kidding...
And he regrets that the lawyers for Kadrey et al. did not raise this argument...
Market harms aplenty
I think Judge Chhabria is right about market dilution.
We can all see some of the ways that LLMs lead to market harms. But pointing out market harms from the introduction of LLMs is not enough to win cases: the people bringing the suit have to be the same ones suffering the harm, and they have to establish this harm as a fact, or at least as likely enough to move the case to a jury trial.
For some weeks now I've been mulling over a newsletter making the point that LLMs mean rightsholders are losing control of subsidiary or derivative rights in their creations. These revenues can be very important to publishers and authors. Will the effectiveness of machine translation mean that the market for translation rights sales starts to shrink? It is certainly happening already at the article level in scholarly publishing... And it's not just translation—the effect applies to audio versions, summaries, various editions for special audiences, the exam questions, the teacher's guide, the podcast of the book (looking at you, Google NotebookLM). This is a harm distinct from direct market substitution, crowding out or denial of licensing revenue. It undermines authors' fundamental right to determine how their creative expressions are used in derivative works — a right explicitly protected under copyright law.
The models are good at some things and bad at others. They are very good at creating derivatives of existing content.
The fact that users can create all these subsidiary versions of copyrighted work they have (hopefully legal) access to is not necessarily a bad thing for humanity in and of itself (er, probably not? I'm sure there won't be any unintended consequences?). I'm just saying it's a reason that rightsholders should be compensated for the use of their works to train these models.
And then there's the real market harm that newspapers and news outlets are feeling (Judge Chhabria noticed this too). Whereas Google used to send one click to a publisher for every two pages it scraped on their website, now, according to Cloudflare CEO Matthew Prince, who is watching the crawlers closely,
Six months ago, the ratio was one visitor for every six pages, and now it's one for every 18. OpenAI sent one visitor to a publisher for every 250 pages it crawled six months ago, while Anthropic sent one visitor for every 6,000 pages. These days, OpenAI sends one visitor to a publisher for every 1,500 pages, whereas Anthropic sends one visitor for every 60,000 pages.6
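To make the magnitudes concrete, here is a minimal back-of-envelope sketch in Python. The pages-crawled-per-visitor figures are Prince's, as quoted above; the decline multipliers are just simple division, nothing more.

```python
# Pages crawled per visitor referred, per Matthew Prince's figures
# quoted above: six months ago vs. now, for each crawler.
ratios = {
    "Google": (6, 18),
    "OpenAI": (250, 1_500),
    "Anthropic": (6_000, 60_000),
}

for crawler, (six_months_ago, now) in ratios.items():
    decline = now / six_months_ago  # how much worse the ratio got
    print(f"{crawler}: {six_months_ago} -> {now} pages per visitor "
          f"({decline:.0f}x fewer referrals per page crawled)")
```

On those figures, referrals per page crawled fell roughly threefold for Google, sixfold for OpenAI and tenfold for Anthropic, all in six months.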
Seems like a systematic transfer of value to me! The story continues…
Kate Knibbs, "No One Is in Charge at the US Copyright Office", Wired.com, June 27, 2025, https://www.wired.com/story/us-copyright-office-chaos-doge/
As for example this end-March filing from IFRRO: "Therefore, unless specific improvements are introduced, we submit that the AI Office must abstain from any claims that the CoP's final outcome, in any manner, reflects a process whereby the concerns and recommendations of the authors, publishers and other rightsholders involved were also considered."
Been covering the battle of the metaphors since January 2023, and loving three-eyed Mickey just as long…
I think I will be reading more on the circularity problem in fair use cases, however… it looks like a bit of a structural problem with the whole doctrine…
Judge Chhabria's judgement also trusted Meta's expert witness, who claimed that Meta's mitigation efforts prevented more than 50 words of any text used in training from being reproduced in outputs. That seems a bit rash considering the findings of the papers we discussed in the last newsletter...
Mariella Moon, “Cloudflare CEO says people aren't checking AI chatbots' source links”, Engadget, https://www.engadget.com/ai/cloudflare-ceo-says-people-arent-checking-ai-chatbots-source-links-120016921.html