More than two years ago I started this newsletter, fascinated as I was by the development of this new generative AI technology. It looked so incredibly powerful and useful, but at the same time it seemed built on such a dubious, or at least debatable, reading of a tricky doctrine of copyright law.
The promising technology grew into an industry, capitalised first in the tens and then the hundreds of billions, and then, in the eyes of the hypemakers, became the foundation of a new universal utility, the infrastructure for a new industrial revolution, and so on. But the weakness of the legal foundation didn't change at all. Each day the legal uncertainty stands out more starkly, and the stakes just keep mounting.
Roughly two years later, where are we? Regular readers will know it has been some months since my last newsletter, so this seems a good chance to sit back and take stock: how will this play out? (I write this while taking a short Deepavali weekend break in Laos; I'm loving the coffee culture here, not to mention the creators, scholars, publishers and archivists I've met.)
There seem to be two possible paths at the moment.
Path One: Going legit
The first path is that AI training will gradually come into compliance with copyright, and that the future of AI training will be built on the basis of licensing and consent, credit and (some) compensation. In this scenario, AI training will pass through its "Napster moment", as predicted by my colleague Simon Chesterman.1
Path Two: “Updating” copyright law
The second possibility is that the hyperscalers push for, and gain, a meaningful change in copyright law, sprinkling magical pixie dust over their “original sin”. I think this is much less likely, but AI leaders like Satya Nadella and Andrew Ng are calling for it more and more openly. My guess is that this is happening as they see the prospect of a successful fair use defence receding from view (and as they realise how financially punitive a negative judgement could be: see the “Quadrillions, not billions” issue of the newsletter).
You’ll notice I do not include, as a likely outcome, a US fair use ruling in favour of the hyperscalers that drives acceptance of the status quo. Of course, that remains possible.
Why do I think the first path is the more likely?
Licensing is happening. Now yes, much of this licensing is for RAG access or other kinds of post-training rather than for large-scale pre-training, but licensing for pre-training is also increasingly common, and more commonly discussed. Where licensing exists, the fair use defence is weaker, because there is now a clearly identifiable licensing market that unlicensed training would harm (in addition to the more generalised harm that the operation of the models themselves causes to broader creator markets, which I still believe should be legally determinative).
The Brussels effect is real. The US doctrine of fair use is only part of the story. European law here will be very influential, as it was with privacy regulation. Under current interpretations of the European TDM exception in Article 4 of the Digital Single Market Directive, any commercial large language model training was illegitimate before 2021, and is allowed after 2021 only where copyright holders have not reserved their rights. Anyone seeking to deploy their models in the EU will need to show that they have followed European law. There are other interpretations of the DSM exception holding that it doesn't apply to LLM pre-training at all. I happen to agree with these interpretations, as set out for example in the "German paper"2, but they are debatable. Like the corresponding Singapore law, the EU's DSM exception was not written with the power of generative AI in mind. In one sense, though, the debate doesn't matter: if the exception applies, hyperscalers need to license (wherever rights holders have reserved their rights); if the exception doesn't apply, hyperscalers need to license.
There may be a shift from returns on sheer scale of content towards returns on quality of content, for better AI pre- and post-training. This is of course uncertain. The emerging supply chain seems clear only until a new tweak changes things completely. And the AI hyperscalers and application builders are having lots of fun right now getting bespoke content by generating it themselves (though they can only do so because the generating models were themselves trained on high quality content). I do think that publishers and rightsholders will end up playing in a market for improved content, as the gap between domains shrinks over time, as LLM post-training becomes more nuanced, as small models get more potent, and as advances come from the interaction of networks of agents that don't themselves need more powerful models. But for this to happen, rights holders have to have incentives to invest, and safeguards against misuse, which means more legal certainty that their work is not just grist for someone else's mill.
Recent high-profile court decisions on fair use seem to bode ill for the AI trainers. Here I'm referring to Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, but also to the Second Circuit's appellate decision in the Internet Archive case.
Easier to change the law than comply?
As for the second path, well, Microsoft CEO Satya Nadella's recent comments received lots of publicity, including his rather weak recourse to the “AIs learn like humans” analogy. As covered in the Times of London (Katie Prescott, “Microsoft boss wants AI rethink”, 22 October 2024):
The boss of Microsoft has called for a rethink of copyright laws so that tech giants are able to train artificial intelligence models without the risk of infringing intellectual property rights.
Satya Nadella, chief executive of the technology multinational, praised Japan's more flexible copyright laws and said that governments needed to develop a new legal framework to define “fair use” of material, which allows people in certain situations to use intellectual property without permission.
...Speaking after Microsoft's launch of virtual employees at an event in London, he compared the situation to that of using information from textbooks to formulate new ideas. “If I read a set of textbooks and I create new knowledge, is that fair use?” he said.
I guess the easiest way to make this change is via a new exception; after all, we have to assume these leaders have a specific idea for “updating” copyright law in mind when they go public with it. Certainly Google has called for a new exception in the UK in its recent white paper (Google, “Unlocking the UK’s AI Potential”, white paper, September 2024).
But it remains unclear what that exception would look like. I can report from Singapore that our current exception, the world's most generous in favour of "computational data analysis", is not giving model trainers the comfort they thought it might. Nor does Japan's exception seem the slam dunk that Nadella suggested in his remarks, a suggestion probably meant more as lobbying aimed at UK lawmakers than as the considered view of Microsoft's in-house counsel on the Japanese position.
One interesting interpretation of the language of the Japanese exception, which I heard at the STM conference before Frankfurt, is that it will not apply where licensing is on offer, a reading based on the exception's incorporation of wording from the Berne Convention's three-step test. Possible, though even after the publication of the Japanese Copyright Office's paper on the exception, I'm still hearing very different interpretations of its meaning and application.
But this suggests a potential convergence between our two paths. New exceptions may seem at first to give cover to model builders, but if they are hedged round with enough conditions and guardrails (to bring them in line with Berne, for example), then they end up being functionally similar to the EU's “opt-out” for rightsholders3, in the sense that licensing is still preferred in many cases.
Or, in 48 hours Donald Trump may have won the US election, and Elon Musk will be determining US policy here. That could well blunt the Brussels effect. OMG, I've just read that Elon's donations to Trump super PACs are up to US$114m. Maybe it's just more fun, if you are a mega-billionaire, to change laws rather than comply with existing ones… Maybe these two paths aren't going to converge after all.
OK, good night. My ballot has been received in the New Hampshire township where I will vote. Sorry for the abrupt end to this newsletter, but I am now going to spend the next 48 hours trying not to obsess about the US election.
PS: Given how widely the “input question” has now been socialised, I imagine that the newsletter will not appear as frequently as in the past, when I was sounding the alarm bells about Books3, etc. But I'm not going to be looking away!
See Simon Chesterman, "Good models borrow, great models steal: intellectual property rights and generative AI", Policy and Society, 12 February 2024, https://doi.org/10.1093/polsoc/puae006
Tim W. Dornis and Sebastian Stober, “Urheberrecht und Training generativer KI-Modelle – technologische und juristische Grundlagen” (“Copyright law & training of generative AI – technological and legal foundations”), 4 September 2024, available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214
Just so you know, copyright mavens remind us all that copyright is an inherent right, so there should be no “opting” required for copyright protections. The relevant exercise under European law would be a rights reservation, a restatement of one's existing rights.