In the last edition I made a number of sweeping statements, perhaps a bit rashly, speculating that “an end game” may be in sight for the uncertainty around AI and copyright.1 I could see the following trends: 1) a growing realisation that fair use would be a very tough win in the US courts, 2) recognition of the Brussels effect in copyright law (the example and extraterritorial effect of EU law) and 3) the ongoing licensing of copyrighted works by AI companies. It seemed to me that convergence would lead to a norm (in law in some jurisdictions, just prudent practice in others) of allowing copyright holders to reserve their rights, or opt out, from AI training.
UK says: Opt Out
Sure enough, some weeks thereafter the UK announced its grand proposal for a copyright compromise between AI and creative industry interests, an EU-style opt-out. Creative industry backlash was quick and fierce.
38,000 signatories to the Statement on AI Training - “The unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted.”
The Telegraph: “Labour is on a quest to erase anything uniquely British” (article behind a paywall, but the headline makes the link worth it...)
in the FT: “AI’s assault on our intellectual property must be stopped” (also $$$wall)
in Film Stories magazine, “The UK government mulls letting AI firms steal copyrighted work”
You get the picture. You must follow Ed Newton-Rex on Bluesky for this and more.
While an opt-out sounds very good in principle (as argued by Ben Sobel way back in 2017), critics of the UK’s proposal point out the immense practical difficulties, the biggest being that one’s content is often distributed all over the internet on sites one cannot control. (NUS Press recently discovered that one of our titles had been added to LLM training material after a librarian mistakenly posted the entire work to their library’s repository, just in time for it to get ingested...)
Do you have standing?
But one element my analysis did not take into account was the range of other arguments, aside from fair use, for dismissing copyright holder claims in the US litigation. One important argument became visible in a November judgement from Judge Colleen McMahon of the Southern District of New York, dismissing the DMCA case (not a copyright infringement claim per se) brought by the journalism outlets Raw Story and Alternet Media against OpenAI. See Kate Knibbs’ excellent coverage.
And from the judgement:
“Plaintiffs allege that their copyrighted works (absent Copyright Management Information) were used to train an AI-software program and remain in ChatGPT's repository of text. But Plaintiffs have not alleged any actual adverse effects stemming from this alleged DMCA violation. The argument advanced by Plaintiffs is akin to that of the dissent in TransUnion: "If a [defendant] breaches a [DMCA] duty owed to a specific [copyright owner], then that [copyright owner] ... [has] a sufficient injury to sue in federal court." Id. at 450 (Thomas, J., dissenting). To this, the majority of the Court said: 'no.' "No concrete harm, no standing." Id. at 442. Accordingly, Plaintiffs lack Article III standing to seek retrospective relief in the form of damages for the injury they allege.”
Using this sort of logic, judges seem to want to see a specific harm to the plaintiff in outputs, not just a generalised wrong. (This is one downside of making copyright policy in the courts.) This is why the NY Times and Universal Music cases (among others) were at pains to show copied material in LLM outputs, where earlier cases just focused on the fact that inputs were copied without credit, compensation or consent. Remember that Judge McMahon’s decision turns on the harm of removing copyright management information (an offence under the US DMCA), not the harm of making copies. That’s a different story. I don’t think this changes my analysis.
“Let us be clear about what is really at stake here. The alleged injury for which the plaintiffs truly seek redress is not the exclusion of CMI from defendant’s training sets, but rather the defendant’s use of plaintiff’s articles to develop ChatGPT without compensation to plaintiff,” Judge McMahon writes. “Whether there is another statute or legal theory that does elevate this type of harm remains to be seen. But that question is not before the court today.”
An Annoying Lawyer
I heard another variation on this evasion of claims from an Australian technology lawyer, in a presentation organised by the IMF and a finance industry group. He was very confident (the quote is from my notes):
"Models do not copy works when they are trained, they do not copy them in the internal workings of their models. Unfortunately they sometimes produce copies in their output and we will have to adjust copyright law to manage this."
He made many other silly sweeping statements, like “China is so far behind the US, it isn't really a factor in AI development now”. So I shouldn't have let this bother me, but it did...
The lawyer’s characterisation of what happens in models is easily refuted.
Of course copying takes place in training. First, there is all the copying needed to get works into a usable form for ingestion. Sean Presser (remember him?) spent about a week’s work converting a cache of pirated epubs into usable text files for inclusion in The Pile. Secondly, once you've combined all your files, stripped out the html, gotten rid of the repetition, etc., the texts are split up into tokens (words, parts of words, punctuation: the process of tokenization). Each token is then given a number key (encoding). The different approaches here can lead to different outcomes from pre-training; this is very much part of the process. But you can easily turn the numbers back into tokens and combine the tokens back into the original texts. To say this is not copying the text is to say that a high resolution digital version of an image is not the image.
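To make the reversibility point concrete, here is a minimal sketch using OpenAI's open-source tiktoken tokenizer (the sample sentence is mine; the only point is that encoding and decoding round-trip exactly):

```python
# A small demonstration that tokenization and encoding are reversible:
# the token IDs are just another representation of the text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one of tiktoken's published encodings

text = "The quick brown fox jumps over the lazy dog."
token_ids = enc.encode(text)                 # a list of integers, one per token

# Decoding the numbers returns the original text, character for character.
assert enc.decode(token_ids) == text
print(len(token_ids), "tokens")
```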
Then the encoded sequence is fed into the model, which tries to predict missing or upcoming tokens in the sequence. So that sequence of tokens, portions of the works, is held inside the model during pre-training, the subject of “attention” at this point. Finally, once you've made all the adjustments to the “weights” at the different levels of the model based on this auto-complete exercise, you bring new texts into training. The weights (billions of them) that emerge at the end of pre-training are no longer strictly or only the training data in compressed form; they are also a statistical model of the relationships between words in a huge corpus. But that statistical model is of such accuracy and potency that it can reproduce the training data (especially works that were repeated in the training data).
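For illustration, here is a stripped-down sketch of that “auto-complete” objective in PyTorch. The two-layer stand-in model and all the sizes are made up for the example, not any lab's actual setup:

```python
# A toy sketch of the next-token-prediction objective used in pre-training.
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512            # illustrative sizes
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),       # token IDs -> vectors
    nn.Linear(d_model, vocab_size),          # vectors -> scores over the vocabulary
)

token_ids = torch.randint(0, vocab_size, (1, 128))      # one encoded passage
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # predict each next token

logits = model(inputs)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()   # gradients nudge the weights toward better "auto-complete"
```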
Scaling at inference
The second very sweeping statement I made in the last newsletter was that the AI industry would be moving away from waiting for returns to scale of training data, and looking in other directions, like quality of data. This became the conventional wisdom just days after my newsletter. “We’ve achieved peak data and there’ll be no more,” said former OpenAI Chief Scientist Ilya Sutskever at NeurIPS; “pre-training as we know it will unquestionably end.” The NYT headline was “Is the Tech Industry Already on the Cusp of an A.I. Slowdown?”.
While there are 9.5 petabytes of internet data sitting in the storage Amazon gives free to CommonCrawl, the amount of original high quality text, deduplicated, cleaned of html code, etc, seems to be capping out around the 15 trillion tokens used to train some of the most recent models. (And then there's the synthetic data - why does the highly potent Chinese DeepSeek model think it is ChatGPT?).
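A rough back-of-envelope, assuming roughly four bytes of English text per token (a common heuristic, and an assumption here), shows how little of the raw crawl survives as usable training text:

```python
# Back-of-envelope: raw CommonCrawl storage vs. usable, deduplicated text.
# The bytes-per-token figure is an assumed average, not a measured one.
raw_crawl_bytes = 9.5e15     # ~9.5 petabytes of crawl data
clean_tokens = 15e12         # ~15 trillion tokens in recent training sets
bytes_per_token = 4          # assumed average for English text

clean_text_bytes = clean_tokens * bytes_per_token        # ~60 terabytes
print(f"usable text ≈ {clean_text_bytes / 1e12:.0f} TB")
print(f"share of raw crawl ≈ {clean_text_bytes / raw_crawl_bytes:.1%}")   # well under 1%
```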
Rather than see this as a slowdown, god forbid, the industry is now talking about “scaling inference”. Instead of investing more in larger training runs, more computing power is being thrown at answering individual queries. Actually this has been the case since GPT-3.5 as far as I can tell, with the advances in performance coming from further reinforcement learning on top of pre-trained models, from fine-tuning, and from connecting models and other components together in networks. (For a deep dive into this see Sebastian Raschka's post from a year ago, Model Merging, Mixtures of Experts, and Towards Smaller LLMs.)
Some of the models in these evolving networks act as agents; increasingly, some components will be more symbolic in nature (listen to this excellent Sam Charrington podcast with a senior Amazon AI researcher who comes from a symbolic, good old-fashioned AI background). Even Gary Marcus predicts more neurosymbolic AI, i.e. the mixing of LLMs with symbolic models. Other modules will do plain old data retrieval (a la the RAG architecture).
I just finished the free DeepLearning.AI course “Reasoning with o1”, which shows off OpenAI's last-but-one model, RL-trained to take a step-by-step approach to answering questions, and which accordingly uses a heck of a lot more tokens talking to itself as it works on your question. Adding one or two more orders of magnitude of tokens to answer the same question better is an OK tradeoff if tokens keep getting cheaper and you are a modest user like me, but multiply that across the 51% of companies that tell Bain that AI implementation is a top-three priority, and you need a lot more computing power to provide those answers! Which is why the top four hyperscalers are going to invest US$ 300 billion in chips and data centers in 2025, as per Morgan Stanley.
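To see why the bill grows so fast, here is a crude cost sketch; the prices, token counts and query volumes below are made-up assumptions, not OpenAI's, Bain's or anyone else's figures:

```python
# Crude arithmetic for "scaling inference": the same question answered with
# two orders of magnitude more reasoning tokens. All numbers are assumptions.
price_per_million_tokens = 10.0      # assumed US$ per million output tokens
tokens_plain_answer = 500            # a direct answer
tokens_reasoning_answer = 50_000     # a step-by-step, talking-to-itself answer

queries_per_day = 1_000_000          # an assumed enterprise-scale workload

def daily_cost(tokens_per_answer: int) -> float:
    return queries_per_day * tokens_per_answer / 1e6 * price_per_million_tokens

print(f"plain answers:     US$ {daily_cost(tokens_plain_answer):,.0f} per day")      # 5,000
print(f"reasoning answers: US$ {daily_cost(tokens_reasoning_answer):,.0f} per day")  # 500,000
```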
More investment, more expense in energy, chips and water, less legal certainty...
Regular readers will notice the declining pace of updates. The fact that I read 51 books for pleasure last year is not the reason. After all, some of the best of them were on AI.
However, this important project for NUS Press may have had something to do with it: lots of extra hours on the day job, and I was very happy to put them in for the cause of better pandemic preparedness.
But mostly, it’s just that the pace is increasing so much and there are so many great reporters on the beat able to devote more time and attention to this. (Kate Knibbs is linked to twice here… and see also her infographic and tables on the legal cases.) Johan Cedmar Brandstedt is fighting the good fight on LinkedIn. Ed Newton-Rex is not only writing and tweeting about this, he’s also creating business solutions.
Copyright folks will know
And if you want to, or must, follow the US legal cases, check out the invaluable AI Cases Bot on Bluesky. Just a few alternatives!
Edited slightly on 5 Jan to make clearer that Judge McMahon’s ruling was based on the DMCA claim, namely that OpenAI had removed copyright management information from texts it had copied. It did not address the copyright violation per se.
I was also a bit freaked out by the fact that Elon Musk had donated US$ 150m to the Trump campaign, but I see that estimates are now up to US$ 250 million… more evidence of American confusion around the First Amendment and corporate speech… which includes AI-generated speech…