calls for transparency

and a worrying lack of response from policy-makers

Nov 04, 2023

Welcome to our new readers. (Well it has been a month since the last newsletter, so there are quite a few new readers…) Very pleased to see so many writers on the list, as well as artists, journalists on this beat, publishers, agents and a few lawyers and copyright office officials as well. Thanks for joining us.

[Photos: On the slopes of Mt Snæfell]

It was quite a month - my travel to Iceland for the IFRRO conference was followed by a few days with family and friends on the awesome Snæfellsnes Peninsula (pictures of my travels are something of a tradition with the newsletter, hope that’s OK), and then by the Frankfurt Book Fair, with several other intense copyright discussions. So wonderful to see and meet so many of you as part of these discussions. IFRRO was intense, and a highlight in Frankfurt was the excellent Rendez Vous session of the Federation of European Publishers/Fédération des éditeurs européens along with the FEP’s Quentin Deschandelliers, Danish Rights Alliance’s Thomas Heldrup and Hachette General Counsel Arnaud Robert, where we talked about the practical measures publishers can take to try and gain more control over their content.

One important point was on everyone’s mind in Europe.

The battle for transparency

With so many legal cases underway in the US,[1] publishers and creators are turning their attention to AI regulation in Europe, and specifically to ensuring that the requirements for transparency in AI training data included in the European Parliament’s version of the AI Act survive the Trilogue discussions.

Transparency of training data is important in at least three ways.

Progress in machine learning depends on understanding the relationship between inputs and predictions (or outputs in generative AI). LLM technology is still young, and there is much we don’t understand about how it works in detail, and about the sorts of features, representations, abstractions and derivatives of text (and images) encoded in the model weights at different layers of the models. It will be extremely difficult to probe and test models to better understand this if we are working only with outputs, and have no visibility on inputs. Testing the capabilities of the models requires understanding leakage from training data, and without transparency on training data it becomes that much harder to understand whether a model has really generalised, or is just reproducing a pattern it has seen in its training data (or something messily in-between). Researchers generally acknowledge that there is a “crisis in evaluation”[2], or at least that evaluation is “semi-broken”[3], and that this is at least partly caused by open research moving to closed product development too quickly.
Transparency in training data is absolutely necessary to understand possible social harm in the models. How is bias expressed from its sources in training data? Can harmful outputs be controlled by limiting training data, or are some harmful outputs inevitable given the way LLM language pattern matching works? What are the cybersecurity and privacy risks embodied in the models based on their training data? It would seem that transparency in the models is essential to building a solid basis for understanding the risks inherent in the models, and that basis is necessary if regulation is to be more than just a kind of forced patching of model outputs. A recent paper has revealed a new angle to the privacy problem: LLMs can synthesise and bring together at inference time personal information that may be distributed in their training data. See Staab et al., Beyond Memorization: Violating Privacy Via Inference with Large Language Models.
And most in focus for publishers, the legal uncertainty around the copyright status of the models can never be truly resolved without transparency of training materials. The more liberal exceptions to allow text-and-data mining of copyrighted content (EU, Singapore) state that miners must have legal access to the resources in question. But without transparency, such exceptions, and the EU opt-out for commercial use of material, will be non-functional, made redundant by the opacity of big tech practices. Do we really need to wait for a lawsuit-driven discovery process to understand what goes into the Books2 dataset that OpenAI trained its GPT-3 models on?
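To make the first point a little more concrete: one common, rough way to probe leakage is to check how much of a test item already appears word-for-word in the training text. The sketch below is only an illustration under stated assumptions, not any lab's actual method. The function names, the choice of word-level n-grams and the default of 8 words are all hypothetical, and it can only be run at all if the training documents are available, which is exactly what transparency would provide.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text (lowercased)."""
    words = text.lower().split()
    # A text shorter than n words yields an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(test_item: str, training_docs: list[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the training documents.

    A value near 1.0 suggests the test item was seen (almost) verbatim in training;
    a value near 0.0 suggests it was not, at least not word for word.
    """
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)
```

A high ratio for a benchmark question would suggest the model may simply be reproducing what it has seen; without access to the training documents, nobody outside the developer can run even this simple check.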

Speaking up on the matter

In the UK it was the Publishers Association, the Society of Authors, the Association of Authors’ Agents and the Authors’ Licensing and Collecting Society who issued a statement ahead of the AI Safety Summit, saying “An end to the opaque development of AI is long overdue…an issue on which the entire publishing industry is united.”

Ahead of the Frankfurt Book Fair, the Federation of European Publishers joined with the European Writers Council and the European and International Booksellers Federation (EIBF) to “call upon the European co-legislators to seize the opportunity of the AI Act to take decisive action to ensure the transparency of generative Artificial Intelligence (AI).”

And in the US, the News Media Alliance, the “voice of the news and magazine industry”, published a White Paper on generative AI that objects strongly to AI training on copyrighted material. Among the key recommendations of the report: “GAI systems should be transparent to publishers. Publishers have a right to know who copied their content and what they are using it for. The Alliance calls for strong regulations and policies imposing transparency requirements to the extent necessary for publishers to enforce their rights.”

Are Open Source LLMs more transparent?

As we’ve seen with Meta’s release of LLaMa-2, “open source” models are not necessarily open in many respects, and in fact reveal nothing more than their corporate cousins about the training data used for their models.

Proponents of open source AI argue from the Linux example, an amazing achievement of self-organized cooperation among individual developers that is a foundation of the modern internet. But it’s hard to imagine that open source LLMs will give us the same benefits as Linux, much less “democratisation”, when LLMs require hundreds of millions of dollars’ worth of computation and access to GPUs.

With that constraint, open source AI will always be fully sponsored by large corporate interests, whether Meta (LLaMa-2), the government of Abu Dhabi (Falcon) or a consortium of French sovereign wealth funds, US VCs, Eric Schmidt and the former French digital minister in the Macron government, Cédric O (Mistral). Said Emmanuel Macron, “On croit dans l’open-source,” but in AI it’s a business strategy for highly capitalized groups, not the kind of grassroots cooperation and openness that brought us a robust internet operating system.

And so what is happening with the EU AI Act?

From what news drips out of the EU Trilogue process, we understand that France is not in favor of retaining the transparency portions of the draft act as proposed by the European Parliament. See above re local AI champion Mistral. France is making a play to be the EU center of LLM development, and so seems to be listening to the tech companies’ claim that being transparent on training data would be unduly burdensome to European tech companies, and put them at a competitive disadvantage.

Probably most of this resistance comes from general principles, but some may be due to the details of the draft Act’s phrasing, apparently a very last-minute compromise, which says AI companies must publish “summaries of copyrighted data used for training”. I guess legislators thought this might seem less onerous, but I’m not sure that’s right. Under this phrasing, the tech companies will have to do two things: determine which of the material they have used is in copyright, and find a way to summarize that listing. That is a bit of a burden.

This has all the hallmarks of a last-minute compromise to be worked out later. It seems to me that the most efficient way to approach this problem would be to drop the conditions and just require the companies to publish the full inventory of materials used to train LLMs. Copyright holders can then audit whether their materials have been used, whether those materials were legally sourced, and, for European law, whether they were in fact covered by an opt-out at the time of training. The other benefits of transparency we’ve discussed then apply.
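For the technically minded, here is a minimal sketch of what such an audit could look like if a full inventory were published. Everything in it is an assumption for illustration: no inventory format has been specified anywhere, and the class names, field names and the idea of matching on a shared identifier (an ISBN or URL, say) are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class InventoryEntry:
    """One line of a hypothetical published training-data inventory."""
    identifier: str       # e.g. an ISBN or URL (format is an assumption)
    source: str           # where the developer says it obtained the work
    training_date: date   # when the work was used for training


@dataclass
class CatalogueWork:
    """One work in a rights holder's own catalogue."""
    identifier: str
    opt_out_since: Optional[date] = None  # date an opt-out was declared, if any


def audit(inventory: list[InventoryEntry], catalogue: list[CatalogueWork]) -> list[dict]:
    """List catalogue works that appear in the inventory, flagging opt-out conflicts."""
    by_id = {work.identifier: work for work in catalogue}
    findings = []
    for entry in inventory:
        work = by_id.get(entry.identifier)
        if work is None:
            continue  # not one of this rights holder's works
        # An opt-out only counts if it was in place on or before the training date.
        opted_out = (work.opt_out_since is not None
                     and work.opt_out_since <= entry.training_date)
        findings.append({
            "identifier": entry.identifier,
            "source": entry.source,  # lets the rights holder judge legal sourcing
            "opted_out_at_training": opted_out,
        })
    return findings
```

The point of the sketch is how little machinery is needed once the inventory exists: the three questions (was it used, where did it come from, was an opt-out in place) each become a simple lookup. Judging whether a given source was lawful would of course still need human and legal review.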

This also runs up against the tech companies’ view that the mix of training data used for any model is a trade secret, central to the competitiveness of particular models. This was a view first expressed by Ilya Sutskever and it has proven popular with LLM execs.

Irony much? Our use of your copyrighted works without your permission is our trade secret.

And did the call for transparency feature in the UK-hosted AI Safety Summit and President Biden’s Executive Order?

Er, well, not really. Neither the various announcements post-Summit nor Biden’s Executive Order grappled with transparency of training data or AI’s copyright problem. But at least the Executive Order kicked the copyright can down the road very cleanly:

…the Under Secretary of Commerce for Intellectual Property and Director of the United States Patent and Trademark Office (USPTO Director) shall:
…within 270 days of the date of this order or 180 days after the United States Copyright Office of the Library of Congress publishes its forthcoming AI study that will address copyright issues raised by AI, whichever comes later, consult with the Director of the United States Copyright Office and issue recommendations to the President on potential executive actions relating to copyright and AI. The recommendations shall address any copyright and related issues discussed in the United States Copyright Office’s study, including the scope of protection for works produced using AI and the treatment of copyrighted works in AI training.

This point on copyright was just one of five elements of follow-up delegated to the Under Secretary. The focus on protecting the intellectual property of the model developers found in the US voluntary code of practice gets more space. See above re irony much…

The UK Summit with its focus on AI safety and “frontier models” had even less to say about copyright or transparency into AI model training. (The UK does have an ongoing process for discussing all this, a series of roundtables dedicated to developing a code of practice. But with the folks round the table also facing each other in litigation it’s hard to see how that’s going to work.)

In the next newsletter - Are Gen-AI offerings really just held together with chewing gum and baling wire? The fragility of an architecture of prompts. Also, does worrying about “frontier” models mean you are saying there’s nothing you should do about current risk…?

[1] I’m sure you heard about the latest one, Universal Music et al v Anthropic, filed in US District Court in Nashville, TN. What a genius move to prompt Claude to compose a song about the death of Buddy Holly, and how poetic that it did indeed reply with the lyrics to American Pie (“Here’s a song I wrote…”)

[2] See the TWIML Podcast interview with Sara Hooker of Cohere for AI, at https://twimlai.com/podcast/twimlai/multilingual-llms-and-the-values-divide-in-ai/

[3] See Sebastian Raschka’s excellent Substack, issue of 23 October 2023.

Ahead of AI

AI and Open Source in 2023

AI and Copyright