For those with promises to keep...
and yes, many hours without sleep trying to understand the EU's AI Act, which has just hit a new milestone
Welcome new subscribers! Especially to those who joined after the panel on AI at the IFFRO Asia Pacific meeting, hosted by Michael Healy, or the webinar I co-hosted last week with Jessica Sänger, for members of the International Publishers Association.
This newsletter focuses on the uncertainty around the copyright status of foundation models, and also pokes around at other AI-related questions. My current view on the copyright side of things can be found at “Good AIs copy, great AIs steal,” a paper I gave at the Asian Pacific Copyright Association annual meeting in October.
In the last newsletter (some time ago now…), I wrote about liability, responsibility and large language models. Who will be held responsible for the crazy things that large language models say? It seemed very clear to me that the big LLM developers were keen to avoid responsibility for the output of their systems, a view I was drawing mostly from position statements from OpenAI.
That article was kind of fun to write, and hopefully a bit of fun to read. This one deals with regulation and European law-making, and so, well, joy is removed, brevity forgotten and footnotes are added. I set out to write a short descriptive piece to cover the news promised by this interesting headline, “EU proposes new copyright rules for generative AI”, but got dragged into finally trying to understand the EU lawmaking process. I hope regular service can be resumed shortly. So did the EU propose new copyright rules? Well, only a little… read on (I’m really selling it…)
And please, my many European readers, don’t hesitate to correct me where I have this wrong.
When writing the last newsletter I hadn’t taken on board how much that very question of liability was a key element of the hot behind-the-scenes debate around the EU’s forthcoming AI Act, a debate which, according to news reports this Friday, 28 April, now seems largely settled. But before we look at the “bargain”, some background.
the EU’s AI Act
the background
First of all, the players, and a bit about the process: The European Commission is charged with coming up with draft legislation and detailed position papers. Then the Council of the EU, the political body representing the 27 member states, and the European Parliament, made up of directly elected Members of the European Parliament (MEPs), each work on the text. The Council announced its Common Position on December 6, 2022, and the Parliament has now apparently just reached its own deal, although that is still subject to votes, the first in committee on May 11th, the second a plenary vote in mid-June. After that comes the “trilogue”, to finally align all three institutions around a text. Things can sometimes happen in trilogue.
The EU’s Act seeks to regulate AI based on its potential to cause harm (to individuals, not the social fabric). It does this by focusing on the particular uses of AI models, and adding different levels of safeguard for AI used in higher-risk areas. It betrays its age, as an effort that started formally in 2018, in being better targeted at the “old-fashioned AI” of supervised learning and very narrowly focused machine-learning models.1
How could, or rather should, this approach be extended to foundation models, with their general utility as engines of plausibility and their huge array of possible deployments via API and various kinds of fine-tuning? In April 2021, when it published its proposal for the Act, the European Commission seems to have decided it could keep its eyes on the end uses, and didn’t need rules for models that could be used for so many purposes.
The 2021 proposal looked to ban those specific uses of AI that had unacceptably high risk, for example the use of social scoring systems, and facial recognition in anything but national security and defence settings. (A rock is thrown at the Black Mirror!) Uses of AI in education, employment, justice, immigration and law would be considered high risk, and would require conformity to strict rules before being implemented.
But the Large Language Models did not go away. In the words of Euractiv, a news source that has been on this story in detail, if not always clarifying detail: “The meteoric rise of ChatGPT has brutally disrupted the debate, leading to delays.”
In May 2022, the Council, then under the French Presidency, decided that LLMs were “general-purpose AI” and had to be covered in the Act, given “the nature and complexity of the value chain for AI systems…[and] due to their peculiar nature and in order to ensure a fair sharing of responsibilities along the AI value chain, such systems should be subject to proportionate and tailored requirements and obligations under this Regulation...”
This kicked off a great deal of debate in the European Parliament (and Council). There were evidently many MEPs who argued that since “general purpose AI systems” could be used in high-risk ways, they should be regulated according to the high-risk standard. This possibility then excited intense lobbying efforts by Big Tech companies to ensure that their models were kept as far from the risk framework as possible. Said Google:
“We would suggest to clarify that when other actors in the value chain modify a general-purpose system in a way that makes it high-risk, they should assume the responsibilities of a provider, and that the developer of the general-purpose system is not a provider under the AIA”2
That quote, from a document obtained under a Freedom of Information request, and much of what follows, comes from the excellent report by Camille Schyns, The Lobbying Ghost in the Machine, Brussels: Corporate Europe Observatory, February 2023.
Similar statements followed, from Microsoft, the Business Software Alliance, and others. Particularly odd was support for this position from organizations ostensibly speaking for startups, given that it shifted liability 100% from Big Tech to the many startups using LLMs via API. This oddity has led to suspicions of astroturfing by Big Tech, and a formal complaint on the matter was lodged by two MEPs.
Also notable was the direct involvement of the US government, via an unsigned “Non-Paper regarding the September 2022 Revisions to the Draft EU Artificial Intelligence Act”:
“Requiring all general purpose AI providers to comply with the risk-management obligations of the AI Act would be very burdensome, technically difficult and in some cases impossible. General purpose AI suppliers may have limited visibility on the subsequent use of their general purpose AI system, the context in which the system is deployed and other information necessary to ensure compliance with the iterative risk management obligations required for High-Risk AI systems”
so, to the deal
Well, first I have to say that reports of the Parliamentary-level deal are not based on a published text (at least as far as I can find). We have various spokespeople telling us a deal has been done, but for the details we are relying on summaries by reporters who have seen agreed drafts. Some elements are a little murky.
Still, the overall shape of the GPAI compromise is clear enough, as already set out by the Council in December. GPAI will only have to comply with some of the responsibilities that apply to high-risk uses, and will also have its own separate set of responsibilities. One of these responsibilities provided the headline for the Reuters story that caught my eye, “EU proposes new copyright rules for generative AI”. Not sure I would have gone that far, but the story points to a requirement that the creators of foundation models would have to make publicly available a summary disclosure of the use of training data protected under copyright law. The Reuters story gives additional detail: “Some committee members initially proposed banning copyrighted material being used to train generative AI models altogether, the source said, but this was abandoned in favour of a transparency requirement.”
Here are some of the other points as reported by Euractiv:
General purpose AIs will not be regulated, except for foundation models… Hard to say whether the Act intends to divide its angels onto the heads of two different pins here, or whether the reporters have scrambled it a bit. My money is on the latter.3
A “downstream economic operator” would become responsible for complying with the AI Act’s stricter regime if they substantially modify an AI system, including a general purpose AI system (or a foundation model?), to use it in ways that would qualify it as a high-risk model.
There follow a set of rules and a call for model contracts that are intended to ensure that the foundation model provider is either providing downstream players with the data they need to fulfill obligations in high-risk situations, or can “restrict the service so that the operator can comply with the AI rules without further support.”
What must the foundation model providers do on their own? It’s a long list, but in addition to providing (summary) transparency on copyrighted training data, these points seem key:
The providers will need to document a testing regime “and show how to mitigate reasonably foreseeable risks to health, safety, fundamental rights, the environment, democracy and the rule of law. That testing will require the involvement of independent experts.”
They will need appropriate levels of interpretability, corrigibility and cybersecurity, as established by independent experts and testing.
Generative AI foundation models must comply with further transparency obligations and implement adequate safeguards against generating content in breach of EU law.
One last minute addition to the compromise draft, according to Euractiv, applies specifically to generative models, and says they “would have to be designed and developed in accordance with EU law and fundamental rights, including freedom of expression.” Twitter could not explain what this means…
How will compliance be tested and understood? The Commission are the ones who will have to get into the details, and according to much of the commentary, they are expecting the European Committee for Standardization (CEN) and the European Committee for Electrotechnical Standardization (CENELEC) to draw up the technical standards that “are expected to define when and how AI systems will respect (among other things) fundamental rights.”4
What happens next?
The amended text still has a committee vote to get through on May 11th. Then it goes to plenary in mid-June, following that we have the “trilogue”.
Right now it is rather hard to pin down the full implications of the proposals, in their various summary statements and bureaucratic language. But now I (and hopefully you) have a kind of overview of the situation.
Then on 4 May we will have a hearing in San Francisco on Microsoft’s motion to dismiss the Copilot case (J. DOE 1 et al. v. GitHub, Inc. et al.). I’ve only just been dipping into the filings, and I don’t have much context on this sort of thing, but it looks pretty bad-tempered, as the plaintiffs are complaining that the defendants have been sharing information on the identity of the plaintiffs that was meant to be under seal, and that this has led to various sorts of threats against the plaintiffs.
Other news, and some good articles
Probably you saw the Washington Post piece, “See the websites that make AI bots like ChatGPT sound so smart,” which digs into another big public dataset, C4, and gives you a search tool. Sure enough, notorious book piracy sites turn up here as well! A much nicer version of what I did for the Books3 database (part of The Pile).
Tech platform companies are demanding that AI companies license their content for training rather than just scraping it. Twitter, Reddit and Stack Overflow have all made public statements. Yep…
Google is asked to submit its views on enforcement against digital piracy, and uses the chance to lecture Australia on why it needs a Singapore-style exception for text and data mining, without which “Australia risks only ever being an importer of certain kinds of technologies.”
Amazon launched a number of generative model tools, including a free code-completion tool, in partnership with Hugging Face.
Interesting piece from the Copyright Alliance Blog on How Existing Fair Use Cases Might Apply to AI
More on how unsuited LLMs are for internet search.
“Google employees label AI chatbot Bard ‘worse than useless’ and ‘a pathological liar’ - report”, from James Vincent at The Verge
and an academic paper that measures the performance of conversational search engines, those like AI-Bing that give citations for their answers: “on average, a mere 51.5% of generated sentences are fully supported by citations and only 74.5% of citations support their associated sentence.” Although interestingly enough, the paper was sponsored by Amazon Web Services…
And to conclude this looong email, a poem on language and who gets to have it and why, which I found in my Substack in-tray, from L. M. Sacasas on The Convivial Society.
“Their Lonely Betters,” W. H. Auden (1950):
As I listened from a beach-chair in the shade
To all the noises that my garden made,
It seemed to me only proper that words
Should be withheld from vegetables and birds.
A robin with no Christian name ran through
The Robin-Anthem which was all it knew,
And rustling flowers for some third party waited
To say which pairs, if any, should get mated.
Not one of them was capable of lying,
There was not one which knew that it was dying
Or could have with a rhythm or a rhyme
Assumed responsibility for time.
Let them leave language to their lonely betters
Who count some days and long for certain letters;
We, too, make noises when we laugh or weep:
Words are for those with promises to keep.
Lawyer Danny Tobey goes over this ground in an engaging episode of Craig Smith’s Eye-on-AI interview podcast, April 28. He put it this way: “they did a pretty good job over the last couple years putting this regulation together and then ChatGPT comes out and flips the apple cart over.” I was hoping to hear Tobey weigh in on the copyright question, but he has a problem giving an opinion on that side of things: “We have content creators [as clients] who are worried about generative AI using their content to create new content. And how does that get traced back to the original? But then we have very well-known generative AI clients who are thinking about it from the other perspective of fair use.” These matters are being discussed…
Google’s “Feedback on general purpose AI systems” to the European Commission, 11 July 2022, obtained under a Freedom of Information request by Corporate Europe Observatory.
See for example this scrambled explanation, from Euractiv’s Tech Brief of 23 March, 2023: “This week, the technical discussions on the AI Act focused on General Purpose AI. The direction is taken is to distinguish ‘true’ GPAI models from foundation models based on the training datasets. GPAI is considered with unlabelled data that need further training by the provider, such as algorithms developed to recognise skin cancer. By contrast, foundation models, as defined by Stanford University, include ChatGPT and Stable Diffusion and have labelled data. Whilst discussions on the obligations are still ongoing, a possible compromise taking shape is to have a stricter regime for foundational models and more basic obligations for GPAI.” This seems to be the compromise reached, but the distinction needs work!
See Clément Perarnaud’s blogpost at the Center for European Policy Studies (CEPS), With the AI Act, we need to mind the standards gap.