The hunger for text continues. This week saw the launch of Meta’s Llama 3, in downloadable “open source” 8B- and 70B-parameter versions, with a teaser for the still-training internal Llama 3 model, which has 400 billion parameters. With Llama 3, the state of the art in data collection is now 15 trillion tokens (depending on tokenizers, the equivalent of some 10.6 trillion words of text). The Meta blogpost mostly attributes the high performance of its relatively small models to more, better data. Needless to say, transparency around what is in this enormous dataset is lacking. It is “collected from publicly available sources”. The standard phrase these days.
But we can safely say that’s a lot of text: the Google Books database of four to five million books was estimated in 2015 to be around half a trillion words.1 Still, the jump in size shouldn’t be too surprising by itself: we can estimate that each of CommonCrawl’s quarterly webscrapes (which CEO Rich Skrenta estimates cover 5% of the web each go-round) contains around three to four trillion words (leaving aside metadata in the WET files, etc).2
So there’s plenty of data still to extract text from. The constraint for some time has been how to filter and clean those webscrapes to get high-quality text that’s useful for the models. Here, Meta deployed a new tool to help filter high-quality text out of the big mass of undifferentiated webscrape: Llama 2, which proved “surprisingly good at identifying high-quality data”, was used to create the classifiers that helped filter the raw pages (already stripped of HTML etc. by CommonCrawl). Again, we see the acceleration that comes from bundling models and plugging them into each other, and also that the models are really good at analysing text (separately from generating it) — see my WIPO intervention below.
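A minimal sketch of this kind of model-based quality filtering. The real pipeline uses classifiers built with Llama 2; the `quality_score` heuristic below is purely a hypothetical stand-in for such a classifier, for illustration only.

```python
# Sketch of model-based quality filtering for webscrape text, in the
# spirit of the Llama 3 pipeline described above. Meta used classifiers
# created with Llama 2; quality_score is a crude heuristic stand-in.

def quality_score(text: str) -> float:
    """Score a document in [0, 1]; higher means 'keep'.

    A real pipeline would call a trained model here. This stand-in
    rewards reasonable length and lexical diversity, which catches
    the most obvious spam and boilerplate.
    """
    words = text.split()
    if not words:
        return 0.0
    length_ok = min(len(words) / 50.0, 1.0)   # very short docs score low
    diversity = len({w.lower() for w in words}) / len(words)
    return length_ok * diversity

def filter_corpus(docs: list[str], threshold: float = 0.2) -> list[str]:
    """Keep only documents whose quality score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "buy now buy now buy now " * 20,   # spammy: long but repetitive
    "The committee reviewed the draft treaty and proposed amendments "
    "covering exceptions for text and data mining in member states. " * 3,
]
kept = filter_corpus(docs)   # only the second document survives
```

The interesting design point, and what Meta reports, is that the scorer itself can be a language model rather than hand-written rules like these.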
From the blogpost announcing the new models (a research paper is promised eventually):
We also performed extensive experiments to evaluate the best ways of mixing data from different sources in our final pretraining dataset. These experiments enabled us to select a data mix that ensures that Llama 3 performs well across use cases including trivia questions, STEM, coding, historical knowledge, etc.
So there is not only filtering for “quality”, there is a process of selecting materials: an acknowledgement that a certain kind of quality, and diversity, is necessary, not just pure scale for scale’s sake. Though I have to admit there is plenty of scale here… And of course, to mix in such an intentional way, you must have some good idea of what you have. So two implications of that:
One, a strong implication that it is the expressive, particular aspects of text that copyright protects, its very quality, that are indeed what is valuable in the copying for AI training.
And two, more evidence that you can be transparent about exactly what content you have used…
Can the copyright conversation keep up? Well it is moving…
The World Intellectual Property Organisation Standing Committee on Copyright and Related Rights had its first official discussion of AI and copyright, and I feel very honoured to have been invited to contribute to this. See also the coverage in the excellent IPA Blog. With my face on a huge screen in a room in Geneva OMG:
But it was a stressful Thursday evening in Singapore! Let’s just say tech problems. AI and digital technologies don’t really matter when there’s no power… But I got back online about 15 minutes before my intervention was scheduled.
There were some surreal video effects, but I think they added to the experience as television. I gave my intervention, more cri de coeur from a conflicted practitioner than cool lawyerly analysis. But hey, as well as doing my best to analyse coolly with you, my gentle readers, I am a deeply conflicted practitioner who is keen to use the technologies, especially in the ways explained in my presentation.
I’m told the member countries generally found the sessions very informative and useful. Congrats to Michele Woods, Paolo Lanteri and all the rest of the WIPO team for putting it together. I didn’t catch all of the interventions, see tech problems above, but of those I did, I can say Duncan Crabtree-Ireland’s is worth watching. He’s from SAG-AFTRA, and made the point that his organisation represents creators as creators and workers, so intervenes on copyright questions but also questions of labour conditions. As hinted in last week’s newsletter, the issues are different, but also deeply related (as copyright is meant to enable the system of incentives for more, better, creation). He also makes the point that collective bargaining has proven a faster process than our other tools in figuring out how to deal with AI and creators’ rights so far.
Kai Welp’s intervention was also highly interesting, in very lawyerly fashion. He’s General Counsel, GEMA, Germany, a collective rights management group representing 90,000 music creators. Welp talked about the practical issues involved in implementing the opt-out for text-and-data-mining under EU and German law, including the fact that the platforms that hold his members’ works are not controlled by his members. He also made the point that compensation for use of content in training hardly begins to replace the loss in income from AI generated material, as its output will also displace and reduce the market for his members’ work downstream. Geoff Taylor of Sony Music also had a useful “ecosystem” approach to the issues.
And here’s mine…
Long-time readers will not find anything really new here, but for new readers, this sums up my views. Only the tiniest bit of editing from the WIPO transcript.
Mr. Paolo Lanteri: Good evening. I do not know whether you can see the room but it is full. You are intervening after a librarian. You are a publisher. Large language models are possibly the most developed tool within the AI landscape and certainly, for laymen like us, the most popular; we started using them largely, even here in this organization. And the question for you is really how publishers feel about the increasing capabilities of generative AI. Do you use them, do you use generative AI tools, and are they making your life easier or do they represent a threat, or both? Thank you very much.
Mr. Peter Schoppert: Yes. I am going to speak about a very deep ambivalence. So no clear-cut unidirectional messages from me. Yes, good afternoon, everyone. I am the director of the National University of Singapore Press, NUS Press. We publish books, journals and open access online resources, mostly Asia-related humanities and social sciences. We publish both for academics but also for the book shops, for the general reader. We believe it is important that university research finds its place in the marketplace of ideas.
Can I just say a quick thank you to WIPO. You have done so much for our Southeast Asian region over the last 10, 15 years. For the first time really now we have real publishing ecosystems that are beginning to work. We have newly vibrant publishers in Malaysia, Indonesia, Vietnam, Thailand publishing books in local languages, book shops, book streets in Vietnam that are just the most exciting places. And a lot of that has to do with WIPO’s work in supporting Intellectual Property and Intellectual Property protection. And just thank you to everyone in the staff and, of course, to the Member States.
So as a humanities publisher our great subject is language. Languages but also the products of language, whether that is oral or that is written down in text. So, of course, humanities scholars and their publishers are fascinated by new tools that allow us to manipulate, understand, model, classify, even predict text. But this enthusiasm for the latest and greatest in natural language processing, the LLMs, is tempered by a very deep disquiet. I mean, I am totally conflicted because I do not believe that the training of these models was legal. I do not believe that it was fair. I do not believe that it was done in the proper manner.
We have, you know, just in the region, when copyright is now being internalized by creators and by readers, we are getting this challenge to the system, which feels quite difficult. So, you know, one of the things here is that we have seen very clearly how the big tech companies have used pirated books to train their models.
I spent my adult life in Singapore. I am based in Singapore. I represented the Singapore book industry as head of the publishers association. Singapore has, as you may know, one of the most liberal text and data mining exceptions in the copyright law. But even under this exception in Singapore law, the training of Meta’s Llama or any model that uses Google’s C4, the Colossal Clean Crawled Corpus, would be illegal because it relies on notorious caches of pirated content. But even if we leave aside the egregious use of pirated material by big tech to train models, the use of “free-to-read-for-humans” webscrape content is also often in violation of copyright, of the terms of service of the websites from which it was scraped, and so on. I believe this is also not lawful.
And for those who would claim that this is fair use under American law, let me just mention three inconvenient facts.
One, the models routinely output exact or nearly exact copies of what they are trained on. That is because the content is inside the models. In interesting ways, in highly compressed ways, in ways which will open up so many interesting avenues for research and exploration. But the content is in the models and we see that when the models spit it out and reproduce it. Something that the companies have not been able to stop from happening so far.
Also, two, I believe the models are not extracting, pulling facts from a sea of expression as perhaps was the case when we came up with text and data mining exceptions five, eight years ago. They are actually modeling expression, protected expression. That is what the statistical models are creating. They are creating models of expression of how words relate to each other. These models are bad at facts, right? We all know that. They are good at style. They are copying expression.
And lastly and most importantly in this venue, the copying to train the models hurts those people whose works were copied in the first place.
Again, even in Singapore with our liberal exception, our top legal scholars argue that the Singapore exception would not allow the training of LLM models for this very reason, the three-step test.
So as a publisher, I am torn deeply between the desire to use these models and a sense that they are illegitimate.
How do we use them?
So what do we do? Actually, we are exploring in many directions and, of course, we are trying things and getting a feel for the models, learning what they can do. We now have a bunch of little software scripts, created with the help of the models, that help us with routine tasks; scripts we could not have written on our own before. But actually, we are focused very little on the generative capabilities of AI. That is much less interesting to us as a humanities press, right? What we are really interested in are the analytical capabilities of these models.
Search, for example: the models use similarity in the mathematical properties of text, vector similarity, to do search. And that is amazing. It is really good, a huge improvement over keyword searching. That could be a great help to research and to content ecosystems, if we make sure we are preserving the authorial integrity and research integrity of the text. If we use the models in the right way, they can help us draw connections between texts and find material much more easily. It is very exciting. But it requires that we protect copyright and protect integrity even more, so that we can use these tools.
Secondly, what we are working on very actively with many of our authors is coming up with ways that LLMs can supercharge development of the digital humanities. That is the ability to work with large archives of text and pull out insights from them that would be impossible for any one person to do. We are not interested in generating new text. We are interested in learning and understanding human output and human creation better. And, of course, these efforts will have their best effects if they are done side by side with the classic skills of deep reading and of close reading and of slow reading and of understanding deep human texts.
So I am deeply conflicted, but I do not think I need to be. I do not believe there is a contradiction, that somehow we have to accept copying without credit, compensation or, most importantly, consent in order to have the benefits. I think it is possible to build models that will respect the rights of creators. And I am really looking forward to getting to the point where we can do that. You know, the systems we have for moving from information to knowledge to wisdom, the importance of integrity, of authorship, of speech which is not just free but also socially accountable, these things are really important. It is the sort of thing that publishers have been working on—imperfectly—for literally hundreds of years. So I would rather stick with those institutions and make use of this amazing technology within that structure, especially copyright. Thank you.
Mr. Paolo Lanteri: Thank you very much, Peter. I forgot to mention that for you it must be, well, probably early morning of tomorrow, I do not know. So I do not know whether you want to go to sleep and we will put an alarm for the Q&A, but we really appreciate your intervention and your time, and you can let us know whether you will be with us later.
And for those of you still with us…
An excellent tweet from Ed Newton-Rex, formerly of Stability AI. I wish I had put this so well:
The AI Equivalence Fallacy: the argument that humans learn from copyrighted work without permission, so generative AI should be allowed to too. It's a rare day when I don't hear this argument. It ignores (i) the scale & effect of generative AI and (ii) the implicit social contract under which people have created and published works for centuries.
You think that’s mad, wait till you hear the argument that it’s an honour to have your work trained on without permission because AI is the cultural record of humanity.3
Pechenick EA, Danforth CM, Dodds PS (2015) Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution. PLoS ONE 10(10): e0137041. https://doi.org/10.1371/journal.pone.0137041
This is very rough! The latest CommonCrawl webscrape is 8.4 TiB = 9.2 terabytes of compressed text. GZIP can compress plain UTF-8 text by as much as 4:1, but let’s be a bit more conservative and say we have 20 TB of text in each scrape, with, say, 90% in single-byte scripts. That’s 18 TB at 6 bytes per word (five letters plus a space) = three trillion words. Let’s add the Chinese: say that’s 2 TB at three bytes per character (in UTF-8), which is 667 billion characters, and Chinese is quite high density compared to English, so let’s say roughly another 1 trillion “English-word equivalents”.
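The back-of-envelope arithmetic in this footnote can be reproduced in a few lines of Python; every input below is one of the footnote’s own rough assumptions, not a measured figure.

```python
# Reproducing the footnote's word-count estimate for one CommonCrawl scrape.

TIB = 1024 ** 4   # tebibyte in bytes
TB = 10 ** 12     # terabyte in bytes

compressed_tb = 8.4 * TIB / TB            # ~9.2 TB of compressed text
uncompressed_tb = 20.0                    # assumed conservative ~2.2:1 ratio

single_byte_tb = 0.9 * uncompressed_tb    # 90% in single-byte scripts = 18 TB
words = single_byte_tb * TB / 6           # 6 bytes per word incl. space
# -> 3.0 trillion words

chinese_tb = uncompressed_tb - single_byte_tb   # remaining ~2 TB of CJK text
chars = chinese_tb * TB / 3               # 3 bytes per UTF-8 CJK character
# -> ~667 billion characters, call it roughly another trillion word-equivalents
```

The point of spelling it out is how sensitive the total is to the assumed compression ratio and bytes-per-word figures; the three-to-four-trillion-word number is an order-of-magnitude claim, not a measurement.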
Actually we did hear that from a speaker in Geneva, but let’s pretend we didn’t.