Licensing deals are proceeding apace now. Dow Jones, Le Monde, The Financial Times, El País. Is this a sad and dramatic “caving in” by desperate news publishers, “extraordinarily distressing”, as The Information founder Jessica Lessin argued last week in The Atlantic? It’s a great piece that sets out the history and the reasons for thinking that licensing their content to AI companies will be disastrous for news publishers.
Then five days later OpenAI announced that it had signed a licensing deal with The Atlantic… which surely added to Lessin’s distress.
But the picture may not be quite so extraordinarily distressing. Maybe not quite so… In licensing recent and future content, have the publishers given LLM builders absolution for the original sin of training the foundation models? Or have they set the precedent that proves licensing should have been done in the first place?
What’s for sure is that it’s in everyone’s interest to be more precise about these deals, as we try to understand the role LLMs are starting to play in our media ecosystem. We have to be a lot more educated about the different activities that fit under the rubric of “training the models”. And to the extent possible we should push for more information on the deals to be made public. Hiding the nature of the deals behind the veil of NDAs is classic “divide and rule” stuff. And you can guess who Caesar is here…
Part of the issue is that even the last two years of AI development have broadened the range of uses of publisher content that can be licensed. I only hope the full range of possibilities is well understood by media companies as they enter into their new contracts.
What is “training”?
The term hides a multitude of sins (and some virtues too).
They call it an original sin for a reason. We start with model pre-training: the use of huge datasets that the foundation models copy, partly abstract and partly reproduce outright, in their parameters. State of the art has moved from about 10 billion tokens in 2018 (to train Google’s BERT) to 300 billion tokens in 2020 (to train OpenAI’s GPT-3) to 1.4 trillion tokens to train DeepMind’s Chinchilla (which included four million works in the “Books” dataset and 1.1 billion news stories in the “News” dataset, precise origins never explained (but, duh…)). In 2023 and ‘24 we’re at the point where models are being trained on 15 trillion tokens (let’s say 10 trillion words). Meta’s Llama-3 is one of these.
This is a lot of data, yes, but in the case of the webscrapes it is possible to track it all back to specific URLs. I made this argument last week at the World Expression Forum in Lillehammer, Norway, and was bolstered in this view by meeting and hearing people like Christo Grozev of Bellingcat and Nataliia Romanyshyn of Texty.org.ua, who are active in data investigations, understanding what is going on in massive datasets, in the interests of investigative journalism and of debunking Russian disinformation. But that is a posting for another day…
To recap, American tech companies tend to claim that this training was fair use, a hell of a gamble for reasons discussed here. European companies have recourse to a TDM exception for material that has not been opted out, but more and more publishers are opting out, and that exception only came into force in 2021 under the European digital single market rules. Any commercial use of copyrighted content webscraped before 2021 is illegal. One thing about introducing an exception is that it makes crystal clear that there was a right that requires an exception to avoid infringing.
And then there’s the complicated question of jurisdiction. An American company scraping European content from a Singapore IP address? Where do you start?
Have any of the companies who have signed license deals given a retrospective license for webscraping (or even use of pirated material) that occurred before the license negotiation? Have any of the deals said “go ahead and use this latest material to pre-train your next generation model, and by the way, we forgive you for copying the older material too”? I hope not, but we simply don’t know.
From RAG to Riches
We do know that the deals involve licensing content for retrieval-augmented generation (RAG) and related architectures. Publishers, you should understand this architecture well. It’s easy to implement, and it is easy to license! And not only tech companies, but your customers and readers are already starting to use it with your content. Whether and where it will turn out to be a viable business model for publishers is another question, but it is getting easier and easier to try it out.
In RAG, when a query comes in, the initial model prompt (“tell me about X…”) is used to call up specific pieces of content from a datastore, those most relevant to the query. This is the “R for retrieval” in RAG. In at least the most basic implementations, the relevant chunks of material are then pasted into a prompt and sent back to the model, something like “using these ten chunks of content, answer the user’s initial query to tell them about X…”. The model then Generates an answer Augmented by the content Retrieved. The good news from the publisher’s perspective is that the system can record every piece of content retrieved to answer the question. Also important from a legal perspective: specific pieces of content are copied in plain text into the prompt. The content is used to improve and extend the natural language prompt, not to change the behaviour of the base model.
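To make that retrieve-then-prompt loop concrete, here is a minimal sketch in Python. The keyword-overlap scoring is a toy stand-in for the embedding-based vector search a production datastore would use, and the chunk texts, the retrieve helper and the prompt template are all illustrative, not any vendor’s API.

```python
# Minimal RAG sketch: retrieve the chunks most relevant to the query,
# then paste them into the prompt sent to the model.
# A toy keyword-overlap score stands in for real vector search.

DATASTORE = [
    "The 2021 TDM exception lets EU rights holders opt out of text and data mining.",
    "Llama 3 was pre-trained on roughly 15 trillion tokens.",
    "RAG pastes retrieved passages into the prompt instead of changing model weights.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank chunks by how many words they share with the query (stand-in for embeddings)."""
    q_words = set(query.lower().split())
    return sorted(
        DATASTORE,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """The 'A' in RAG: augment the user's question with the retrieved text."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Using only the following passages, answer the user's question.\n"
        f"{context}\n\nQuestion: {query}"
    )

query = "How does RAG use retrieved passages?"
print(build_prompt(query, retrieve(query)))
```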
Model developers like RAG for a few different reasons:
because it offers a way to update model responses (though not the underlying models themselves),
because RAG architecture cuts down on hallucination, and
because it allows models to link to sources of the information used in responses (good for some business models, not all!)
In the most RAG-appropriate use cases, the prompts now include all the information (or logic) needed to answer the question. The model is only used to rephrase and recast that text into a form that fits the way the user has asked the question. If your datastore has chunks of data scraped from the web, with the relevant URL attached, then you can also include a hyperlink with your answer. This is the sort of RAG-of-the-Web that companies like Perplexity.ai are shaping into really useful products, albeit ones that are unlicensed and are upending the economic logic of web search.
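Carrying the source URL through is a small extension of the sketch above: keep it as chunk metadata and emit a sources footer the interface can render as links. The field names here are my own, not any product’s schema.

```python
# Chunks that carry their source URL can be surfaced as links in the answer.
chunks = [
    {"url": "https://publisher.example/tdm-rules",
     "text": "EU publishers can reserve their rights against text and data mining."},
    {"url": "https://publisher.example/rag-explainer",
     "text": "RAG answers are grounded in passages retrieved at query time."},
]

def cited_context(chunks: list[dict]) -> str:
    """Number each passage and keep a 'Sources' footer the UI can turn into hyperlinks."""
    body = "\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    sources = "\n".join(f"[{i + 1}] {c['url']}" for i, c in enumerate(chunks))
    return f"{body}\n\nSources:\n{sources}"

print(cited_context(chunks))
```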
As Google has learned recently, LLMs are not always so tractable! Things are rarely simple with complex probabilistic models, and RAG answers will also tend to implicate the foundation model, sometimes inducing hallucination, especially if the retrievals are not super-relevant to the initial query. Don’t put glue on my pizza!
RAG is great to license (or develop products on)
You know you are looking at a RAG deal when it involves links back to publisher websites, or “brand exposure”, as with OpenAI and The Atlantic or the Financial Times. Many of the reports on deals talk about a variable component, linked to how many times content is found to be relevant and fed into the LLM prompt.
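That variable component is straightforward to account for, precisely because every retrieval can be logged. Here is a sketch of the kind of tally that could sit behind a per-use clause; the log format and the rate are invented for illustration, not taken from any actual deal.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical retrieval log: one entry per chunk pasted into a prompt.
retrieval_log = [
    {"url": "https://ft.example/markets/story-1"},
    {"url": "https://theatlantic.example/ideas/story-2"},
    {"url": "https://ft.example/tech/story-3"},
]

# Count retrievals per publisher domain: the usage number a variable fee could be pegged to.
per_publisher = Counter(urlparse(entry["url"]).netloc for entry in retrieval_log)

RATE_PER_RETRIEVAL = 0.002  # invented illustrative rate
for domain, n in per_publisher.most_common():
    print(f"{domain}: {n} retrievals -> ${n * RATE_PER_RETRIEVAL:.3f}")
```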
Whether the agreements allow content in the datastore to be used for training the next round of foundation models, we don’t know. Did I say that already? Even when reports say that models are given permission to train on publisher content, the tenor of the reports shows the journalists are not distinguishing between pre-training and retrieval for RAG.
“The original training was a violation of copyright and requires compensation, and any discussion about future use comes after a recognition of that fact.” -Vivek Shah, CEO of Ziff Davis
I should add that you don’t have to be a model builder, or even a tech company, to license content for a RAG application. You can also build one with your own content, and just access the LLM via API. Corporate intranets were one of the first areas to get the RAG treatment, and at home in Singapore we are seeing university lecturers firing up RAG-powered chatbots for their students that use content based on course notes and Open Access readings. They are keen to license the textbooks and copyrighted readings too.
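Wiring that up really is just a prompt plus one API call. Below is a minimal sketch against a chat-completions-style endpoint; the request shape follows OpenAI’s public chat completions API, but the prompt and model name are placeholders, and any provider with a similar interface would do.

```python
import os
import requests

# Send a retrieval-augmented prompt (e.g. built from course notes) to a hosted chat model.
prompt = (
    "Using only the lecture notes below, answer the student's question.\n"
    "Notes: <retrieved chunks go here>\n"
    "Question: What is covered in week 3?"
)

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # placeholder; any chat-completions model works
        "messages": [{"role": "user", "content": prompt}],
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```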
Then at the other end of the spectrum there are all of the “talk to your pdfs” applications, which allow users to put content into RAG datastores, uploading, for example, pdfs they’ve downloaded from their library. Adobe has evidently already launched a “chat with this pdf” function in Acrobat.
In many different application spaces such uploading, whether by users or businesses, will start to impinge on copyright owners’ interests. For example, the barriers OpenAI has put in place to stop users uploading copyrighted content to create personalised GPTs offered to the public seem inadequate, to say the least… (based, for instance, on the Alden Global Capital newspapers’ lawsuit against OpenAI and Microsoft).
And then, there is everything in between…
There are many other purposes to which data, structured and unstructured, can be put in the emerging world of LLM applications, RAG or otherwise. There’s all the reinforcement learning and fine-tuning that models need to go through to be useful for different purposes and domains. Both structured datasets and unstructured data with certain qualities will be important. My favourite example of this sort of thing is the way old pros Gordon Crovitz and Steven Brill licensed NewsGuard’s “top 40 fake news narratives” to Microsoft so that they could align their models against repeating them.
As the models get more and more intricate in their interconnections, as we figure out what the heck to really use them for, other licensing opportunities will start to appear. Personally I’m spending time now trying to understand the benchmarks that have been developed for assessing model performance, especially as our goals for models become more diffuse, abstract and multidimensional. At some point this will be publishers’ business too, or should be if we get moving. Licensing content to help develop better benchmarks. Exams for robots. Reference works. Certification.
Still, that’s the future. We can’t forget how we got here. I find I’m rather of the same view as Jessica Lessin and Vivek Shah, CEO of the digital publisher Ziff Davis: “The original training was a violation of copyright and requires compensation, and any discussion about future use comes after a recognition of that fact.” (As quoted in the Wall Street Journal.)