Pardon the gap in newsletters: I took a break in the Margaret River region of Western Australia and revelled in the natural world, from wallabies on the trail and flaming galahs and ringnecks to a close encounter with one of the last living colonies of stromatolites. But no quokkas!
Those are traces of me running up and down the beach at Gas Bay. Not many others around!
It’s the year end, and journalists are doing their “best of” stories. For those covering science and technology, even with long-awaited breakthroughs in nuclear fusion and the success of bespoke mRNA vaccines in treating cancer, the development of generative AI models is gathering the lion’s share of attention. The power of the models, the thrill one gets on first using the tools, the fact that much of the development is being done in a very open way, and the ever-present pull of the idea of Artificial General Intelligence all make for a potent combination, a heady mixture that seems to be turning the heads of many commentators. (It would turn mine too, but I’ve been here before, full of zeal for the Web in the days of the first Mosaic browser.)
For a perfect example, see the newsletter/column by The Atlantic’s Derek Thompson. He lists 10 breakthroughs of the year, and generative AI is number one. Thompson is clearly enamoured of the tools, his heading hailing “a new dawn in how we build”. The tools “seem to trace the outer bounds of human creativity”. He confidently expects “they will serve as assistants for those of us in creative industries” without a jot of consideration for the fact that they were built on the uncompensated labour and unauthorized copying of the output of those very creative classes.
The “copyright uncertainty” is not making it into mainstream coverage of these models. Why is this, when even the machine learning professors at Stanford teach their students that training of LLMs on copyrighted text may not be legal? (See the class notes for Stanford CS 324 Large Language Models.)
The answer seems pretty simple: the publishing industry has been silent.
I don’t understand why, and I think we are losing a big opportunity. I say this not because I think we should block the development of this technology (we can’t and shouldn’t), but because a) creators should be compensated for their contribution to this new technology and b) we should, for a whole variety of other reasons, be pushing for a greater level of consideration of these new technologies before vast sums of money are poured into start-ups to exploit them and they are deployed at scale by companies like Google, Microsoft and Adobe. (Except that it’s already happening…)
For a fresh consideration of the legal arguments, see former RIAA attorney Neil Turkewitz’s piece, based on documents from the ongoing lawsuit against GitHub, Microsoft, and OpenAI. Yes, he’s going after some easy targets in GitHub’s bizarre citation of “global copyright law”, but his analysis of the legality of copying under the current UK and EU exceptions is sharp and goes into greater depth than mine.
Aside from that ongoing lawsuit (see the complaint here), the only consistent voice raising the question of the legal status of the copying that trained the models is coming from professional visual artists, most prominent of whom in the English-language Twitterverse is Karla Ortiz, using the emerging hashtags #CreateDontScrape and #HumanArtists. Just in the last 24 hours they have organized a kind of demonstration on the popular artists’ site ArtStation. This (at least briefly) yielded the following visual on the homepage:
The artists have started to notice the silence coming from the publishing and media industries more generally, so far just in the odd tweet. But it will look increasingly strange that publishers do not have a position.
So what should media companies say, without being considered Luddite?
How about something like this:
The generative AI models are a remarkable scientific breakthrough, different in nature from the previous machine learning and automated text processing systems for which we have built a business licensing our content. We are actively exploring the potentials and pitfalls of these new tools, while continuing to license content for other forms of machine learning.
As an industry that has evolved well-tested and robust institutions for ensuring the quality and reliability of content, we are highly concerned about the potential risks raised by generative AI, including the risk of spreading false and harmful information while appropriating others’ voices: voices of authority, the individual voices of creators, and the voices of underrepresented minorities.
We note that published books and journals constitute a large portion of the text that the latest large language models were trained on, and that this includes not only material made available by publishers under license on the internet, but also material illegally placed online by pirates.
We disagree with the claim that the training of Large Language Models on copyrighted content is allowed under US fair use or the UK or EU exceptions for text and data mining.
The claim that models “read” and “learn” in the same way humans do is a weak metaphor that does not do justice to the way the models work, and should not drive policy or judicial decisions.
Whatever reliability as to facts the models do have comes from their having ingested reliable content from publishers.1
We note also that the latest improvements in the LLMs come from integration of the language models with systems to look up reliable reference information (WebGPT), to use human input to better align AI output with the intentions of AI users (InstructGPT), and to do more detailed tagging of reliable published information (Galactica).
The eventual deployment of these tremendous technologies for the public good will be best done in partnership with copyright industries and individual creators, both to compensate them for the fact that their works are represented inside the pre-trained models, and to involve them in the work of AI alignment and in providing the feedback that will help train models that are truly useful and beneficial.
OK, that’s a bit long. It’s hard to be brief.
Another matter to consider in this debate is that many of the most pro-AI folks, the ones pushing hardest for rapid development of the models, including some of the top scientists, very very clever people, really do believe that artificial general intelligence is on the horizon, somewhere between five and fifteen years away in the estimate of Dr John Schulman, cofounder of OpenAI. This is not just the usual techno-optimism; it is a nearly religious belief that all bets are off when machine consciousness emerges in the near future. So why worry about potential social harms now?
Next week: Battle of the metaphors: how do we describe what the foundation models do?
For a stark admission of this from one of the developers of OpenAI, see John Schulman’s interview on TalkRL: The Reinforcement Learning Podcast, also mentioned in the last paragraph above, at https://www.talkrl.com/episodes/john-schulman. Very jargony in parts, but also full of fascinating bits. The relevant part occurs around 22:14 into the interview.
“One way to think about what language models do is they're trained to imitate the whole internet. And the internet is written by lots of different people and has lots of different types of content from fiction to nonfiction to like technical, like detailed technical literature to like jokes and like forum posts, whatever.
“So the model is basically an ensemble of all these people who wrote stuff on the internet, the raw pre-trained model. When you feed it a prompt, what it's doing internally has to be something like figuring out who wrote this prompt and then trying to continue in that style.
“So if it thinks it's reading, just reading something on the Wall Street Bets Reddit, it's gonna continue on that style. But if it thinks it's in the New York Times, it's gonna write in a very different way. So effectively, the model must be calculating somewhere, like what style is this or what ensemble, what's the narrower ensemble of styles that I'm trying to imitate now. At the very least, when you do some kind of, when you do training like either supervised fine tuning or [RL] from human feedback, you can at least like narrow down the set of styles the model is producing and try to imitate like the best or the best person in the training set or the best style in the training set.”