The more I think about this…
Gentle readers, I must admit I’m still reading the Henderson preprint “Foundation Models and Fair Use”¹ (91 pages and full of fun stuff), and I haven’t yet turned to the EU draft AI Act, which now has its voting text published (less fun stuff).
And I should try to do a bit of preparation for a May 28th appearance on a panel titled “The Chatbot Made Me Do It”, part of the Asian Festival of Children’s Content here in Singapore, with author and Little, Brown Books for Young Readers editor Erika Turner, and NUS Vice Provost, law professor and YA author Simon Chesterman. If you are in Singapore, come along! This and a trip to Sumba next week mean I may be a bit slow on the newsletter front…
I cannot judge
OK, one bit of news I spent some time on: Judge Tigar has ruled on the motions to dismiss filed by OpenAI and Microsoft/GitHub in the GitHub Copilot case in the Northern District of California.
The judge rejected the broader arguments for throwing out the entire case. And even though the plaintiffs were told they had not demonstrated the sort of specific injury that would allow them to claim monetary damages for harm to their property rights, they were allowed to pursue injunctive relief.
Many of the defendants’ more specific motions to dismiss were granted, but in all but two cases the plaintiffs were invited to amend their claims in light of the judge’s feedback. Other claims survived untouched: those relating to two clauses of the Digital Millennium Copyright Act concerning removal of license information, and a common-law claim of breach of open-source license.
Some Twitterati see in the opinion a judge ready to write an injunction… indeed, one even guiding the plaintiffs in that direction. Others see a case in trouble. Remember, this has never been a copyright case; the plaintiffs chose not to bring a copyright claim. In fact, the judge did agree with the defendants and dismissed the claim of unjust enrichment because it should have been brought as a copyright claim.
The more I think about this…
I haven’t finished my second, more careful reading and annotation of the Henderson paper, but it continues to fascinate. I’m intrigued by the way the authors stick to their assertion that LLMs create output which is a transformative use of the material they were trained on (and therefore fair use), except, well, when they don’t. (The paper cites the key papers showing how models routinely regurgitate their training material.) The direction set by Henderson et al. is to try to find ways to “solve” that unfortunate tendency.
To be fair, the authors don’t downplay this problem; they just want to engineer it away. If we can patch up the output, then we don’t have to worry about the bigger problem that the models were trained without permission. The more I think about this, the more I wonder, “why not just license the training data?” Is that really so much harder than rewiring the models?
It gets even weirder: the authors stick to this line, yet open another huge door by raising a second situation in which model output would not be transformative, namely when the output narrows or harms the market for the content copied. Here’s the relevant paragraph:
“When the downstream product based on such a model is not transformative (e.g., model outputs similar content to copyrighted training data, or the application’s market is similar to original data markets), courts may decide that the generated content, the model deployment, and even potentially the model parameters themselves are not covered by fair use (Sobel, 2017).”
Henderson et al. cite the prescient Sobel paper, which raised all these issues five years ago… Ben, you saw the future…
There’s a fair amount of market harm about!
Nearly every issue of this newsletter going back to September has examples of market harm or narrowing, and it’s building with incredible speed. Here are a couple of recent cases:
“Stack Overflow is ChatGPT Casualty: Traffic Down 14% in March”
“In fact, on a year-over-year basis, [traffic to Stack Overflow] has been down by an average of 6% every month since January 2022 and was down 13.9% in March.” This is because “The new lazy-but-efficient coding trick is to prompt ChatGPT, CoPilot, or Bing Chat to write big chunks of code for you.”
And I’m sure the readers of this newsletter wouldn’t have missed the news that the rise of the chatbots is hitting educational publishers and edtech companies. Online study-aid site Chegg lost 48% of its market cap when it attributed its traffic declines to students using ChatGPT. (The excuse was worse than the bad news…) Pearson took a 12% hit the same day. I got this news from a Guardian article: “AI race is disrupting education firms – and that is just the start”.
Let me be clear: Stack Overflow is as liable as any of us to be disrupted by new products and better offerings. But the issue here is that the new product killing them was built by copying their material without their permission.
More clarity: I’m raising all these issues (and writing this newsletter) in part because I want to use these tools (as tools to help and empower humans, not replace them). I have lots of use cases for my existing publishing operation, and I see lots of other possibilities I’d like to explore. But I want to use them in good conscience, and I want to do so while respecting and protecting the best parts of the system of writing, publishing and reading that we have evolved over all these centuries. I don’t want to see a huge part of the value we’ve created (publishers, writers, librarians, booksellers, readers) eroded or simply transferred to tech companies as a reward for their audacity. The more I think about this, the more I wonder, “why not just license the training data?”
1. Henderson, Peter, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, and Percy Liang. ‘Foundation Models and Fair Use’. arXiv, 27 March 2023. http://arxiv.org/abs/2303.15715.