AI and Copyright

We’re all going to poop rainbows!

Peter Schoppert
Oct 27, 2022
Shutterstock announces a payment path for artists whose works have been used to train AIs...

The first news we have that copyrighted material was indeed licensed to train one of the big Foundation Models (in this case DALL-E) comes from the press release announcing the new Shutterstock–OpenAI partnership, which includes a path to paying artists whose work was used in training.

From the press release here:

“The data we licensed from Shutterstock was critical to the training of DALL-E,” said Sam Altman, OpenAI’s CEO. “We’re excited for Shutterstock to offer DALL-E images to its customers as one of the first deployments through our API, and we look forward to future collaborations as artificial intelligence becomes an integral part of artists’ creative workflows.”

and

Shutterstock believes that AI-generated content is the cumulative effort of its contributing artists. In an effort to create a new industry standard and unlock new revenue streams for the Company’s artist community, Shutterstock has also created the framework to provide additional compensation for artists whose works have contributed to develop the AI models. The Company also aims to compensate its contributors in the form of royalties when their intellectual property is used.

I’m not sure exactly what legal copyright position is implied by “AI-generated content is the cumulative effort of its contributing artists”, but it’s a point many artists would agree with.

Questions:

  • Remind me to check the licensing terms for selling images on Shutterstock. Will Shutterstock artists be given an opt-out for future model training?

  • Is this a bid to reduce the legal uncertainty around the troublesome “fair use” claim around training data? Or a bid to recruit artists? Or both?

  • So how exactly did OpenAI train DALL-E 2? What other datasets were used, and can we expect more such deals in future? Were Shutterstock images licensed to create the model? Or to tune it?

  • If Shutterstock images were licensed, couldn’t the datasets used to build (or tune, or refine) other models have been licensed too? (See next point on Stability’s business model)

  • What does this mean for model-builders who used huge amounts of copyrighted text to train their models without a license?

This steals a bit of the thunder from Emad Mostaque’s almost-announcement that Stability.ai saw the artists’ claim on training data as legitimate. See the next section.

Stability AI comes out in Style

The Stability coming-out party at the Exploratorium in San Francisco, after the announcement of their US$101m Series A funding round, seems to have been quite the event. I haven’t yet watched the video (available here), but one quote from Stability founder Emad Mostaque was captured in the New York Times report:

“So much of the world is creatively constipated, and we’re going to make it so that they can poop rainbows,”

Er, OK! Let’s not put that in an image-generation prompt, thank you very much.

In addition to the NYT story, there’s an interview that Mostaque recorded with Kevin Roose and Casey Newton. The interview is well worth a listen for many reasons (what’s your generative-content strategy?), but let’s start with the legal uncertainty question that is this newsletter’s focus. Mostaque did not address it directly, but he did seem to acknowledge an obligation to the creators whose works had trained the model. He was asked about the concerns raised by artist Greg Rutkowski (covered in this newsletter on October 1st).

So I think this is a very valid thing, I mean the fears and concerns around that. So like I said, it was trained on 2 billion images. It was a snapshot of Google Images, basically. So anything you could find through there, you can find through here.

And then it learns from the relationship between words and images. That’s why you need a huge frickin’ supercomputer to kind of do that. So it learns principles. You can’t recreate any image from the data set, but it understands what a cup is or what Greg is and other things.

In fact, the interesting thing is our data set didn’t actually have very many pictures of Greg Rutkowski. Instead it was another model that was released by OpenAI, which had much more of his. We don’t know because we don’t know what the data set is that was part of this that introduced the concept of Greg into this model.

But at this point in the interview (around 41:33 into the podcast), Roose interjects with an annoying (to me) fit of techbro giggling that allows Mostaque to escape without having to say what he means in more detail.

Newton attempts to get back on track, but his follow-up question is too general, and this allows Mostaque to pivot to a broader point about “new artistic tools don’t mean the end of art, photography didn’t kill painting…”, which is all well and good but avoids the copyright / moral rights issue that Mostaque was addressing moments before. See the transcript here.

Evidently Mostaque made further comments on the possibility of compensating artists in a “fireside chat” with Elad Gil, but I haven’t been able to track that down. In any case, I would not be surprised to see Stability make a statement in this area soon.

The Mostaque interview is well worth a listen/read, and I’ll be covering different aspects of it in future notes. He lays out more of the Stability business strategy, which includes customized versions of the Stable Diffusion model, fine-tuned under license on specific branded content, as per this tweet (as a proof of concept; I’m not sure @Nitrosocke licensed her training data):

AK (@_akhaliq), Oct 26, 2022:

Spider-Verse Diffusion: fine-tuned Stable Diffusion model trained on movie stills from Into the Spider-Verse by @Nitrosocke @huggingface model: huggingface.co/nitrosocke/spi…
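As a technical aside: consuming one of these fine-tuned checkpoints is straightforward with Hugging Face’s diffusers library. Here’s a minimal sketch; the repo id and the “spiderverse style” trigger phrase are my assumptions (the link in the tweet is truncated), so treat both as illustrative:

```python
# Minimal sketch: generating an image from a style-fine-tuned
# Stable Diffusion checkpoint hosted on the Hugging Face Hub.
# The repo id and trigger phrase below are assumptions, since the
# tweet's link is truncated; substitute the real values as needed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "nitrosocke/spider-verse-diffusion",  # assumed repo id
    torch_dtype=torch.float16,
).to("cuda")

# Fine-tunes like this typically bind the new style to a trigger
# phrase included in the prompt.
prompt = "a city skyline at dusk, spiderverse style"
image = pipe(prompt).images[0]
image.save("skyline.png")
```

Which underlines the point relevant to this newsletter: once the fine-tune is done, the training stills are baked into the model weights, so the licensing status of that data matters for every image the model generates.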

CNN journalist Rachel Metz on the artist training question…

This is a very nicely put-together story. I gave her this feedback…

Peter Schoppert (@katong), Oct 26, 2022:

@rachelmetz Great write-up. I would not be sure that fair use covers the unauthorised copying that created training data. The fairness of fair use must be based on the facts of the case, tested against a number of factors, including the economic impact of the copying.

More journalism: coverage of the potential Copilot suit

In last week’s description of the impact of the Copilot suit, I missed a really good update from the UK’s The Register. Another CEO claiming a fair use defense for the training data:

When GitHub introduced a beta version of Copilot in 2021, and questions about copyright and licensing were raised, then-CEO Nat Friedman opined "training ML systems on public data is fair use [and] the output belongs to the operator, just like with a compiler. We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!"

If you’ve read this far in the newsletter, you must agree it is more than interesting; it’s downright fascinating!

There’s another killer quote in the piece:

When Berkeley Artificial Intelligence Research considered this issue back in 2020, the group suggested that perhaps training large language models from public web data is fundamentally flawed, given concerns about privacy, bias, and the law. They proposed that tech companies invest in collecting better training data rather than hoovering up the web. That doesn't appear to have happened.

Another tweet of interest

roon (@tszzl), Oct 25, 2022:

an individual human lifetime is more analogous to few shot prompt engineering of the pretrained human foundation than learning language or reasoning from scratch

(linking to the Wikipedia article on the Baldwin effect)

That makes me think there’s some good work to be done to figure out what LLM emergent properties tell us about the Chomsky vs Skinner debate on human language… But that’s a whole ‘nother realm…
