How Big Tech is exploiting existing copyright exceptions

and other developments of the week ended October 8th

Oct 08, 2022

A weekly email update on developments in the world of large language models (LLMs) and image (and just in the last two weeks, video) synthesizers, with a specific focus on the question of how the legal uncertainty around these models will be sorted. For more background, please read here.

Last week we mentioned Meta’s new Make-a-Video model that creates short videos based on text prompts. Such is the new awareness of the issues involved in these models that it took only a few days for someone to start to dig into the sources on which this model was trained. According to Andy Baio, the model was trained on 10.7m videos scraped from stock imagery site Shutterstock (the WebVid-10M dataset), and 10m video clips copied by Microsoft Research Asia from YouTube. The videos on the Shutterstock website are for sale, but are publicly available in lower-resolution, watermarked versions. These were the versions that were scraped, watermarks and all.

Baio read the academic paper by the folks who compiled the first dataset to find that they were aware of the copyright implications, and did the copying under the UK data-mining exception: “The use of data collected for this study is authorised via the Intellectual Property Office’s Exceptions to Copyright for Non-Commercial Research and Private Study.”

So the dataset is compiled by researchers, but then it is used by Big Tech to create the models that are transforming the whole field of creative work. As Baio headlined this week’s must-read:

AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability

Baio and associate Simon Willison had a week earlier published their findings from exploring the LAION data used to train image synthesizer Stable Diffusion, the subject of last week’s newsletter. Given the enormous size of this dataset, their tool explores only 2% of the image data. This small subset, not a random sample, shows that images came from a variety of domains hosting copyrighted material, including a million images from pinterest.com. Blog hosting sites were an important component of this, as were shopping sites and stock image sites, including Shutterstock and Getty Images. The more complete analysis, including by artist, is here.

“It’s become standard practice for technology companies working with AI to commercially use datasets and models collected and trained by non-commercial research entities like universities or non-profits. In some cases, they’re directly funding that research.” - Andy Baio

LAION’s computing time was sponsored by Stability AI, the company which then turned around and raised a rumoured $100m in venture capital on the basis of the success of the model.

The commercial/non-commercial distinction was always going to be difficult to define and defend. But it is clear that Big Tech companies are actively pushing the data-gathering work to third-party non-profits to avoid responsibility. The Common Crawl dataset seems the paragon of these efforts: the foundation behind it is chaired by ex-Googler Gil Elbaz, with well-known Silicon Valley figures like Carl Malamud and Nova Spivack on the Board.

So I guess that explains why my digital art has watermarks…?

In an adjacent area of cyberspace, a well-known digital artist was wondering why the images he was generating were showing watermarks:

Mario Klingemann 🇺🇦 @quasimondo

Does anyone know if there is a site that uses this type of watermarking? Or is this some artist's signature style? Just got a whole bunch of them and quite like the principle, but would not want to plagiarize out of ignorance. #stablediffusion

UK House of Lords hearing on the future of the creative industries to be addressed by a chatbot … this is being hyped as the first robot to address the House of Lords

It will evidently answer questions on whether AI is a threat to creative industries.

It is exactly this kind of stunt that is obscuring the real debate on these matters. Leave it to the Lords to fall for this sort of thing…

If they don’t want to talk to a bot named Ai-Da, the Lords could chat with Musk, Xi, Trump or Oliver Wendell Holmes

The Washington Post has a useful piece on the launch of Character.ai, a “create your own chatbot” service launched by former Googlers. The chatbots are built on LLMs and can be trained to take on certain voices or tasks. In the wee jetlagged hours I created an Oliver Wendell Holmes Jr. chatbot, if you’d like to meet him (though you will need to register, etc.).

The piece quotes Timnit Gebru, and this describes well the experience of putting together this newsletter over the last few weeks:

The speed with which industry fascination has swerved from language models to text-to-3D video is alarming when trust and safety advocates are still grappling with harms on social media, Gebru said. “We’re talking about making horse carriages safe and regulating them and they’ve already created cars and put them on the roads,” she said.

And in other news…

Springer Nature announces that it is trialing AI-powered editing tools

Well, so is NUS Press, but we didn’t make a big noise about it! ;-) The narrative from Springer Nature continues to reinforce the conceptualisation/instantiation divide: the idea that coming up with the big ideas is the important and creative thing, and the rest is grunt work. According to the press release, the new tool will allow authors “to spend less time preparing their work for publication and have more time doing the research that drives society forward.” In the humanities, “preparing the work for publication” is the research.

Article draws attention to the use of image synthesis to enable fraud in scientific publication.

“such advanced generative models threaten the publishing system in academia as they may be used to generate fake scientific images that cannot be effectively identified. We demonstrate the disturbing potential of these generative models in synthesizing fake images, plagiarizing existing images, and deliberately modifying images. It is very difficult to identify images generated by these models by visual inspection, image-forensic tools, and detection tools due to the unique paradigm of the generative models for processing images.”

AI and Copyright