Really, Bloomberg?

Even media companies are training their models on pirated data... should I just give up?

Apr 01, 2023

There’s lots going on. The letter, the ban, the backsliding. See below. But first, some news from Bloomberg that is quite remarkable…

BloombergGPT(TM)

On March 30th Bloomberg announced the creation of its own Large Language Model, BloombergGPT(TM), trained on a combination of its own large cache of data and what it called “a 345 billion token public dataset.” At first I thought this was great news: a media company creating its own model from scratch, leveraging its own data. But then I saw that at least a quarter of the total text used to train the model came from our old friend “the Pile”, which includes Books3 (197,000 pirated ebooks) as well as other data crawled without permission and against terms of use from The New York Times, Guardian, LA Times, etc, etc. It also includes BooksCorpus, the ur-LLM dataset (along with Wikipedia), which is some 7,000 novels copied without permission from the self-publishing site Smashwords.1 At least Bloomberg shared info on its training data, unlike OpenAI.2

“The quality of machine learning and NLP models comes down to the data you put into them,” explained Gideon Mann, Head of Bloomberg’s ML Product and Research team. “Thanks to the collection of financial documents Bloomberg has curated over four decades, we were able to carefully create a large and clean, domain-specific dataset to train a LLM that is best suited for financial use cases. We’re excited to use BloombergGPT to improve existing NLP workflows, while also imagining new ways to put this model to work to delight our customers.”

Yes, well, and you’ve put in copyrighted text that belongs to your peers and competitors, without their permission. Will Dow Jones have something to say about this? Remember this statement given to a Bloomberg reporter on Feb 17th.

“Anyone who wants to use the work of Wall Street Journal journalists to train artificial intelligence should be properly licensing the rights to do so from Dow Jones,” Jason Conti, general counsel for News Corp.’s Dow Jones unit, said in a statement provided to Bloomberg News.

Well, I don’t know what to think really, when even media companies are buying this myth of “public data”. Does it show how powerful is the reality distortion field created by the AI tech companies, and how poorly educated media management is on the issues? (Despite my dedicated efforts… ;-) Or has Bloomberg made a decision to throw its lot in with the tech companies?

The letter

In case you’ve been under a rock, the open letter orchestrated by the Future of Life Institute has garnered a massive amount of press coverage. Factiva shows 618 stories on the topic in newspapers around the world, most citing Elon Musk’s signature in the headline.

The letter is here, and it is to my taste quite spacey, much more focused on the longer-term risks of super-intelligent AIs with some sense of volition than on the short-term risks of common-or-garden large language models.

The letter is being supported by people from two groups. On the one hand we have the AI Ethicists who worry about all the ill-effects of large language models saying nasty things, bringing the cost of misinformation even lower, super-powering existing social ills, further eroding sources of authority and trust in society, drowning us in bullshit, etc, etc. This is a relatively short-term risk. Or rather one that is already present.

The second group supporting the letter is drawn from the AI Safety community. These are the folks who are more concerned about the world being turned to paper clips by a rogue AI, and who are cautious in light of the uncertainty about how the models work exactly.

These two groups don’t really care about each other’s concerns (hat-tip Jon Stokes). The Ethicists think the worrying of the Safety folk is just bad scifi, and the Safety folks think the Ethicists are too focused on woke issues and don’t see the bigger picture of the power of these models.

Of course there were folks in both camps who refused to sign the letter. Emily Bender for example finds that the letter is an example of the over-hype of AI that is a huge problem on its own, and that the letter plays into the hands of the AI companies by buying into their vision of what’s happening. On the other side of the fence, Eliezer Yudkowsky didn’t sign the letter because he believes it is too soft. All AI activity should be banned immediately in his view, expressed in a Time magazine essay the day the letter was released.

The rule that most people aware of these issues would have endorsed 50 years earlier, was that if an AI system can speak fluently and says it’s self-aware and demands human rights, that ought to be a hard stop on people just casually owning that AI and using it past that point. We already blew past that old line in the sand. And that was probably correct; I agree that current AIs are probably just imitating talk of self-awareness from their training data. But I mark that, with how little insight we have into these systems’ internals, we do not actually know.

I definitely concede the last point. There is a lot we don’t know, and without access to the training data we know even less… (I also like the fact that Yudkowsky called out Microsoft CEO Satya Nadella’s appalling gangster talk, which had me upset in the Feb 12th issue of the newsletter. Seems like so long ago!)

It is in fact not even possible to define what model development the signatories want to limit, as we don’t know the technical specs of GPT-4.

The Ban

Well, it made the front page of my weekend Financial Times, so I guess you don’t need the newsletter for this info! Italy’s privacy regulator announced a (temporary) ban on ChatGPT. The statement says ChatGPT is not GDPR compliant. Moreover, children are signing up for ChatGPT and getting completely inappropriate responses. According to the article in Politico, which seems to have broken the story:

“The authority said the company lacks a legal basis justifying ‘the mass collection and storage of personal data ... to “train” the algorithms’ of ChatGPT. The company also processes data inaccurately, it added.

“ChatGPT also suffered a data breach and exposed users conversations and payment information of its users last week, the Italian authority said. It added OpenAI does not verify the age of users and exposes ‘minors to absolutely unsuitable answers compared to their degree of development and self-awareness.’”

Are we heading for a huge backlash? Too soon to say. Some snarky comments from Emad Mostaque (a signatory of The Letter BTW):

Emad @EMostaque

Do you think Italy banned ChatGPT because the Pope was annoyed at us putting him in so much drip? #Balenciaga

Emad @EMostaque

Well this is interesting https://t.co/DAcnq1M9es

The Backsliding?

Here I’m referring to the UK White Paper on AI Regulation, published on March 29th, which pessimists will see as preparing the groundwork for the re-introduction of the broad TDM exception.

The White Paper does not actually include the word “copyright” (except in its copyright statement of course…), and it does not rehearse the various arguments. But it says that the UK will implement the recommendations of the Chief Scientific Adviser’s review of AI and intellectual property policy, which had been published two weeks earlier.

That recommendation was that “Government should announce a clear policy position on the relationship between intellectual property law and generative AI to provide confidence to innovators and investors.”

Well, er, OK, but what should that position be? The report is curiously uncommitted on this, but here is one key sentence:

“The government should work with the AI and creative industries to develop ways to enable TDM for any purpose, and to include the use of publicly available content including that covered by intellectual property as an input to TDM (including databases).”

Which sounds very like it could be a re-entry of the broad TDM exception already floated by the IP Office and then rejected. However, in this case the creative industries are asked to help develop this method… Is that just a nod in their direction, or is it a hint that the government will try and orchestrate a discussion for some sort of statutory license?

Vallance and his co-authors also include this puzzling sentence: “The content produced by the UK’s world-leading creative industries, including for generative AI, is fundamental to the success of the tech sector. These sectors are the UK’s strengths and their success is central to realising our growth ambitions and they should continue to grow in partnership.”

How odd that the UK’s world-leading creative industries are characterized as having produced their works for generative AI… and that the success of the tech industry in the UK depends on the creative industries… Or does that mean success of the tech industry depends on the creative industries giving up all the value they’ve created in their content? Or am I being way too cynical?

The other point to note is that the UK is a loooong way from being a real player here. As noted in something of a dissenting view from policy advisor Jack Clark, “without DeepMind the UK’s share of the citations amongst the top 100 recent AI papers drops from 7.84% to just 1.86%”, neck and neck with Hong Kong and Switzerland. DeepMind is owned by Google if you remember. He also points out how even if current aggressive plans to add computing capacity to UK research labs are implemented, “the entire nations’ compute capacity in 2026 [is] behind one relatively small frontier US lab in 2022.”

Seems to me that the UK regulators are better off supporting creative industries in their efforts to get a stake in this emerging ecosystem. We shall see, and await word from the creative industry associations in the UK.

Some more developments on licensing for AI training

At the interesting Town Hall called by the US Copyright Clearance Center on LinkedIn on the evening of March 30th, Gordon Crovitz and Steven Brill talked about their NewsGuard product, which they license to Microsoft for use in Bing Chat. The data ranks the reliability of news sources, to help in the model patching of Bing’s results. They also have a list of Top 100 misinformation narratives which they license to Microsoft as well, presumably so that Microsoft can filter out bad narratives that its language models produce. Again, this is not training for the LLMs themselves, but for the patches, the guardrails, etc.

In an interview with the Wall Street Journal for an article on media company claims on training data3, OpenAI’s Sam Altman was the tiniest bit more explicit about how the company understands the legal status of its training data. “We’ve done a lot on fair use,” he says, but he also admits that OpenAI has licensed data “when warranted… We're willing to pay a lot for very high-quality data in certain domains,” such as science, Mr. Altman said. Not said: Of course we don’t pay for high-quality data when we can just grab it without permission...

And another copyright case in the making

Independent AI engineer Shawn Presser is the person who prepared the Books3 dataset. He’s used the recent Bloomberg announcement to remind everyone that he is behind it. (Go here to see if your books are included). He also played a role in spreading the leaked Facebook LLaMA model, and he is getting involved in the defense of internet users subject to DMCA takedowns from Meta for leaking that model, funded by an anonymous US$ 200,000 donation. He says “I intend to initiate a DMCA counterclaim on the basis that neural network weights are not copyrightable.” He cites the recent US Copyright Office ruling denying copyright in AI-created images, arguing that it shows the models themselves cannot be protected by copyright.

The legal thinking sounds muddled (he’s shopping for a lawyer), but his motivation is clear: “This would be a step towards a future where copyright itself has less power.”

He clarifies: “I believe that businesses should have the right to control their own intellectual property. But it’s not obvious that NN weights are property. They’re created from society’s data, just as phone books were, and phone books aren’t subject to copyright.” Italics mine.

Sorry Shawn, 197,000 pirated ebooks are not “society’s data”.

One final cheerful thought

Francis Jervis @f_j_j_

If GPT-4 worries you more than this, I think you're worried about your place in capitalism, not AI safety

thedrive.com: Six F-16s Getting Autonomous Computer Brains For Combat Drone Trials. Dubbed Project VENOM, the findings gathered under the effort will feed into the service’s overarching Collaborative Combat Aircraft program.

1. Jack Bandy and Nicholas Vincent, ‘Addressing “Documentation Debt” in Machine Learning: A Retrospective Datasheet for BookCorpus’, 11 November 2021, https://openreview.net/forum?id=Qd_eU1wvJeu.

2. Shijie Wu et al., ‘BloombergGPT: A Large Language Model for Finance’ (arXiv, 30 March 2023), http://arxiv.org/abs/2303.17564.

3. Keach Hagey, Alexandra Bruell, Tom Dotan and Miles Kruppa, ‘Publishers Seek Pay For Help With AI’, The Wall Street Journal, 23 March 2023.
