The Do Right Gang goes to the White House
The Biden White House went for another AI photo op and announced that it had “secured” voluntary commitments from seven AI companies, “marking a critical step toward developing responsible AI. As the pace of innovation continues to accelerate, the Biden-Harris Administration will continue to remind these companies of their responsibilities and take decisive action to keep Americans safe…”
The announcement got a lot of uncritical coverage. The New York Times’ Kevin Roose (the one Sydney loves…) was an exception, at least in that he read the commitments and broke down what each of them means, concluding:
“Overall, the White House’s deal with A.I. companies seems more symbolic than substantive. There is no enforcement mechanism to make sure companies follow these commitments, and many of them reflect precautions that A.I. companies are already taking.”
But I think Roose was actually a bit too soft here. Let me just highlight three points where he could have dug a bit deeper.
Commitment 3: Invest in cybersecurity and insider threat safeguards to protect proprietary and unreleased model weights
Companies making this commitment will treat unreleased AI model weights for models in scope as core intellectual property for their business, especially with regards to cybersecurity and insider threat risks. This includes limiting access to model weights to those whose job function requires it and establishing a robust insider threat detection program consistent with protections provided for their most valuable intellectual property and trade secrets…
How nice to see intellectual property make an appearance in this statement! But how sad and disappointing that the intellectual property of publishers and creators is completely ignored, especially when some of America’s most potent creators have called a once-a-century industrial action over the issue! (The last time both the US Writers Guild and the Screen Actors Guild were on strike at the same time was in 1960.)
No, the government is pushing the AI companies to tell the world that they will treat their future model weights as “core intellectual property”. Model weights (the parameters as they have been arranged and adjusted by running training data, i.e. works, through the model many times) must be guarded carefully. Now they are not just “most valuable intellectual property” but on par with trade secrets. (At least until they release the weights open source…)
So in a few short months we’ve gone from at least some vague transparency on training data and model training architecture, to OpenAI claiming that any information about training processes or data is business confidential, to now, with the White House saying all that information is best kept secret even from those within the companies whose jobs don’t require that they know. This doesn’t seem to set the stage for better science on understanding LLMs, or for more transparency and informed regulation. And it’s hard to see how LLMs are so strategic to national security that the White House has to get involved in cheerleading for greater security.1
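For readers who haven’t poked at a model directly, here is a minimal, purely illustrative sketch (a toy PyTorch model, nothing to do with any lab’s actual setup) of what “the weights” are: numeric parameters adjusted by training, usually serialized into a file, and that file is essentially the asset the commitment wants locked down.

```python
# Illustrative toy only: what "model weights" look like in practice.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # a toy model with a handful of parameters

# The weights are just named tensors of numbers that training has adjusted.
for name, tensor in model.named_parameters():
    print(name, tuple(tensor.shape))  # e.g. weight (2, 4), bias (2,)

# "Securing the weights" largely means controlling who can read files like this.
torch.save(model.state_dict(), "model_weights.pt")
```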
Commitment 5: Develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated, including robust provenance, watermarking, or both, for AI-generated audio or visual content.
Companies making this commitment recognize that it is important for people to be able to understand when audio or visual content is AI-generated. To further this goal, they agree to develop robust mechanisms, including provenance and/or watermarking systems for audio or visual content created by any of their publicly available systems within scope introduced after the watermarking system is developed. They will also develop tools or APIs to determine if a particular piece of content was created with their system. Audiovisual content that is readily distinguishable from reality or that is designed to be readily recognizable as generated by a company’s AI system—such as the default voices of AI assistants—is outside the scope of this commitment.… [continues]
Very few of the news reports highlighted that this labelling applies to audio and visual output only. Apparently it is not important that people understand when they are interacting with an AI in text, whether receiving AI-generated email or chatting with a bot. Tim Wu, who was until recently at the White House, has argued that there should be such a requirement. Yuval Harari has taken the logic further, arguing that we should punish those who fake human interactions the way we punish those who fake money. It is arguably just as bad for social trust!
I can see that companies making their models available via API would not want to limit their businesses in this way. And the companies are given an out here by not needing to watermark content that is “readily distinguishable from reality”. I wonder what content that is, in a world where the most respected leaders in the Senate are calling for hearings to talk about alien technology being used by secret groups in the military.
OpenAI has been watermarking DALL-E 2 output since release, and Stability AI is not a party to this commitment, even though it is their open-source model that is most easily adapted for harmful purposes.
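To make the embed-and-detect round trip the commitment gestures at a little more concrete, here is a toy sketch. It is not any company’s scheme (real provenance work uses signed metadata or statistical watermarks far more robust than hiding bits in pixels); it only shows the shape of a “was this generated by our system?” check.

```python
# Toy watermark demo: hide a known tag in the low bits of an image, then detect it.
# Purely illustrative; not robust and not any vendor's actual approach.
import numpy as np

TAG = np.frombuffer(b"ai-generated", dtype=np.uint8)

def embed_tag(pixels: np.ndarray) -> np.ndarray:
    """Write the tag's bits into the least-significant bits of the first pixels."""
    out = pixels.copy().ravel()
    bits = np.unpackbits(TAG)
    out[: bits.size] = (out[: bits.size] & 0xFE) | bits
    return out.reshape(pixels.shape)

def detect_tag(pixels: np.ndarray) -> bool:
    """Read back the low bits and compare them against the known tag."""
    bits = pixels.ravel()[: TAG.size * 8] & 1
    return np.array_equal(np.packbits(bits), TAG)

image = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
print(detect_tag(image))             # almost certainly False for an untagged image
print(detect_tag(embed_tag(image)))  # True once the tag has been embedded
```

Note how much the scheme depends on the generator cooperating: anyone re-encoding or lightly editing the image strips a naive watermark like this, which is why “robust provenance” is the hard part of the commitment.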
Commitment 7: Prioritize research on societal risks posed by AI systems, including on avoiding harmful bias and discrimination, and protecting privacy
Er, how about actually protecting privacy rather than just agreeing to do more research on protecting privacy… After all, the FTC is already investigating some of these same companies for not protecting privacy…
Billions not Millions
I mentioned the Screen Actors Guild (SAG) and the Writers Guild strikes. AI is not the only issue here, but it is certainly not just a bargaining chip. I remember being very impressed by Paul Fleming, General Secretary at Equity, the UK performing arts union, in his testimony to the House of Lords last year, for his grasp of the real risks to performers. So it’s no surprise to me that this is a live issue for SAG members now, as actors talk about going in for a day’s worth of full-body and facial scans for a minor role, and not being told how and why their physical likeness and voice will be used.
Creators are fired up, and it’s hard for me to see how governments can ignore their interests in this discussion, as many have so far. Aside from the strikes, and the recent spate of lawsuits, there was this recent joint statement from recording industry and publishing bodies like the Federation of European Publishers and STM (representing Science, Technology and Medicine publishers). Others are on the way.
And the big news media companies are gearing up. Semafor has a must-read report that Barry Diller’s IAC is joining forces with The New York Times, News Corp and Axel Springer in a coalition to potentially take legal action, and to press for legislative action, on the use of their copyrighted material for training LLMs.
The quotes are getting better! Media companies have some media game after all?
“The most immediate threat they see is a possible shift at Google from sending traffic to web pages to simply answering users’ questions with a chatbot. That nightmare scenario, for [IAC CEO Joey] Levin, would turn a Food & Wine review into a simple text recommendation of a bottle of Malbec, without attribution.
“The machine doesn’t drink any wine or swirl any wine or smell any wine,” Levin said.
“Search was designed to find the best of the internet,” he said. “These large language models, or generative AI, are designed to steal the best of the internet.”
And these publishers are talking billions, not millions.
Meanwhile other companies are doing deals. The Associated Press announced a partnership with OpenAI to license its archive to the company. So, er, I guess it is possible to establish a licensing market for AI training material? We don’t know how much money was involved, so we’re not sure why the AP was so quick to cross the picket line…
And then there’s Bloomberg…
Long-time readers of the newsletter will remember how disappointed I was back in April with Bloomberg’s decision to train its in-house LLM on the Pile, “publicly available data” which includes a minimum of 197,000 pirated ebooks, huge amounts of news content from its competitors, and many other sources of publisher material used without permission. I mean, great idea to learn more about the tech, but some recognition of your own interests as a producer of content would have been useful.
My all-time favourite AI podcast, Sam Charrington’s TWIML, recently ran an interview with David Rosenberg, head of the machine learning strategy team in the Office of the CTO at Bloomberg. Rosenberg is very open in this session, and I learned a lot about the technical reasons for the company’s decision to proceed as it did, and about its process for creating the model. (Ironically, part of the reason they created their own model is that they didn’t trust companies like OpenAI to process Bloomberg data…)
TL;DL: The model is not being used in production at the moment, evaluation continues to be difficult, and hallucination remains a problem for Bloomberg just as it does for anyone else (despite half the training data being Bloomberg’s own corpus of information: quality of data does not solve hallucination!). It looks like the first use cases to eventually go into production will be internal, a sort of co-pilot helping Bloomberg’s teams query and extract value from their own massive database, helping with tasks like Named Entity Recognition (tagging people, companies and so on), all the good old Natural Language Processing we were working on before the models started predicting text.
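To make that “good old NLP” concrete, here is a minimal sketch of entity tagging using spaCy, an open-source library (chosen purely for illustration; it has nothing to do with Bloomberg’s internal stack).

```python
# Classic Named Entity Recognition with spaCy (illustrative only).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barry Diller's IAC is joining forces with Axel Springer and News Corp.")

# Print each detected entity and its type, e.g. "Barry Diller" PERSON, "IAC" ORG.
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Tagging people and companies like this is exactly the kind of extraction task where a domain-specific corpus helps, and where hallucination matters less because the output is grounded in the input text.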
Not particularly positive
Google, meanwhile, is going around meeting newspaper publishers, dangling US$5m grants and hoping to sell them on its own Genesis tool, a sort of newsroom assistant for journalists. Casey Newton at Platformer is not so sure.
Still, I find it hard not to be cynical about these developments, particularly in Google’s case. Here you have a company that reshaped the web to its own benefit using Chrome, search and web standards; built a large generative AI model from that web using (in part) unlicensed journalism from news publishers; and now seeks to sell that intelligence back to those same publishers in the form of a new subscription product.
Courts will decide whether all that is legal. But if you value competition, digital media, or the web, none of these developments strike me as particularly positive.
And if in addition you value publishing, a functioning marketplace of ideas, authorial integrity, a media ecosystem that retains some elements of trust, these developments seem even less positive.
1. Other kinds of AI might need that sort of protection. But LLMs?