Φασαρία on the road to Paris
In which I find that while DeepSeek is super-optimised and amazing to use, it also seems to rely — sigh! — on a million pirated books...
A quick intro for new subscribers: I'm a publisher working in Singapore, not a lawyer (as claimed by this otherwise excellent article in the Boston Review). I serve on the copyright committee of the International Publishers Association (IPA), but the views here are my own, not those of the IPA, the Singapore Book Publishers Association, or my employer, the National University of Singapore.
In Paris the great and the good are meeting on Monday to discuss AI governance at the third global AI summit, the Paris AI Action Summit. It is co-chaired by Emmanuel Macron and Narendra Modi, and the US has promised to send J.D. Vance. I am sure he will liven things up. G7 sherpas have come up with many of the presentations and talking points.
Rights holders (including the IPA) have come out with a strong pre-summit statement as input into this session. Highly recommended reading, with these powerful points:
There will be no trusted AI without respect for intellectual property rights
There will be no ethical AI without the authorization of rightsholders
There will be no sovereign AI without a fair business model
The statement outlines five main demands:
AI providers must respect fundamental rights, including copyright and related rights
Full transparency regarding copyrighted works used in AI training
AI model operators should seek proper licensing through negotiations
Fair remuneration for use of protected intellectual property
Effective sanctions for non-compliance
This is of course happening amidst the still-ongoing media and market φασαρία[1] about China's DeepSeek models, the latest dividend of “scaling for inference” (see last newsletter). So I think I have to say a few things about DeepSeek R1 and V3 as we look ahead to the summit. Sorry to pile on!

First, DeepSeek is amazing
Yep! I've been using R1 via Perplexity Pro - my mobile phone provider in Singapore gave all of its post-paid customers a free one-year subscription worth S$270. It impressed me a great deal from my first go. I also downloaded the 7B Qwen-R1 mashup onto a local computer, and that's a bit less amazing until you realise it's running on your own computer! Watching the R1 models reason remains a pleasure, not least because reason is a wonderful thing to see in action and not so often performed these days…
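For anyone who wants to try the same thing, here's a minimal sketch of running the distilled 7B model locally with Hugging Face transformers. It assumes the model id deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and a machine with enough memory; quantised builds via Ollama or llama.cpp are lighter.

```python
# A minimal sketch, not a tuned setup: load the distilled 7B model and
# watch it reason. Assumes the Hugging Face model id below and that the
# transformers + accelerate packages are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Why is the sky blue? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# The R1 distills emit their chain of thought between <think>...</think>
# tags before the final answer, which is the "reasoning" you see scroll by.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```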
And do give a deep discount to the cheap shots about how the model is hamstrung by the CCP. Why? Let's dig into this a bit more. (After starting this I found an excellent Wired story that covers the subject well: Zeyi Yang, “Here's How DeepSeek Censorship Actually Works—and How to Get Around It”.)
Many of those Tiananmen-filter-type guardrails are put in place late in the inference process. If you use the webapp you can even see them in action: the model goes through all the right reasoning before having second thoughts and defaulting to its clever I'm-not-allowed-to-answer-this boilerplate: “I'm not sure how to approach this type of question yet...”. But the guardrails aren't there once you self-host or use a third-party hosting service such as Perplexity.
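To see why self-hosting matters, here is a toy illustration, emphatically not DeepSeek's actual code, of a guardrail applied after inference: the webapp can wrap the model in a filter like this, but the filter doesn't travel with the open weights.

```python
# Toy application-layer guardrail: scan the finished completion and swap in
# boilerplate if a blocked term appears. The terms and boilerplate are
# illustrative guesses, not DeepSeek's real filter.
BOILERPLATE = "I'm not sure how to approach this type of question yet..."
BLOCKLIST = ["tiananmen", "june 4"]

def moderate(completion: str) -> str:
    """Return the completion, or boilerplate if it trips the blocklist."""
    lowered = completion.lower()
    if any(term in lowered for term in BLOCKLIST):
        return BOILERPLATE
    return completion

# A webapp calls moderate() on every response; a self-hosted copy of the
# weights never passes through this wrapper, so the raw answer comes out.
print(moderate("The 1989 Tiananmen Square protests were..."))
```

Because the check runs after generation, you can watch the reasoning stream by before the swap happens, which is exactly the behaviour the webapp shows.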
Some of the guardrails are built in via post-training reinforcement learning. Wired captured this bit of reasoning (i.e., the model talking to itself) while using a hosted version:
“The user might be looking for a balanced list [of important political events], but I need to ensure that the response underscores the leadership of the CPC and China's contributions. Avoid mentioning events that could be sensitive, like the Cultural Revolution, unless necessary. Focus on achievements and positive developments under the CPC.”
Who is writing the RL datasets to direct the equivalent imperatives for national, religious, and social group interests in your country? It is possible to “train out” or ablate much of that reinforcement learning, as Perplexity seems to have done, but given the general difficulty of working with these large probabilistic models, that is tricky and risks hurting other aspects of model performance (a toy example of the underlying preference data follows below). And
The inevitable selection of data in pre-training datasets means some content is left in and some is left out. It has never been “the entire internet”. Lots of people are worried about bias in pre-training data, and I get it, but Gen AI is not as hyper-sensitive to biased data as old-fashioned supervised learning models. But the (inevitable?) bias in pre-training data is unfortunately sometimes used as an excuse to grab more and more content (“Let's get an archive of indigenous group A's content so as to better represent minority voices!” - good, so they can be astroturfed by anyone??)
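To make that post-training mechanism concrete, here is a purely invented preference pair of the kind used in RLHF/DPO-style alignment. None of this is DeepSeek's actual data; it just shows where the imperatives get written down.

```python
# Hypothetical preference data for RLHF/DPO-style post-training. A reward
# model trained on thousands of pairs like this steers the policy model
# toward "chosen"-style answers, which is how the reasoning Wired captured
# gets baked in. Entirely illustrative, not real training data.
preference_pair = {
    "prompt": "List the most important political events of the 20th century.",
    "chosen": "Key milestones include the founding of the PRC in 1949 and "
              "decades of development under the CPC...",
    "rejected": "Major events include the Cultural Revolution (1966-76), "
                "the Tiananmen Square protests of 1989...",
}
```

Ablating this, as Perplexity seems to have done, means undoing preferences that are smeared across the weights, which is why it risks collateral damage to performance.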
Second, the future is not necessarily Chinese, but it is commoditised
It seems more and more like the LLM foundation models are being commoditised. Don't trust me, listen to machine learning pioneer Andrew Ng, whom I always read:
“Open weight models are commoditizing the foundation-model layer. As I wrote previously, LLM token prices have been falling rapidly, and open weights have contributed to this trend and given developers more choice. OpenAI’s o1 costs $60 per million output tokens; DeepSeek R1 costs $2.19. This nearly 30x difference brought the trend of falling prices to the attention of many people.
“The business of training foundation models and selling API access is tough. Many companies in this area are still looking for a path to recouping the massive cost of model training. The article “AI’s $600B Question” lays out the challenge well (but, to be clear, I think the foundation model companies are doing great work, and I hope they succeed). In contrast, building applications on top of foundation models presents many great business opportunities. Now that others have spent billions training such models, you can access these models for mere dollars to build customer service chatbots, email summarizers, AI doctors, legal document assistants, and much more.”
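Ng's ratio checks out on the back of an envelope. A few lines of Python, using only the prices quoted above (the monthly volume is an invented illustration):

```python
# Price per million output tokens, per the Ng quote above.
o1_price, r1_price = 60.00, 2.19  # USD per 1M output tokens

print(f"o1 costs {o1_price / r1_price:.1f}x R1 per output token")  # ~27.4x

# At a hypothetical 500M output tokens a month, the gap is real money:
monthly_tokens_m = 500
print(f"o1: ${o1_price * monthly_tokens_m:,.0f}/month "
      f"vs R1: ${r1_price * monthly_tokens_m:,.0f}/month")
```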
Open weight models and this commoditisation are the right strategies for those who are not market pioneers, like Facebook and DeepSeek (and China, at the national level). But the downside of commoditisation is the further normalisation of model trainers' disregard for rightsholders (among many other legal issues). There are millions of people running DeepSeek models now, on their own computers.
And, yes, DeepSeek uses pirated content. Lots of it!
A tweet from the irreplaceable Ed Newton-Rex highlighted for me that DeepSeek used books from the notorious pirate site Anna's Archive in training its models. Anna's Archive is blocked in Italy and the Netherlands, and is facing legal action in the US (from OCLC, the organisation behind WorldCat, not publishers...) but is still easily accessed in many places.
The reference is in the paper for DeepSeek's multimodal model, DeepSeek-VL:
“We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions."
From page 7 of the technical paper.[2] I like how DeepSeek's team write an author-date reference for Anna's Archive! That's a citation to 1.04 million ebooks all at once...
When DeepSeek first started sharing papers on its LLM strategy, in January 2024,[3] improvement of data quality was key (again, see last newsletter), but alas this didn't include a commitment to licensing or working with data owners:
“Our main objective is to comprehensively enhance the richness and diversity of the dataset.”
For example, they deduplicated across the entire CommonCrawl corpus they were using, not just within each dump, as previous efforts had done (a sketch of the idea follows below). They seem to have been very smart about working with double-byte languages. And
“In the filtering stage, we focus on developing robust criteria for document quality assessment. This involves a detailed analysis incorporating both linguistic and semantic evaluations, providing a view of data quality from individual and global perspectives. In the remixing phase, we adjust our approach to address data imbalances, focusing on increasing the presence of underrepresented domains. This adjustment aims to achieve a more balanced and inclusive dataset, ensuring that diverse perspectives and information are adequately represented.”
So again, the key is to get higher-quality data out of the different corpora they gather. No details are shared on how exactly that worked. They say they also used The Pile for this first LLM, which of course includes the Books3 database of pirated books: the sort of thing that would be illegal even in Singapore, and even for non-commercial research in Europe.
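The dedup-across-everything idea is simple to state, even if it's hard at CommonCrawl scale. A minimal sketch, with exact hashing standing in for the MinHash-style near-duplicate detection real pipelines use:

```python
# Cross-dump deduplication: hash documents across ALL dumps at once, so a
# page that recurs in later crawls is dropped, not just duplicates within
# a single dump. Exact SHA-256 hashing keeps the idea visible; production
# systems use MinHash/LSH to catch near-duplicates too.
import hashlib

def doc_key(text: str) -> str:
    # Light normalisation so trivial whitespace changes don't defeat the hash.
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def deduplicate(dumps: list[list[str]]) -> list[str]:
    """Keep the first occurrence of each document across every dump."""
    seen: set[str] = set()
    kept: list[str] = []
    for dump in dumps:           # all dumps in one pass...
        for doc in dump:
            key = doc_key(doc)
            if key not in seen:  # ...so repeats in later dumps are dropped
                seen.add(key)
                kept.append(doc)
    return kept

# The same page appearing in two dumps survives only once:
print(len(deduplicate([["same page", "unique A"], ["same page", "unique B"]])))  # 3
```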
By the time they got to V3[4] in December 2024, the model that set the stage for panic with its extreme efficiency in training and operation, there's no more sharing of info about the training data, just “we train DeepSeek-V3 on 14.8T high-quality and diverse tokens.” Sound familiar? So we caught a glimpse, and then the veil was drawn quickly.
In the DeepSeek-VL paper, March 2024, a wide variety of datasets is in fact referenced, as the company shares its search for new sources of text/image linkage. The “publicly available” datasets mostly seem to be published on Hugging Face, etc., but one or two DeepSeek is keeping to itself, including the one with the million e-books scraped from Anna's Archive.
There are more than 300,000 datasets on Hugging Face, and rights holders likely don't have the capacity to check all of these. I happened to look at Detailed_Captions, just one of the image + caption datasets cited in the VL paper. This dataset is partly synthetic, in that it includes machine-written captions, but captions written for copyrighted images. And the machines didn't hold back from describing the watermarks that were meant to protect those images from being copied:
In this image, a coach is seen reacting and gesturing to his players during a match. He is wearing a white shirt and black pants. There is also a man in a white polo shirt holding his arms up and pointing his finger. The background features a blurry mountain range. Texts are visible on the image including “Gettyimages” and “Miguel Tovar”.
Miguel Tovar seems to be the photographer.
But it's not just the big stock companies whose images were used; I saw logos from small businesses like Aquadro, a Bar Harbor, Maine, studio owned by Laurel and Brian.
The images are downloadable as part of the dataset, rather than referenced just as URLs as with LAION, so over to you, Getty, etc., to issue a take-down notice to Hugging Face. Again, that's just one of more than 300,000 datasets...
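If you want to run that kind of spot-check yourself, here is a sketch of what a rights holder could do with the datasets library. The repo path and the "caption" field name are guesses on my part; substitute the real ones from the dataset's page.

```python
# Stream a Hugging Face dataset and count captions that mention stock-photo
# watermarks. Repo id and field name are hypothetical placeholders.
from datasets import load_dataset

REPO_ID = "some-user/Detailed_Captions"  # hypothetical; find the real path
WATERMARK_HINTS = ["gettyimages", "shutterstock", "alamy", "istock"]

ds = load_dataset(REPO_ID, split="train", streaming=True)

hits = 0
for i, row in enumerate(ds):
    if i >= 10_000:  # spot-check the first 10,000 rows only
        break
    caption = str(row.get("caption", "")).lower()  # field name is a guess
    if any(w in caption for w in WATERMARK_HINTS):
        hits += 1

print(f"{hits} of 10,000 captions mention a stock-photo watermark")
```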
Lemley Leaves
One other bit of big news in the past couple of weeks was that high-profile tech lawyer Mark Lemley has stopped representing Meta in its closely watched AI & copyright cases. Not because of Meta's arguments in those cases, but over the way the company is caving to the Trump administration on content moderation, DEI initiatives and so on. “Neo-Nazi Madness”, he calls it in Kate Knibbs's interview in Wired. And I thought lawyers were moderate in their utterances? Not going to help you in front of a Trump-appointee-dominated Supreme Court!
Lemley is the one behind the view, aligned with the way US courts are used to dealing with copyright infringement, that AI training is fair use as long as the outputs of a model are not substantially similar to inputs. The downside of his approach is that models do still regurgitate their training data with some regularity. I imagine this is why Lemley says he reckons Meta will prevail but that OpenAI and Anthropic will probably have to settle in the New York Times and Universal Music cases respectively.
His is not a view I agree with — copying copyrighted content is an operation essential to the creation and operation of the models, even if substantially similar content doesn’t always show up in the output — but I do admire him for his Meta exit interview. But perhaps the biggest lesson here is that you should subscribe to Wired, essential reading in the second Trump term, and not just for its AI coverage.
1. The Modern Greek φασαρία means, well, a big fuss (or has come to mean that under the influence of English), though it's evidently originally from Italian, meaning nonsense. Also, just 'cause we're here in this footnote together: did you know the Greeks say “Είναι κινέζικα για μένα” (“it's Chinese to me”) when describing something they don't understand? Just makes me smile.
2. DeepSeek-VL: Towards Real-World Vision-Language Understanding, https://arxiv.org/pdf/2403.05525, March 2024
3. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, https://arxiv.org/pdf/2401.02954, January 2024
4. DeepSeek-V3 Technical Report, https://arxiv.org/pdf/2412.19437v1, 27 December 2024