The Salesforce AI lawsuit alleges Salesforce built its XGen models using pirated book corpora (Books3 via RedPajama and The Pile), prompting authors to seek class certification, statutory damages, and destruction of infringing copies. The key issue is whether authors can prove concrete market harm under recent fair-use rulings.
Published: 2025-10-17 | Updated: 2025-10-17 | Author: COINOTAG
Allegation: Salesforce trained XGen on copyrighted books from Books3/RedPajama/The Pile.
Authors E. Molly Tanzer and Jennifer Gilmore seek class certification and statutory damages under the Copyright Act.
Recent rulings in cases involving Meta, OpenAI, and Anthropic (e.g., Judge Vince Chhabria's decision in the Meta case) make proof of market harm decisive.
What is the Salesforce AI lawsuit?
The Salesforce AI lawsuit is a federal copyright action filed in San Francisco by authors E. Molly Tanzer and Jennifer Gilmore alleging Salesforce used copies of copyrighted books to train its XGen family of large language models. The complaint claims ongoing infringement and seeks class certification, statutory damages, return of profits, destruction of infringing copies, and attorneys’ fees.
How did the plaintiffs describe the alleged training data sources?
The complaint states that Salesforce relied on datasets known in the research community as RedPajama and The Pile, and specifically on Books3, a book corpus of more than 196,000 books originally copied from the private tracker Bibliotik. Plaintiffs say Salesforce initially referenced “RedPajama-Books” when launching XGen in June 2023 and that these references were later removed or generalized to “publicly available sources.” Hugging Face removed Books3 from its platform amid copyright concerns, according to public reporting and the court filing.
What remedies are the authors seeking and why do judges matter?
The plaintiffs seek class certification for U.S. copyright holders whose works were allegedly used since October 2022, statutory damages, destruction of infringing copies, disgorgement of profits, and a declaration of willful infringement. Judges matter because court decisions in related cases, most notably rulings involving Meta, OpenAI, and Anthropic, have emphasized that authors must demonstrate concrete market harm rather than merely show their works were included in training data. Judge Vince Chhabria's decision in the Meta matter applied a fair-use analysis and found that showing mere use of the works is insufficient, setting a threshold this lawsuit must address.
Frequently Asked Questions
Will the Salesforce AI lawsuit hinge on demonstrating market harm?
Yes. Based on recent federal decisions, plaintiffs must present evidence that Salesforce’s use of copyrighted books caused measurable market harm or substituted for the authors’ works. Courts have weighed whether model training and downstream outputs usurp existing markets or whether use qualifies as fair use under statutory factors.
What precedent do other AI-related copyright cases provide?
Recent rulings favored defendants in several high-profile suits—judges found plaintiffs failed to prove market harm in cases involving OpenAI and Anthropic. At the same time, courts have criticized defendants for keeping persistent libraries of copyrighted material. These outcomes show courts are scrutinizing both the nature of the datasets (e.g., Books3/RedPajama/The Pile) and whether downstream commercialization harms rights holders.
Context and factual timeline
Key factual points reflected in the complaint and public disclosures include:
- June 2023: Salesforce launched XGen and referenced RedPajama-Books among its training sources.
- September–October 2023: Salesforce removed specific dataset references; Hugging Face removed Books3 over copyright concerns.
- 2022–2024: The complaint alleges Salesforce trained CodeGen on The Pile and later marketed models through Agentforce and XGen-Sales (XGen-Sales was released in October 2024).
- January 2024: Salesforce CEO Marc Benioff publicly said AI companies “ripped off” training data (statement cited in the complaint).
Expert commentary and official statements
Salesforce chief scientist Silvio Savarese is quoted on enterprise consistency and the capabilities of AI agents, stating that the partnership with Google to integrate Gemini models will set a new standard for agentic enterprise workflows. The complaint also references a Bloomberg interview in which Marc Benioff criticized industry training practices. Legal experts quoted in public reporting emphasize that courts will require concrete proof of economic harm to authors for statutory remedies to succeed.
Key Takeaways
- Alleged dataset use: Plaintiffs claim Salesforce used Books3/RedPajama/The Pile—datasets tied to Bibliotik copies—to train XGen.
- Legal hurdle: Proving tangible market harm is decisive given recent rulings; prior cases involving Meta, OpenAI, and Anthropic provide important precedent.
- Enterprise impact: The case may influence how vendors document training data and how enterprises deploy commercial AI models.
Conclusion
The Salesforce AI lawsuit places a major enterprise vendor at the center of a broader legal debate about AI training data, copyright, and market harm. The outcome will depend on the courts’ assessment of evidence that authors’ markets were harmed and whether use qualifies as fair use. COINOTAG will monitor filings and rulings; stakeholders should expect renewed scrutiny of dataset provenance and documentation as litigation proceeds.