Copyrighted Books Are Fair Use For AI Training. Here’s What To Know.

The use of AI systems has become part of our daily lives.
AFP via Getty Images

The sudden presence of generative AI systems in our daily lives has prompted many to question the legality of how AI systems are created and used. One question relevant to my practice: Does the ingestion of copyrighted works such as books, articles, photographs, and art to train an AI system render the system’s creators liable for copyright infringement, or is that ingestion defensible as a “fair use”?

Two court rulings answer this novel question, and the answer is: Yes, the use of copyrighted works for AI training is a fair use – at least under the specific facts of those cases and the evidence presented by the parties. But because the judges in both cases were somewhat expansive in their dicta about how their decisions might have been different, the rulings provide a helpful roadmap as to how other lawsuits might be decided, and how a future AI system might be designed so as not to infringe copyright. The rulings in the Meta and Anthropic cases deserve some attention. Let’s take a closer look.

More than 30 lawsuits have been filed in the past year or two, in all parts of the nation, by authors, news publishers, artists, photographers, musicians, record companies and other creators against various AI systems, asserting that using the authors’ respective copyrighted works for AI training purposes violates their copyrights. The systems’ owners invariably assert fair use as a defense.

The Anthropic Case

Anthropic planned to create a central library of “all the books in the world.”

Getty Images

The first decision, issued in June, involved a lawsuit by three book authors, who alleged that Anthropic PBC infringed the authors’ copyrights by copying several of their books (among millions of others) to train its text generative AI system called Claude. Anthropic’s defense was fair use.

Judge Alsup, sitting in the U.S. District Court for the Northern District of California, held that the use of the books for training purposes was a fair use, and that Anthropic’s conversion of print books it had purchased into digital copies was also a fair use. However, Anthropic’s use of pirated digital copies for purposes of creating a central library of “all the books in the world” for uses beyond training Claude was not a fair use. Whether Anthropic’s copying of its central library copies for purposes other than AI training was a fair use (apparently there was some evidence that this was going on, but on a poorly developed record) was left for another day.

It appears that Anthropic decided early on in its designing of Claude that books were the most valuable training materials for a system that was designed to “think” and write like a human. Books provide patterns of speech, prose and proper grammar, among other things. Anthropic chose to download millions of free digital copies of books from pirate sites. It also purchased millions of print copies of books from booksellers, converted them to digital copies and threw the print copies away, resulting in a massive central library of “all the books in the world” that Anthropic planned to keep “forever.” None of this activity was done with the authors’ permission.

Significantly, Claude was designed so that it would not reproduce any of the plaintiffs’ books as output. There was not any such assertion by the plaintiffs, nor any evidence that it did so. The assertions of copyright infringement were, therefore, limited to Claude’s ingestion of the books for training, to build the central library, and for the unidentified non-training purposes. Users of Claude ask it questions and it returns text-based answers. Many users use it for free. Certain corporate and other users of Claude pay to use it, generating over one billion dollars annually in revenue for Anthropic.

The Anthropic Ruling

Both decisions were from the federal district court in Northern California, the situs of Silicon Valley.

TNS

To summarize the legal analysis, Judge Alsup evaluated each “use” of the books separately, as a court must under the Supreme Court’s 2023 Warhol v. Goldsmith fair use decision. Turning first to the use of the books as training data, Alsup found that the use of the books to train Claude was a “quintessentially” transformative use which did not supplant the market for the plaintiffs’ books, and as such qualified as fair use.

He further found that the conversion of the purchased print books to digital files, where the print copies were thrown away, was also a transformative use akin to the one in the Supreme Court’s 1984 Betamax decision, in which the court held that the home recording of free TV programming for time-shifting purposes was a fair use. Here, Judge Alsup reasoned, Anthropic lawfully purchased the books and was merely format-shifting for space and search capability purposes, and, since the original print copy was discarded, only one copy remained (unlike the now-defunct ReDigi platform of 2018).

By contrast, the downloading of more than seven million pirated copies from pirate sites, which was illegal from the outset, for central library uses other than training could not be held to be a fair use as a matter of law, because the central library use was unjustified and the use of the pirated copies could supplant the market for the originals.

Anthropic Is Liable For Unfair Uses – The Cost of Doing Business?

The case will continue on the issue of damages for the pirated copies of the plaintiffs’ books used for central library purposes and not for training purposes. The court noted that the fact that Anthropic later purchased copies of plaintiffs’ books to replace the pirated copies will not absolve it of liability, but might affect the amount of statutory damages it has to pay. Statutory damages range from a minimum of $750 per work to a maximum of $150,000 per work for willful infringement.

It tempts one to wonder about all those other millions of copyright owners beyond the three plaintiffs – might Anthropic have to pay statutory damages for seven million copies if the pending class action is certified? Given the lucrativeness of Claude, could that be just a cost of doing AI business?
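To put rough numbers on that question, here is a back-of-the-envelope sketch. It simply multiplies the statutory minimum and maximum quoted above by seven million, and it assumes, purely for illustration, that every one of those copies would qualify for its own statutory award, which is far from certain. The constant and function names are my own.

```python
# Back-of-the-envelope exposure estimate. Every input is an illustrative
# assumption taken from the figures quoted in this article, not a prediction.
COPIES = 7_000_000     # "over seven million" pirated copies
MIN_AWARD = 750        # statutory minimum, in dollars
MAX_AWARD = 150_000    # statutory ceiling, in dollars


def exposure(count: int, award: int) -> int:
    """Total damages if every item drew the same award."""
    return count * award


low = exposure(COPIES, MIN_AWARD)
high = exposure(COPIES, MAX_AWARD)

print(f"Low end:  ${low:,}")    # $5,250,000,000
print(f"High end: ${high:,}")   # $1,050,000,000,000
```

Even at the statutory floor, the hypothetical total exceeds the roughly one billion dollars in annual revenue mentioned earlier, which is why the "cost of doing business" framing deserves some skepticism.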

The Meta Case

Meta’s decision to use shadow libraries to source books was approved by CEO Mark Zuckerberg.

AFP via Getty Images

The second decision, issued two days following the Anthropic decision, on June 25, involves thirteen book authors, most of them famous non-fiction writers, who sued Meta, the creator of a generative AI model called Llama, for using the plaintiffs’ books as training data.

Llama, like Claude, is free to download, but generates billions of dollars for Meta. Like Anthropic, Meta initially looked into licensing rights from book publishers, but eventually abandoned those efforts and instead downloaded the books it desired from pirate sites called “shadow libraries,” which were not authorized by the copyright owners to store their works. Also like Claude, Llama was designed not to produce output that reproduced its source material in whole or substantial part, the record indicating that Llama could not be prompted to reproduce more than 50 words from the plaintiffs’ books.

Judge Chhabria, also in the Northern District of California, held Meta’s use of plaintiffs’ works to train Llama was a fair use, but he did so very reluctantly, chiding the plaintiffs’ lawyers for making the “wrong” arguments and failing to develop an adequate record. Chhabria’s decision is riddled with his perceptions of the dangers of AI systems potentially flooding the market with substitutes for human authorship and destroying incentives to create.

The Meta Ruling

Based on the parties’ arguments and the record before him, Judge Chhabria, like Judge Alsup, found that Meta’s use of the books as training data for Llama was “highly transformative,” noting that the purpose of the use of the books – for creating an AI system – was very different from the plaintiffs’ purpose for the books, which was education and entertainment. Rejecting plaintiffs’ argument that Llama could be used to imitate the style of plaintiffs’ writing, Judge Chhabria noted that “style is not copyrightable.”

The fact that Meta sourced the books from shadow libraries rather than authorized copies didn’t make a difference; Judge Chhabria (in my opinion rightly) reasoned that to say that a fair use depends on whether the source copy was authorized begs the question of whether the secondary copying was lawful.

Although plaintiffs tried to make the “central library for other purposes than training” argument that was successful in the Anthropic case, Judge Chhabria concluded that the evidence simply didn’t support that copies were used for purposes other than training, and noted that even if some copies were not used for training, “fair use doesn’t require that the secondary user make the lowest number of copies possible.” Since Llama couldn’t generate exact or substantially similar versions of plaintiffs’ books, he found there was no substitution harm, noting that plaintiffs’ lost licensing revenue for AI training is not a cognizable harm.

Judge Chhabria’s Market Dilution Prediction

Judge Chhabria warns that generative AI systems could dilute the market for lower-value mass market publications.

UCG/Universal Images Group via Getty Images

In dicta, clearly expressing frustration with the outcome in Meta’s favor, Judge Chhabria discussed in detail how he thought market harm could – and should – be shown in other cases, through the concept of “market dilution” – warning that a system like Llama, while not producing direct substitutes for a plaintiff’s work, could compete with and thus dilute the plaintiff’s market.

Some types of works, unlike award-winning fiction, may be more susceptible to this harm, he said, such as news articles or “typical human-created romance or spy novels.” But since the plaintiffs before him neither made those arguments nor presented any record to support them, he said, he could not rule on the issue. This opportunity is left for another day.

AI System Roadmap For Non-Infringement

The court decisions provide an early roadmap as to how to design an AI system.

Getty Images

Based on these two court decisions, here are my take-aways for building a roadmap for a non-infringing generative AI system using books:

The use of copyrighted books for purposes of training data will be a fair use, regardless of whether the source books are pirated or not;
The creation of a central library of books will be a fair use if you purchase lawful copies and do not create additional copies, even if you convert print copies to digital;
The creation of a central library of books where the source material consists of unlawful copies will likely not be a fair use;
The system must be designed so that the output of your system does not reproduce any source material books in an exact or substantially similar way (contrast Disney and Universal’s suit against Midjourney);
The system should be careful not to develop output that could demonstrably dilute the market for the source material, e.g., news articles or unremarkable romance novels; and, finally,
It doesn’t matter if your system is commercial and generates billions of dollars; the copyright owners are not entitled to a license fee for fair uses.
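For readers who build these systems, the output-reproduction point in the list above can be made concrete with a minimal sketch of a guardrail that measures the longest run of consecutive words a piece of model output shares with a source text. The function names are hypothetical, and the 50-word threshold is borrowed from the Meta record described earlier purely as an illustration; neither decision sets it as a legal safe harbor, and neither describes how Claude or Llama actually implements such a safeguard.

```python
import re

# Illustrative threshold only, taken from the 50-word figure in the Meta record.
MAX_VERBATIM_WORDS = 50


def words(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_shared_run(output: str, source: str) -> int:
    """Length, in words, of the longest word sequence found in both texts."""
    a, b = words(output), words(source)
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def exceeds_threshold(output: str, source: str,
                      limit: int = MAX_VERBATIM_WORDS) -> bool:
    """True if the output copies more than `limit` consecutive words."""
    return longest_shared_run(output, source) > limit


source = "it was the best of times it was the worst of times"
output = "He said it was the best of times, indeed."
print(longest_shared_run(output, source))  # 6
```

This brute-force comparison is far too slow for a library of millions of books and says nothing about "substantially similar" paraphrase, so treat it as a picture of the idea rather than a compliance tool.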

Source: https://www.forbes.com/sites/legalentertainment/2025/07/02/copyrighted-books-are-fair-use-for-ai-training-heres-what-to-know/