Fair use and AI training: what courts have actually ruled
Three years into the AI copyright wars, the doctrinal map is more developed than commentary suggests. Here is the U.S. fair-use scoreboard, holding-by-holding.
The standard framing of fair use and AI training is that the law is a black box and we will not know the answer for years. That framing is now wrong. Several U.S. courts have issued substantive rulings on fair use as applied to AI training data; an initial doctrinal pattern has emerged. The cases do not all point the same direction, but they are not random. This is the scoreboard.
Thomson Reuters v. Ross Intelligence (D. Del., Feb 2025)
The first U.S. summary-judgment ruling on AI training was for the plaintiff. Judge Stephanos Bibas held that Ross's use of Westlaw headnotes — to train a non-generative legal-research tool intended to compete with Westlaw — was not fair use. The opinion rests heavily on the fourth factor: market substitution. Ross was building the same kind of product Westlaw sells, so the training amounted to direct economic substitution, and the court was unpersuaded that the use was transformative.
Ross is not directly applicable to large generative models, because the use case was so narrowly competitive. But the opinion's framing — that fair use weakens sharply when training material substitutes for, rather than complements, the rights holder's market — has been picked up by every plaintiff in the books and news cases since.
Bartz v. Anthropic (N.D. Cal., Jun 2025)
The bifurcated ruling in Bartz is now the most-cited fair-use opinion in AI litigation. Judge William Alsup separated two distinct uses. Training a model on legitimately acquired books, he held, was fair use: the analysis weighed transformativeness heavily and treated the model's purpose as functionally different from that of the original works. Acquiring and retaining pirated copies of the same books from shadow libraries was not fair use, regardless of the downstream training purpose.
The doctrinal innovation here is that fair use is sensitive to provenance. The same downstream use can be lawful or unlawful depending on how the training corpus was assembled. After Bartz, every AI lab in the country is auditing its training data sources. Concord Music v. Anthropic, Kadrey v. Meta, and the OpenAI books cases all now turn substantially on this provenance question.
Kadrey v. Meta (N.D. Cal., 2025)
Judge Vince Chhabria reached a similar provenance-driven outcome. Meta obtained partial summary judgment on a record limited to legitimately acquired Llama training material. Plaintiffs' DMCA §1202 (copyright-management-information) claims and the pirated-corpus theory survived. The opinion, read together with Bartz, produces a coherent if uncomfortable doctrine: training on lawfully acquired material may be transformative fair use; training on pirated material is not, and §1202 violations create a separate liability track.
NYT v. OpenAI (S.D.N.Y., 2024–present)
NYT v. OpenAI has not yet produced a fair-use ruling. The April 2024 motion-to-dismiss order denied OpenAI's request to dismiss the core direct-infringement and §1202 claims. The 2025 magistrate-judge order requiring preservation of roughly 20 million ChatGPT and API conversation logs has reshaped discovery in every other publisher case.
What the NYT case will produce, on summary judgment, is a fair-use opinion specifically about news content. The fair-use analysis for news differs in important ways from that for books: news fact-content is lightly protected, but expression in news writing is fully protected, and the substitution analysis is more direct because Times-quality summarization is itself a service the Times sells.
Authors Guild v. OpenAI (S.D.N.Y., Oct 2025)
The October 2025 ruling in Authors Guild v. OpenAI shifted the doctrinal terrain. The court denied OpenAI's motion to dismiss the output-infringement theory: short, plot-level summaries of plaintiffs' novels may infringe even when they do not quote verbatim. That holding does not address training-data fair use. It does, however, materially complicate the cleanest defense in the books cases — that any infringement, if it exists, is at the input rather than the output level.
What the pattern says
Three things, conservatively. First, courts are willing to find training-data fair use, but only on a clean record: legitimately acquired corpus, transformative downstream purpose, no DMCA §1202 problems, no direct market substitution. The Bartz/Kadrey path exists. Second, courts are unwilling to find training-data fair use when the corpus is pirated. The provenance question now functionally precedes the four-factor analysis. Third, the locus of dispute is migrating from training inputs to model outputs. Authors Guild and the German GEMA v. OpenAI litigation over song-lyric outputs point in the same direction. The next two years of AI copyright litigation will be substantially about what the model produces, not what it learned from.
None of this is settled appellate law. The Ninth Circuit has not spoken; the Second Circuit will speak first, on the NYT case. Until then, the district-court doctrine sketched above is what AI labs and rights holders are operating under.