
Yesterday, Judge Alsup released his decision on Anthropic’s motion for summary judgment in the fast-moving lawsuit brought against it by three book authors, on behalf of a class of millions, objecting to Anthropic’s use of books for training its LLMs. We’ve recently posted about the class action aspects of the case, which are still pending, and the potential for settlement in this suit.
The decision is a major win for Anthropic: the court found that training AI on lawfully acquired copyrighted works was a fair use. Anthropic lost, however, on the issue of downloading pirated books to create a “central library,” and more is still to come on the issue of Anthropic using those works for AI training.
We will post more analysis later going into what the fair use ruling might mean for researchers and authors, as well as the effects it may have on libraries and others. But for now, we want to highlight some of the most important parts of the decision on the question of whether using books for AI training is fair use, which could affect how authors and researchers think about the rights they have to control downstream uses of their work and to use the works of others in computational and AI research.
How Anthropic acquired its books:
The Court explained:
An artificial intelligence firm downloaded for free millions of copyrighted books in digital form from pirate sites on the internet. The firm also purchased copyrighted books (some overlapping with those acquired from the pirate sites), tore off the bindings, scanned every page, and stored them in digitized, searchable files. All the foregoing was done to amass a central library of “all the books in the world” to retain “forever.” From this central library, the AI firm selected various sets and subsets of digitized books to train various large language models under development to power its AI services.
As for the size of the “central library,” Judge Alsup states that Anthropic acquired “at least five million copies” of books from LibGen and another two million from Pirate Library Mirror (PiLiMi). Anthropic also used Books3 (a well-described data set of about 183,000 books).
Then, Anthropic turned to purchasing and scanning books. “To find a new way to get books, in February 2024, Anthropic hired the former head of partnerships for Google’s book-scanning project, Tom Turvey.” He was tasked with obtaining “all the books in the world” while still avoiding as much “legal/practice/business slog” as possible. Turvey initially pursued licensing with publishers, and, according to the court, “[h]ad Turvey kept up those conversations, he might have reached agreements to license copies for AI training from publishers — just as another major technology company soon did with one major publisher. But Turvey let those conversations wither.” The result:
Turvey and his team emailed major book distributors and retailers about bulk purchasing their print copies for the AI firm’s “research library.” Anthropic spent many millions of dollars to purchase millions of print books, often in used condition. Then, its service providers stripped the books from their bindings, cut their pages to size, and scanned the books into digital form — discarding the paper originals. Each print book resulted in a PDF copy containing images of the scanned pages with machine-readable text (including front and back cover scans for softcover books). Anthropic created its own catalog of bibliographic metadata for the books it was acquiring.
How the Court described Anthropic’s uses of the books:
The court treated Anthropic’s use as two distinct sets of uses, 1) building its central library and 2) training its LLMs:
“Anthropic planned to “store everything forever; we might separate out books into categories[, but t]here [wa]s no compelling reason to delete a book” — even if not used for training LLMs. Over time, Anthropic invested in building more tools for searching its “general purpose” library and for accessing books or sets of books for further uses.”
“One further use was training LLMs. As a preliminary step towards training, engineers browsed books and bibliographic metadata to learn what languages the books were written in, what subjects they concerned, whether they were by famous authors or not, and so on — sometimes by “open[ing] any of the books” and sometimes using software. From the library copies, engineers copied the sets or subsets of books they believed best for training and “iterate[d]” on those selections over time. For instance, two different subsets of print-sourced books were included in “data mixes” for training two different LLMs.”
Fair Use
The first fair use factor examines the purpose and character of the use, which asks courts to evaluate, among other things, whether the use is transformative.
Training LLMs – “Transformative – spectacularly so”:
Anthropic used copies of Authors’ copyrighted works to iteratively map statistical relationships between every text-fragment and every sequence of text-fragments so that a completed LLM could receive new text inputs and return new text outputs as if it were a human reading prompts and writing responses. Authors further argue — and this order takes for granted — that such training entailed “memoriz[ing]” works by “compress[ing]” copies of those works into the LLM. Regardless, the “purpose and character” of using works to train LLMs was transformative — spectacularly so.
Anthropic’s LLMs have not reproduced to the public a given work’s creative elements, nor even one author’s identifiable expressive style (assuming arguendo that these are even copyrightable). Yes, Claude has outputted grammar, composition, and style that the underlying LLM distilled from thousands of works. But if someone were to read all the modern-day classics because of their exceptional expression, memorize them, and then emulate a blend of their best writing, would that violate the Copyright Act? Of course not. Copyright does not extend to “method[s] of operation, concept[s], [or] principle[s]” “illustrated[ ] or embodied in [a] work.”
The purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative. Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use.
Copies to build a “central library” for future use:
For the purchased and scanned copies, the court explained:
Anthropic purchased millions of print copies to “build a research library.” It destroyed each print copy while replacing it with a digital copy for use in its library (not for sharing nor sale outside the company). As to these copies, Authors do not complain that Anthropic failed to pay to acquire a library copy. Authors only complain that Anthropic changed each copy’s format from print to digital. On the facts here, that format change itself added no new copies, eased storage and enabled searchability, and was not done for purposes trenching upon the copyright owner’s rightful interests — it was transformative. Anthropic purchased its print copies fair and square.
Storage and searchability are not creative properties of the copyrighted work itself but physical properties of the frame around the work or informational properties about the work.
But for the pirated copies, the court said:
The person who copies the textbook from a pirate site has infringed already, full stop. This order further rejects Anthropic’s assumption that the use of the copies for a central library can be excused as fair use merely because some will eventually be used to train LLMs.
Anthropic did not use these copies only for training its LLM. Indeed, it retained pirated copies even after deciding it would not use them or copies from them for training its LLMs ever again. They were acquired and retained, as a central library of all the books in the world.
Bad faith is not the basis for this decision. Each use of a work must be analyzed objectively. The objective analysis here shows the initial copies were pirated to create a central, general-purpose library, as a substitute for paid copies to do the same thing.
Here, what Anthropic said about its acquisitions at the time — that they were made to “build[ ] a research library” while avoiding a “huge legal/practice/business slog” — are relevant in this regard. And, Anthropic’s actual use of these pirated copies was to create its central library of texts that, like any university or corporate library, stored the works’ well-organized facts, analyses, and expressive examples for various contingent uses, one being training.
The first factor points against fair use for the central library copies made from pirated sources — and no damages from pirating copies could be undone by later paying for copies of the same works.
On harm to the market, the court stated:
The copies used to train specific LLMs did not and will not displace demand for copies of Authors’ works, or not in the way that counts under the Copyright Act. Again, Authors concede that training LLMs did not result in any exact copies nor even infringing knockoffs of their works being provided to the public. If that were not so, this would be a different case. Authors remain free to bring that case in the future should such facts develop.
Authors contend generically that training LLMs will result in an explosion of works competing with their works — such as by creating alternative summaries of factual events, alternative examples of compelling writing about fictional events, and so on. This order assumes that is so. But Authors’ complaint is no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works. This is not the kind of competitive or creative displacement that concerns the Copyright Act. The Act seeks to advance original works of authorship, not to protect authors against competition.
Authors next contend that training LLMs displaced (or will) an emerging market for licensing their works for the narrow purpose of training LLMs. Anthropic argues that transactional costs would exceed Anthropic’s expected benefit from any such bargain, prompting it to cease dealing with any rightsholders or else to cease developing such technology altogether. Our record could support either account — so this order must assume Authors are correct. A market could develop. Even so, such a market for that use is not one the Copyright Act entitles Authors to exploit.
For the books Anthropic purchased, the court stated:
This order assumes Anthropic’s format change from print to digital displaced purchases of new digital copies that Anthropic would have made directly from Authors (had it not been able to purchase print copies in used condition). But for reasons stated under the first factor, such losses did not relate to something the Copyright Act reserves for Authors to exploit. It was a format change. The format change did not itself usurp the Authors’ rightful entitlements. This factor is thus neutral for the purchased library copies converted from print to digital.
But for the pirated books, the court concluded:
The copies used to build a central library and that were obtained from pirated sources plainly displaced demand for Authors’ books — copy for copy. Not every person who merely intends to make a fair use of a work is thereby entitled to a full copy in the meantime, nor even to steal a copy so that achieving this fair use is especially simple or cost-effective.
The End Result
The copies used to train specific LLMs were justified as a fair use. . . .
The copies used to convert purchased print library copies into digital library copies were justified, too, though for a different fair use. . . .
The downloaded pirated copies used to build a central library were not justified by a fair use. Every factor points against fair use . . . .
[A]s for any copies made from central library copies but not used for training, this order does not grant summary judgment for Anthropic. On this record in this posture, the central library copies were retained even when no longer serving as sources for training copies, “hundreds of engineers” could access them to make copies for other uses, and engineers did make other copies. Anthropic has dodged discovery on these points. We cannot determine the right answer concerning such copies because the record is too poorly developed as to them. Anthropic is not entitled to an order blessing all copying “that Anthropic has ever made after obtaining the data,” to use its words.