Category Archives: Law and Policy

On The New NIH Indirect Cost Guidance

Posted February 18, 2025
NIH cuts are an emergency for hospitals (Photo: Eric Harbeson, CC-BY)

A little over a week ago, the National Institutes of Health issued a new guidance policy on indirect costs in Federal grant awards. Presently, NIH negotiates the indirect cost rate with individual institutions through a carefully regulated process that ensures an appropriate rate for a given institution’s unique circumstances, while also providing robust safeguards and auditing requirements to ensure that the rate is no greater than necessary. The new policy—similar to what the previous Trump administration proposed in 2017—would replace the negotiated rates with a standard rate of 15%. For comparison, the average rate among grantee institutions is around 27%, and many of the top research institutions currently have negotiated rates exceeding 50% or even 60%, which amount to tens of millions of dollars in some cases. The rate cap would apply both prospectively, to new grants, and retroactively, to all in-progress grants.

Indirect costs are the institutional expenditures that cannot be attributed to a particular research project. These are the costs of keeping the lights on, and the lab clean, and the MRI machine running. They pay for biocontainment labs, clinical testing facilities, and computer systems to analyze data, facilities that might each be shared by multiple NIH-funded projects. Though indirect, they are significant costs incurred by the institution and are an unavoidable part of conducting grant-funded research. From a government efficiency standpoint they are also highly desirable, in that they reduce unnecessary redundancy as well as exceedingly time-consuming and expensive bookkeeping.

Support for indirect costs in grant funds is essential to institutions’ ability to take part in Federal grant-making. If the new guidance policy is allowed to stand, universities collectively expect to lose many hundreds of millions of dollars, losses that in turn will lead to decreases in important, sometimes life-saving research. This new policy has raised serious concerns among affected institutions.

To say things are moving quickly in Washington, these days, would be an understatement. The administration has, of course, been releasing a flurry of sometimes sweeping executive orders. The pace is dizzying. In this case, in the space of just four days—two of which were a weekend—the NIH issued its guidance; at least three different lawsuits were filed, each in the District of Massachusetts; and a judge entered a temporary restraining order on the guidance. A hearing on the restraining order is scheduled for February 21 (the cases have not yet been consolidated, though they almost certainly will be if they proceed).

In our view, there are multiple clear violations of law in the guidance, both of statute and of the Constitution. While we await the hearing, we thought it worthwhile to highlight for authors some of the legal challenges it will face. Many others have already written on this topic—for more responses to the guidance policy, we recommend COGR’s collection of responses from the grantee community, as well as this post by Holden Thorp in Science and this post from Lisa Janicke Hinchliffe in Scholarly Kitchen (which draws important connections to scholarly publishing).

Some Fact-Checking

At the outset, an examination of the guidance reveals holes in its chain of authority that foreshadow problems with the new order. For example, the guidance asserts that “NIH may, however, use ‘a rate different from the negotiated rate for either a class of Federal awards or a single Federal award.’ 45 C.F.R. 75.414(c)(1).” The citation at the end refers to Title 45, part 75 of the Code of Federal Regulations, where NIH’s parent agency, the Department of Health and Human Services (HHS), codifies its grant guidelines. Here is the entire paragraph:

“Negotiated indirect cost rates must be accepted by all Federal agencies. A Federal agency may use a rate different from the negotiated rate for either a class of Federal awards or a single Federal award only when required by Federal statute or regulation, or when approved by the awarding Federal agency in accordance with paragraph (c)(3) of this section.” 45 C.F.R. § 75.414(c)(1) (emphasis added).

Note that this paragraph doesn’t say NIH generally may use a different rate, as the guidance appears to claim. Rather, it states the exception—they may not do so unless required to by statute or another regulation. Alternatively, under paragraph (c)(3) of the regulation, NIH must “implement, and make publicly available, the policies, procedures and general decision making criteria that their programs will follow to seek and justify deviations from negotiated rates.” (emphasis added). The paragraph doesn’t give NIH general permission; it constrains the agency.

The notice’s very next sentence provides arguably its most egregious claim, namely that the cap may be applied retroactively to existing grants, in defiance of institutions’ reliance on their contractually negotiated rates. The notice states that “NIH may deviate from the negotiated rate both for future grant awards and, in the case of grants to institutions of higher education (‘IHEs’), for existing grant awards. See 45 CFR Appendix III to Part 75, § C.7.a; see 45 C.F.R. 75.414(c)(1).” The citation, to Appendix III of Part 75, purports to support the claim that NIH may unilaterally, and retroactively, alter the terms of a contract. Here is the cited paragraph, in its entirety:

“Except as provided in paragraph (c)(1) of § 200.414, Federal agencies must use the negotiated rates in effect at the time of the initial award throughout the life of the Federal award. Award levels for Federal awards may not be adjusted in future years as a result of changes in negotiated rates. “Negotiated rates” per the rate agreement include final, fixed, and predetermined rates and exclude provisional rates. “Life” for the purpose of this subsection means each competitive segment of a project. A competitive segment is a period of years approved by the Federal awarding agency at the time of the Federal award. If negotiated rate agreements do not extend through the life of the Federal award at the time of the initial award, then the negotiated rate for the last year of the Federal award must be extended through the end of the life of the Federal award.” (emphasis added)

Once again, the cited text not only does not support the claim, but if anything forecloses it. This paragraph does not purport to give permission to change an existing agreement. To the contrary, the paragraph requires NIH to respect the negotiated rate for the life of the award. (Sec. 200.414(c)(1), referenced in the appendix, points to the OMB Uniform Guidance, and is essentially the same as HHS’s Sec. 75.414(a), which is discussed above). 

The end result is that the notice rests its legal authority to carry out the policy on regulations that in fact work against the new policy. Not a great start.

Violation of law and policy

Though federal agencies are ultimately under the direction of the President, this does not give the executive branch unfettered authority to dictate an agency’s policies. Agencies act as agents for carrying out the laws passed by Congress. This means that Congress has the last word as to what an agency is authorized to do or not do, or must do or not do. In fact, every act of an agency must, in some way, be tied to an act of Congress (admittedly, the connection is often fairly loose).

Congress has actually prohibited the president—this president—from capping the negotiated indirect cost rates. In 2017, when the president pressed Congress to limit indirect costs to 10% of the grant award, Congress not only rejected the idea but, in Sec. 226 of the Consolidated Appropriations Act of 2018 (p.394), forbade the president from pursuing the policy. Under the Sec. 226 rider, Congress provided that the existing regulations pertaining to indirect costs are to continue, and that the department may not expend funds in pursuit of a contrary policy. The rider has persisted in every appropriations bill since, including the most recent one.

The policy also is contrary to HHS’s own regulations that govern new policies such as this one. The notice purports to “implement, and make publicly available, the policies, procedures and general decision making criteria” as required by 45 CFR 75.414(c)(3) (discussed above), but in fact it satisfies only one of the three requirements. The notice publishes a policy (the 15% rate cap), but it does not make the procedures or general criteria available as required by the regulation. And publication must occur prior to the policy’s effective date, not simultaneously with it.

Under the Administrative Procedure Act (APA), in place since 1946, Congress has established the courts’ jurisdiction to review agency actions, such as this one, and to “decide all relevant questions of law.” The courts are empowered to set aside agency actions that are not in accordance with law, whether because they are contrary to the agency’s own regulations, acts of Congress, or the Constitution. As the three complaints observe, Congress has forbidden NIH from changing the system of negotiated indirect costs, and the new policy is also in violation of the agency’s own regulations.

Constitutional violations

The Constitution also has something to say about the guidance. In addition to the separation of powers problems, related to Congress’s actions discussed above, the retroactive nature of the guidance raises problems under the Fifth Amendment’s due process and takings clauses. These problems arise because the guidance professes to alter the indirect costs for existing grants, effectively unilaterally rewriting the grant agreements without regard to the institutions’ justified reliance on the binding nature of the agreements.

Contracts are a form of property, and contracts are binding on the U.S. Government to the same extent that they are on private parties. Though grant agreements are not formally contracts, the Supreme Court has observed that legislation enacted under the Spending Clause, as all grants are, is “much in the nature of a contract.” The grant agreements bind the grantee institution to numerous terms and conditions (some of which could be said to be consideration for the award) in return for federal financial support for the project. The grant agreements are clearly binding on both parties, and renegotiation of a contract requires the consent of both parties.

States, for their part, are explicitly forbidden from legislating their way out of contractual agreements, as NIH purports to do here, under the Constitution’s Contracts Clause, but that clause does not apply to the Federal government. Still, the Federal government is prohibited, under the Fifth Amendment, from taking private property (and again, contracts are property) for public use without just compensation, and from depriving a party of property (for any purpose) “without due process of law.” Grantee institutions rely on the government’s promise to follow through on the agreed-upon, negotiated indirect cost rate, and that reliance interest is in some cases hundreds of millions of dollars. NIH’s implementing this new policy with no notice (much less a hearing or opportunity to comment, as the APA would require) sounds a lot like deprivation of property without due process.

Conclusion

NIH-funded research has produced an astonishing number of highly significant, impactful discoveries, and the agency’s role in the biomedical research ecosystem is pivotal. The authors NIH has funded have won every major prize in the field many times over, and their research has saved and improved countless lives. But NIH’s track record is only as strong as its grantees—the authors who do the research and the institutions that employ them. If NIH is permitted to recklessly cut its promised support to those grantees, the inevitable resulting loss of research will be a great detriment to the scientific community, both at home and abroad, and to Americans in general.

Thomson Reuters v. Ross: The First AI Fair Use Ruling Fails to Persuade

Posted February 13, 2025
A confused judge, generated by Gemini AI

Facts of the Case

On February 11, Third Circuit Judge Stephanos Bibas (sitting by designation in the U.S. District Court for the District of Delaware) issued a new summary judgment ruling in Thomson Reuters v. ROSS Intelligence, overruling his previous decision from 2023, which held that a jury must decide the fair use question. The decision was one of the first to address fair use in the context of AI, though the facts of this case differ significantly from those of the many other pending AI copyright suits.

This ruling focuses on copyright infringement claims brought by Thomson Reuters (TR), the owner of Westlaw, a major legal research platform, against ROSS Intelligence. TR alleged that ROSS improperly used Westlaw’s headnotes and the Key Number System to train its AI system to better match legal questions with relevant case law. 

Westlaw’s headnotes summarize legal principles extracted from judicial opinions. (Note: Judicial opinions are not copyrightable in the US.) The Key Number System is a numerical taxonomy categorizing legal topics and cases. Clicking on a headnote takes users to the corresponding passage in the judicial text. Clicking on the key number associated with a headnote takes users to a list of cases that make the same legal point. 

Importantly, ROSS did not directly ingest the headnotes and the Key Number System to train its model. Instead, ROSS hired LegalEase, a company that provides legal research and writing services, to create training data based on the headnotes and the Key Number System. LegalEase created Bulk Memos—a collection of legal questions paired with four to six possible answers. LegalEase instructed lawyers to use Westlaw headnotes as a reference to formulate the questions in Bulk Memos. LegalEase instructed the lawyers not to copy the headnotes directly. 

ROSS attempted to license the necessary content directly from TR, but TR refused to grant a license because it thought the AI tool contemplated by ROSS would compete with Westlaw.

The financial burden of defending this lawsuit has caused ROSS to shut down its operations. ROSS countered TR’s copyright infringement claims with antitrust claims, but those claims were dismissed by the same judge.

The New Ruling

The court found that ROSS copied 2,243 headnotes from Westlaw. The court ruled that these headnotes and the Key Number System met the low legal threshold for originality and were copyrightable. The court rejected ROSS’s merger and scenes à faire defenses because, according to the court, the headnotes and the Key Number System were not dictated by necessity. The court also rejected ROSS’s fair use defense on the grounds that the 1st and 4th factors weighed in favor of TR. At this point, the only remaining issue for trial is whether some headnotes’ copyrights had expired or were untimely registered.

The new ruling has drawn mixed reactions—some say it undermines potential fair use defenses in other AI cases, while others dismiss its significance because its facts are unique. In our view, the opinion is poorly reasoned and disregards well-established case law. Litigants in future AI cases will need to demonstrate why the ROSS court’s approach is unpersuasive. Here are three key flaws we see in the ruling.

Problems with the Opinion

  1. Near-Verbatim Summaries are “Original”?

“A block of raw marble, like a judicial opinion, is not copyrightable. Yet a sculptor creates a sculpture by choosing what to cut away and what to leave in place. … A headnote is a short, key point of law chiseled out of a lengthy judicial opinion.” 

— the ROSS court

(Figure: example of a headnote and the uncopyrightable judicial text the headnote was based on.)

The court claims that the Westlaw headnotes are original both individually and as a compilation, and the Key Number System is original and protected as a compilation. 

“Original” has a special meaning in US copyright law: It means that a work has a modicum of human creativity that our society would want to protect and encourage. Based on the evidence that survived redaction, it is nearly impossible to find creativity in any individual headnote. The headnotes consist of verbatim copying of uncopyrightable judicial text, along with some basic paraphrasing of facts.

As we know, facts are not copyrightable, but expressions of facts often are. One important safeguard for protecting our freedom to reference facts is the merger doctrine. US law has long recognized that when there are only limited ways to express a fact or an idea, those expressions are not considered “original.” The expressions “merge” with the underlying unprotectable fact, and become unprotectable themselves. 

Judge Bibas gets merger wrong—he claims merger does not apply here because “there are many ways to express points of law from judicial opinions.” This view misunderstands the merger doctrine. It is the nature of human language to be capable of conveying the same thing in many different ways, as long as you are willing to do some verbal acrobatics. But when there are only a limited number of reasonable, natural ways to express a fact or idea—especially when textual precision and terms of art are used to convey complex ideas—merger applies. 

There are many good reasons for this to be the law. For one, this is how we avoid giving copyright protection to concise expression of ideas. Fundamentally, we do not need to use copyright to incentivize the simple restatement of facts. As the Constitution intended, copyright law is designed to encourage creativity, not to grant exclusive rights to basic expressions of facts. We want people to state facts accurately and concisely. If we allowed the first person to describe a judicial text in a natural, succinct way to claim exclusive rights over that expression, it would hinder, rather than facilitate, meaningful discussion of said text, and stifle blog posts like this one. 

As to the selection and arrangement of the Key Number System, the court claims that originality exists here, too, because “there are many possible, logical ways to organize legal topics by level of granularity,” and TR exercised some judgment in choosing the particular “level” with its Key Number System. However, cases are tagged with Key Numbers by an automated computer system, and the topics closely mirror what law schools teach their first-year students.

The court does not say much about why the compilation of the headnotes should receive separate copyright protection, other than that it qualifies as original “factual compilations.” This claim is dubious because the compilation is of uncopyrightable materials, as discussed, and the selection is driven by the necessity to represent facts and law, not by creativity. Even if the compilation of headnotes is indeed copyrightable, using portions of it that are uncopyrightable is decidedly not an infringement, because the US does not protect sui generis database rights.

  2. Can’t Claim Fair Use When Nobody Saw a Copy?

 “[The intermediate-copying cases] are all about copying computer code. This case is not.” 

— the ROSS court, conveniently ignoring BellSouth Advertising & Publishing Corp. v. Donnelley Information Publishing, Inc., 933 F.2d 952 (11th Cir. 1991), and Sundeman v. Seajay Society, Inc., 142 F.3d 194 (4th Cir. 1998).

In deciding whether ROSS’s use of Westlaw’s headnotes and the Key Number System is transformative under the 1st factor, the court took a moment to consider whether the available intermediate copying case law is in favor of ROSS, and quickly decided against it. 

Even though no consumer ever saw the headnotes or the Key Number System in the AI products offered by ROSS, the court claims that the copying of these constitutes copyright infringement because there existed an intermediate copy that contained copyright-restricted materials authored by Westlaw. And, according to the court, intermediate copying can only weigh in favor of fair use for computer code.

Before turning to the case law the court is overlooking here, we wonder whether Judge Bibas is in fact unpersuaded by his own argument. Under the 3rd fair use factor, he admits that only the content made accessible to the public should be taken into consideration when deciding how much is taken from a copyrighted work compared to the copyrighted work as a whole—which is contrary to what he argues under the 1st factor: that we must examine non-public intermediate copies.

Intermediate copying is the process of producing a preliminary, non-public work as an interim step in the creation of a new public-facing work. It is well established under US jurisprudence that any type of copying, whether private or public, satisfies a prima facie copyright infringement claim; but the fact that a work was never shared publicly—nor intended to be shared publicly—strongly favors fair use. For example, in BellSouth Advertising & Publishing Corp. v. Donnelley Information Publishing, Inc., the Eleventh Circuit decided that directly copying a competitor’s yellow pages business directory in order to produce a competing yellow pages was fair use when the resulting publicly accessible yellow pages the defendant created did not directly incorporate the plaintiff’s work. Similarly, in Sundeman v. Seajay Society, Inc., the Fourth Circuit concluded that it was fair use when the Seajay Society made an intermediate copy of the entirety of the plaintiff’s unpublished manuscript for a scholar to study and write about. The scholar wrote several articles about it, mostly summarizing important facts and ideas (while also using short quotations).

There are many good reasons for allowing intermediate copying. Clearly, we do not want ALL unlicensed copies to be subject to copyright infringement lawsuits, particularly when intermediate copies are made in order to extract unprotectable facts or ideas. More generally, intermediate copying is important to protect because it helps authors and artists create new copyrighted works (e.g., sketching a famous painting to learn a new style, translating a passage to practice your language skills, copying the photo of a politician to create a parody print t-shirt). 

  3. Suddenly, We Have an AI Training Market?

“[I]t does not matter whether Thomson Reuters has used [the headnotes and the Key Number System] to train its own legal search tools; the effect on a potential market for AI training data is enough.”

 — the ROSS court

The 4th fair use factor is very much susceptible to circular reasoning: if a user is making a derivative use of my work, surely that proves a market already exists or will likely develop for that derivative use, and, if a market exists for such a derivative use, then, as the copyright holder, I should have absolute control over such a market.

The ROSS court runs full tilt into this circular trap. In the eyes of the court, ROSS, by virtue of using Westlaw’s data in the context of AI training, has created a legitimate AI training data market that should be rightfully controlled by TR.

But our case law suggests that the 4th factor’s “market substitution” analysis considers only markets that are traditional, reasonable, or likely to be developed. As we have already pointed out in a previous blog post, copyright holders must offer concrete evidence to prove the existence, or likelihood of development, of a licensing market before they can argue that a secondary use serves as a “market substitute.” If we allowed a copyright holder’s protected market to include everything he is willing to receive licensing fees for, it would all but wipe out fair use in the service of stifling competition.

Conclusion

The impact of this case is currently limited, both because it is a district court ruling and because it concerns non-generative AI. However, it is important to remain vigilant, as the reasoning put forth by the ROSS court could influence other judges, policymakers, and even the broader public, if left unchallenged.

This ruling combines several problematic arguments that, if accepted more widely, could have significant consequences. First, it blurs the line between fact and expression, suggesting that factual information can become copyrightable simply by being written down by someone in a minimally creative way. Second, it expands copyright enforcement to intermediate copies, meaning that even temporary, non-public use of copyrighted material could be subject to infringement claims. Third, it conjures up a new market for AI training data, regardless of whether such a licensing market is legitimate or even likely to exist.

If these arguments gain traction, they could further entrench the dominance of a few large AI companies. Only major players like Microsoft and Meta will be able to afford AI training licenses, consolidating control over the industry. The AI training licensing terms will be determined solely between big AI companies and big content aggregators, without representation of individual authors or public interest.  The large content aggregators will get to dictate the terms under which creators must surrender rights to their works for AI training, and the AI companies will dictate how their AI models can be used by the general public. 

Without meaningful pushback and policy intervention, smaller organizations and individual creators cannot participate fairly. Let’s not rewrite our copyright laws to entrench this power imbalance even further.

AUTHORS ALLIANCE SUBMITS AMICUS BRIEF IN SEDLIK v. DRACHENBERG

Posted December 23, 2024
Kat Von D tracing the image of Miles Davis
in preparation for inking the tattoo

Although tattoos have existed for as long as humanity’s written history, legal disputes involving tattoos are a relatively new phenomenon. The case Sedlik v. Drachenberg, currently pending before the 9th Circuit, is particularly notable, as it marks the first instance of a court ruling on an artist’s use of copyrighted imagery in her tattoo art.

More importantly, the case presents the 9th Circuit with its first opportunity to interpret the fair use right in the wake of the Supreme Court’s 2023 Warhol decision. Authors Alliance has been closely monitoring circuit courts’ rulings on fair use and advocating for a proper interpretation of Warhol—including challenging the problematic fair use ruling issued by the 10th Circuit earlier this year, a decision that was later vacated in response to strong pushback from fair use advocates.

At the heart of the Sedlik v. Drachenberg legal debate are two creative professionals with very different backgrounds: 

The plaintiff in this case is Jeffery Sedlik, a successful professional photographer. He took a photo of jazz legend Miles Davis in 1989—the image at the center of the pending dispute.

The defendant, Kat Von Drachenberg (“KVD”), is a celebrity tattoo artist. In recent years, she has shifted away from for-profit tattooing, opting instead to ink clients for free. In 2017, she freehand-tattooed Miles Davis on a client’s arm, largely drawing from the 1989 photograph captured by Sedlik.

Interestingly, neither party is new to the world of litigation. Sedlik has established a reputation for aggressive copyright enforcement—even filing a case with the Copyright Claims Board on its first day of operation. KVD, on the other hand, was sued by a former employee in 2022. 

Sedlik’s claims were straightforward—he alleged that KVD’s tattoo, as well as her social media posts documenting her process of creating the tattoo, infringed his copyright in the Miles Davis photo.

For Sedlik to state a prima facie case of copyright infringement, he must prove that KVD had access to the Miles Davis photo (easy to prove in this case), and that the allegedly infringing tattoo and social media posts are substantially similar to the plaintiff’s photo. Here, the district court left the questions of substantial similarity and fair use to the jury, after denying the motions for summary judgment on the copyright infringement issues in May 2022.

The jury returned a verdict in January 2024 that the tattoo inked by KVD and some of her social media posts are not substantially similar to Sedlik’s photo. The jury also determined that the rest of KVD’s social media posts, documenting her process of creating the tattoo in question, were fair use. In short, the jury concluded there was no copyright infringement.

On May 3rd, 2024, the district court judge denied Sedlik’s motions for judgment as a matter of law and for a new trial. Faced with the jury’s adverse decision, Sedlik argued, among other things, that the jury erred in finding no substantial similarity. The judge, however, upheld the jury’s finding that KVD’s works had a different concept and feel from Sedlik’s photo and that KVD only copied the unprotected elements of the photo. Sedlik tried to argue that the legal question of fair use should not have been left to the jury. However, the court was unpersuaded, highlighting that Sedlik had remained silent on this procedural issue until after receiving an unfavorable verdict.  

Following the ruling on his motions, Sedlik appealed, and the case is now in front of the 9th Circuit. Anticipating the far-reaching consequences for artists and authors depending on how the 9th Circuit will interpret Warhol, Authors Alliance filed an amicus brief in support of KVD.

Both Sedlik and KVD argued that Warhol supported their side. Sedlik proposed a unique test: that a fair use must either target the original copyrighted work or otherwise have a compelling justification for the use. In our amicus brief, we illustrated how that is not the correct reading of Warhol. Under Warhol, a distinct purpose is required for the first factor to tilt in favor of fair use. The Warhol Court only analyzed “targeting” and “compelling justification” because Warhol’s secondary use of the Goldsmith photo shared the exact same purpose as the photo itself: appearing on the cover of a magazine. This is not the case with KVD’s freehand tattoo and Sedlik’s photo: they serve substantially distinct purposes.

Authors routinely borrow from others’ copyrighted works for reporting, research, and teaching, as well as to memorialize, preserve, or provide historical context. These uses by authors have historically been considered fair use, and often have purposes distinct from the copyrighted works used; but they do not necessarily “target” the works being used, nor do they have “compelling justifications” beyond the broad justification that authors are promoting the goal of copyright—“to promote the progress of science and the arts.”

In our brief, we also stressed how a successful commercial entity can nevertheless make noncommercial uses, as already demonstrated in the cases of Google Books and Hachette. We also argued that social media posts are not commercial by default, just by virtue of drawing attention to the original poster. Many successful authors maintain an active social media presence. The fact that authors invariably write to capture and build an audience through these sites does not automatically render their uses “commercial.” “Commerciality” under the fair use analysis has always been limited to acts of merchandising in the market, such as selling stamps, t-shirts, or mugs.

Finally, we explained to the court why copyright holders must offer concrete evidence to prove the existence, or likelihood of development, of a licensing market before they can argue that a secondary use serves as a “market substitute.” If we accepted Sedlik’s argument that his protected market includes everything he is willing to receive licensing fees for, it would all but wipe out fair use. We want authors and other creatives to continue to engage in fair use, including to document their creative processes—as KVD has done in this case in her social media posts—without being told they must pay for each instance of use as soon as a rightsholder demands it.

Books are Big AI’s Achilles Heel

Posted May 13, 2024

By Dave Hansen and Dan Cohen

Image of the Rijksmuseum by Michael D Beckwith. Image dedicated to the Public Domain.

Rapidly advancing artificial intelligence is remaking how we work and live, a revolution that will affect us all. While AI’s impact continues to expand, the operation and benefits of the technology are increasingly concentrated in a small number of gigantic corporations, including OpenAI, Google, Meta, Amazon, and Microsoft.

Challenging this emerging AI oligopoly seems daunting. The latest AI models now cost billions of dollars, beyond the budgets of startups and even elite research universities, which have often generated the new ideas and innovations that advance the state of the art.

But universities have a secret weapon that might level the AI playing field: their libraries. Computing power may be one important part of AI, but the other key ingredient is training data. Immense scale is essential for this data—but so is its quality.

Given their voracious appetite for text to feed their large language models, leading AI companies have taken all the words they can find, including from online forums, YouTube subtitles, and Google Docs. This is not exactly “the best that has been thought and said,” to use Matthew Arnold’s pointed phrase. In Big AI’s haphazard quest for quantity, quality has taken a back seat. The frequency of “hallucinations”—inaccuracies currently endemic to AI outputs—is cause for even greater concern.

The obvious way to rectify this lack of quality and tenuous relationship to the truth is by ingesting books. Since the advent of the printing press, authors have published well over 100 million books. These volumes, preserved for generations on the shelves of libraries, are perhaps the most sophisticated reflection of human thinking from the beginning of recorded history, holding within them some of our greatest (and worst) ideas. On average, they have exceptional editorial quality compared to other texts; they capture a breadth and diversity of content and a vivid mix of styles, and they use long-form narrative to communicate nuanced arguments and concepts.

The major AI vendors have sought to tap into this wellspring of human intelligence to power the artificial, although often through questionable methods. Some companies have turned to an infamous set of thousands of books, apparently retrieved from pirate websites without permission, called “Books3.” They have also sought licenses directly from publishers, using their massive budgets to buy what they cannot scavenge. Meta even considered purchasing one of the largest publishers in the world, Simon & Schuster.

As the bedrock of our shared culture, and as the possible foundation for better artificial intelligence, books are too important to flow through these compromised or expensive channels. What if there were a library-managed collection made available to a wide array of AI researchers, including at colleges and universities, nonprofit research institutions, and small companies as well as large ones?

Such vast collections of digitized books exist right now. Google, by pouring millions of dollars into its long-running book scanning project, has access to over 40 million books, a valuable asset they undoubtedly would like to keep exclusive. Fortunately, those digitized books are also held by Google’s partner libraries. Research libraries and other nonprofits have additional stockpiles of digitized books from their own scanning operations, derived from books in their own collections. Together, they represent a formidable aggregation of texts.

A library-led training data set of books would diversify and strengthen the development of AI. Digitized research libraries are more than large enough, and of substantially higher quality, to offer a compelling alternative to existing scattershot data sets. These institutions and initiatives have already worked through many of the most challenging copyright issues, at least for how fair use applies to nonprofit research uses such as computational analysis. Whether fair use also applies to commercial AI, or models built from iffy sources like Books3, remains to be seen.

Library-held digital texts come from lawfully acquired books—an investment of billions of dollars, it should be noted, just like those big data centers—and libraries are innately respectful of the interests of authors and rightsholders by accounting for concerns about consent, credit, and compensation. Furthermore, they have a public-interest disposition that can take into account the particular social and ethical challenges of AI development. A library consortium could distinguish between the different needs and responsibilities of academic researchers, small market entrants, and large commercial actors. 

If we don’t look to libraries to guide the training of AI on the profound content of books, we will see a reinforcement of the same oligopolies that rule today’s tech sector. Only the largest, most well-resourced companies will acquire these valuable texts, driving further concentration in the industry. Others will be prevented from creating imaginative new forms of AI based on the best that has been thought and said. By democratizing access, as they have always done, libraries can support learning and research for all, ensuring that AI becomes the product of the many rather than the few.

Further reading on this topic: “Towards a Books Data Commons for AI Training,” by Paul Keller, Betsy Masiello, Derek Slater, and Alek Tarkowski.

This week, Authors Alliance celebrates its 10th anniversary with an event in San Francisco on May 17 (We still have space! Register for free here) titled “Authorship in an Age of Monopoly and Moral Panics,” where we will highlight obstacles and opportunities of new technology. This piece is part of a series leading up to the event.

Writing About Real People Update: Right of Publicity, Voice Protection, and Artificial Intelligence

Posted March 7, 2024
Photo by Jason Rosewell on Unsplash

Some of you may recall that Authors Alliance published our long-awaited guide, Writing About Real People, earlier this year. One of the major topics in the guide is the right of publicity—a right to control use of one’s own identity, particularly in the context of commercial advertising. These issues have been in the news a lot lately as generative AI poses new questions about the scope and application of the right of publicity. 

Sound-alikes and the Right of Publicity

One important right of publicity question in the genAI era concerns the increasing prevalence of “sound-alikes” created using generative AI systems. The issue of AI-generated voices that mimic real people came to the public’s attention with the apparently convincing “Heart on My Sleeve” song imitating Drake and the Weeknd, and tools that facilitate creating songs imitating popular singers have since increased in number and availability.

AI-generated soundalikes are a particularly interesting use of this technology when it comes to the right of publicity because one of the seminal right of publicity cases, taught in law schools and mentioned in primers on the topic, concerns a sound-alike from the analog world. In 1986, the Ford Motor Company hired an advertising agency to create a TV commercial. The agency obtained permission to use “Do You Wanna Dance,” a song Bette Midler had famously covered, in its commercial. But when the ad agency approached Midler about actually singing the song for the commercial, she refused. The agency then hired a former backup singer of Midler’s to record the song, apparently asking the singer to imitate Midler’s voice in the recording. A federal court found that this violated Midler’s right of publicity under California law, even though her voice was not actually used. Extending this holding to AI-generated voices seems logical and straightforward—it is not about the precise technology used to create or record the voice, but about the end result the technology is used to achieve. 

Right of Publicity Legislation

The right of publicity is a matter of state law. In some states, like California and New York, the right of publicity is established via statute, and in others, it’s a matter of common law (or judge-made law). In recent months, state legislatures have proposed new laws that would codify or expand the right of publicity. Similarly, many have called for the establishment of a federal right of publicity, specifically in the context of harms caused by the rise of generative AI. One driving force behind calls for the establishment of a federal right of publicity is the patchwork nature of state right of publicity laws: in some states, the right of publicity extends only to someone’s name, image, likeness, voice, and signature, but in others, it’s much broader. While AI-generated content and the ways in which it is being used certainly pose new challenges for courts considering right of publicity violations, we are skeptical that new legislation is the best solution. 

In late January, the No Artificial Intelligence Fake Replicas and Unauthorized Duplications Act of 2024 (or “No AI FRAUD Act”) was introduced in the House of Representatives. The No AI FRAUD Act would create a property-like right in one’s voice and likeness, which is transferable to other parties. It targets voice “cloning services” and mentions the “Heart on My Sleeve” controversy specifically. But civil society organizations and advocates for free expression have raised alarm about the ways in which the bill would make it easier for creators to actually lose control over their own personality rights while also impinging on others’ First Amendment rights, due to its overbreadth and the property-like nature of the right it creates. While the No AI FRAUD Act contains language stating that the First Amendment is a defense to liability, it’s unclear how effective this would be in practice (and as we explain in the Writing About Real People guide, the First Amendment is always a limitation on laws affecting freedom of expression).

The Right of Publicity and AI-Generated Content

In the past, the right of publicity has been described as “name, image, and likeness” rights. What is interesting about AI-generated content and the right of publicity is that a person’s likeness can be used in a more complete way than ever before. In some cases, both their appearance and voice are imitated, associated with their name, and combined in a way that makes the imitation more convincing. 

What is different about this iteration of right of publicity questions is the actors behind the production of the soundalikes and imitations, and, to a lesser extent, the harms that might flow from these uses. A recent use of a different celebrity’s likeness in connection with an advertisement is instructive on this point. Earlier this year, advertisements emerged on various platforms featuring an AI-generated Taylor Swift participating in a Le Creuset cookware giveaway. These ads contained two separate layers of deception: most obviously, that Swift was AI-generated and did not personally appear in the ad, but more bafflingly, that they were not Le Creuset ads at all. Swift’s likeness and voice were appropriated by scammers to trick the public into thinking they were interacting with Le Creuset advertising, as part of a scam whereby users might pay for cookware they would never receive, or enter credit card details which could then be stolen or otherwise used for improper purposes. Compared to more traditional conceptions of advertising, the unfair advantages and harms caused by this use of Swift’s voice and likeness are much more difficult to trace.

It may be that the right of publicity as we know it (and as we discuss it in the Writing About Real People guide) is not well-equipped to deal with these kinds of situations. But it seems to us that codifying the right of publicity in federal law is not the best approach. Just as Bette Midler had a viable claim under California law, Taylor Swift would likely have a viable claim against Le Creuset if her likeness had been used by that company in connection with commercial advertising. The problem is not the “patchwork of state laws,” but that this kind of doubly-deceptive advertising is not commercial advertising at all. On a practical level, it’s unclear what party could even be sued over this kind of use. Certainly not Le Creuset. And it seems to us unfair to say that the creator of the AI technology should be left holding the bag just because someone used it for fraudulent purposes. The real fraudsters—anonymous but likely not impossible to track down—are the ones who can and should be pursued under existing fraud laws.

Authors Alliance has said elsewhere that reforms to copyright law cannot be the solution to any and all harms caused by generative AI. The same goes for the intellectual property-like right of publicity. Sensible regulation of platforms, stronger consumer protection laws, and better means of detecting and exposing AI-generated content are possible solutions to the problems that the use of AI-generated celebrity likenesses have brought about. To instead expand intellectual property rights under a federal right of publicity statute risks infringing on our First Amendment freedoms of speech and expression.

Authors Alliance Submits Long-Form Comment to Copyright Office in Support of Petition to Expand Existing Text and Data Mining Exemption 

Posted January 29, 2024
Photo by Simona Sergi on Unsplash

Last month, Authors Alliance submitted detailed comments in response to the Copyright Office’s Notice of Proposed Rulemaking in support of our petition to expand the existing Digital Millennium Copyright Act (DMCA) exemptions that enable text and data mining (TDM) as part of this year’s §1201 rulemaking cycle.

To recap: our expansion petitions ask the Copyright Office to modify the existing TDM exemption so that researchers who assemble corpora of ebooks or films on which to conduct text and data mining are able to share those corpora with other academic researchers who themselves qualify under the exemption. Under the current exemption, academic researchers are only able to share their corpora with other qualified researchers for purposes of “collaboration and verification.” This simple change would eliminate the need for duplicative efforts to remove digital locks from ebooks and films, a time- and resource-intensive process, broadening the group of academic researchers who are able to use the exemption.

Our comment argues that the existing TDM exemption has begun to enable valuable digital humanities research and teaching, but that the proposed expansion would go much further towards enabling this research and helping TDM researchers reach their goals. The comment is accompanied by 13 letters of support from researchers, educators, and funding organizations, highlighting the research that has been done in reliance on the exemption, and explaining why this expansion is necessary. Our thanks go out to our stellar clinical team at UC Berkeley’s Samuelson Law, Technology & Public Policy Clinic—law students Mathew Cha and Zhudi Huang, and clinical supervisor Jennifer Urban—for writing and submitting this comment on our behalf. We are also grateful to our co-petitioners, the Library Copyright Alliance and American Association of University Professors, for their support on this comment. 

Ambiguity in “Collaboration”

One reason the expansion is necessary is the uncertainty over what constitutes “collaboration” under the existing exemption. Researchers have open questions about what level of individual contribution to a project would make researchers “collaborators” under the exemption. As our comment explains, collaboration can come in a number of different forms, from “formal collaborations under the auspice of a grant, [to] ad hoc collaborations that result from two teams discovering that they are working on similar material to the same ends, or even discussions at conferences between members of a loose network of scholars working on the same broad set of interests.” But it is not clear which of these activities is “collaboration” for the purposes of the exemption. And this uncertainty has had a chilling effect on the socially valuable research made possible by the exemption. 

Costly Corpora Creation 

Our comment also highlights the vast costs that go into creating a usable corpus for TDM research. Institutions whose researchers are conducting TDM research pursuant to the exemption must lawfully own the works in question, or license them through a license that is not time-limited. But these costs pale in comparison to the required computing resources—a cost which is compounded by the exemption’s strict security requirements—and human labor involved in bypassing technical protection measures and assembling a corpus. Moreover, it’s important to recognize that there is simply not a tremendous amount of grant funding or even institutional support available to TDM researchers. 

Because corpora are so costly to assemble and create, we believe it to be reasonable to permit researchers to share their corpora with researchers at other institutions who want to conduct independent TDM research on these corpora. As the exemption currently stands, researchers interested in pre-existing corpora must duplicate the efforts of the previous researchers, incurring massive costs along the way. We’ve already seen indications that these costs can lead researchers to avoid certain research questions and areas of study altogether. As our comment explains, this “duplicative circumvention” can be avoided by changing the language of the exemption to permit corpora sharing between qualified researchers at separate institutions. 

Equity Issues

Worse still, not all institutions are able to bear these expenses. Our comment explains how the current exemption’s prohibition on sharing beyond collaboration and verification—and consequent duplication of prior labor—“create[s] barriers that can prevent smaller and less-well-resourced institutions from conducting TDM research at all.” This creates inequity in what type of institutions can support TDM projects, and what types of researchers can conduct them. The unfortunate result has been that large institutions that have “the resources to compensate and maintain technical staff and infrastructure” are able to support TDM research under the exemption, while smaller institutions are not. 

Values of Corpora Sharing

Our comment explains how allowing limited sharing of corpora under the exemption would go a long way towards lowering barriers to entry for TDM research and ameliorating the equity issues described above. Since digital humanities is already an under-resourced field, the effects of enabling researchers to share their corpora with other academic researchers could be quite profound. 

Researchers who wrote letters in support of the petition described a multitude of exciting projects, and have built “a rich set of corpora to study, such as a collection of fiction written by African American writers, a collection of books banned in the United States, and a curated corpus of movies and television with an ‘emphasis on racial, ethnic, sexual, and gender diversity.’” Many also recounted requests they have received from other researchers to use their corpora, and their frustration that the exemption’s prohibition on non-collaborative sharing, along with their own limited capacity for collaboration, prevented them from sharing these corpora. 

Allowing new researchers with new research questions to study these corpora could reveal new insights about these bodies of work. As we explain, “in the same way a single literary work or motion picture can evince multiple meanings based on the lens of analysis used, when different researchers study one corpus, they are able to pose different research questions and apply different methodologies, ultimately revealing new and original findings . . . . Enabling broader sharing and thus, increasing the number of researchers that can study a corpus, will allow a body of works to be better understood beyond the initial ‘limited set of research questions.’”

Fair Use

The 1201 rulemaking process for exemptions to DMCA § 1201’s prohibition on breaking digital locks requires that the proposed activity be a fair use. In the 2021 proceedings, the Office recognized TDM for research and teaching purposes as a fair use. Because the expansion we’re seeking is relatively minor, our comment explains that the types of uses we are asking the Office to permit researchers to make are also fair uses. Our comment explains that each of the four fair use factors favors fair use in the context of the proposed expansion. We further explain why the enhanced sharing the expansion would provide does not harm the market for the original works under factor four: because institutions must lawfully own (or license under a non-time-limited license) the works that their researchers wish to conduct TDM on, it makes no difference from a market standpoint whether researchers bypass technical protection measures themselves, or share another institution’s corpus. Copyright holders are not harmed when researchers at one institution share a corpus created by researchers at another institution, since both institutions must purchase the works in order to be eligible under the exemption. 

What’s Next?

If there are parties that oppose our proposed expansion, they have until February 20th to submit opposition comments to the Copyright Office. Then, on March 19th, our reply comments to any opposition comments will be due. We will keep our readers and members apprised as the process continues to move forward.

Hachette v. IA Amicus Briefs: Highlight on Privacy and Controlled Digital Lending

Posted January 16, 2024

Photo by Matthew Henry on Unsplash

Over the holidays you may have read about the amicus brief we submitted in the Hachette v. Internet Archive case about library controlled digital lending (CDL), which we’ve been tracking for quite some time. Our brief was one of 11 amicus briefs filed that explained to the court the broader implications of the case. Internet Archive has already posted a short overview of the others (representing 20 organizations and 298 individuals, mostly librarians and legal experts). 

I thought it would be worthwhile to highlight some of the important issues identified by these amici that did not receive much attention earlier in the lawsuit. This post is about the reader’s privacy issues raised by several amici in support of Internet Archive and CDL. Later this week we’ll have another post focused on briefs and arguments about why the district court inappropriately construed Internet Archive’s lending program as “commercial.” 

Privacy and CDL 

One aspect of library lending that’s really special is the privacy that readers are promised when they check out a book. Most states have special laws that require libraries to protect readers’ privacy, something that libraries enthusiastically embrace (e.g., see the ALA Library Bill of Rights) as a way to help foster free inquiry and learning among readers. Unlike Amazon, which keeps and tracks detailed reader information when you buy an ebook (dates, times, what page you spent time on, what you highlighted), libraries strive to minimize the data they keep on readers to protect their privacy. This protects readers from data breaches and other third-party demands for that data. 

The brief from the Center for Democracy and Technology, Library Freedom Project, and Public Knowledge spends nearly 40 pages explaining why the court should consider reader privacy as part of its fair use calculus. Represented by Jennifer Urban and a team of students at the Samuelson Law, Technology and Public Policy Clinic at UC Berkeley Law (disclosure: the clinic represents Authors Alliance on some matters, and we are big fans of their work), the brief masterfully explains the importance of this issue. From their brief, below is a summary of the argument (edited down for length): 

The conditions surrounding access to information are important. As the Supreme Court has repeatedly recognized, privacy is essential to meaningful access to information and freedom of inquiry. But in ruling against the Internet Archive, the district court did not consider one of CDL’s key advantages: it preserves libraries’ ability to safeguard reader privacy. When employing CDL, libraries digitize their own physical materials and loan them on a digital-to-physical, one-to-one basis with controls to prevent redistribution or sharing. CDL provides extensive, interrelated benefits to libraries and patrons, such as increasing accessibility for people with disabilities or limited transportation, improving access to rare and fragile materials, facilitating interlibrary resource sharing—and protecting reader privacy. For decades, libraries have protected reader privacy, as it is fundamental to meaningful access to information. Libraries’ commitment is reflected in case law, state statutes, and longstanding library practices. CDL allows libraries to continue protecting reader privacy while providing access to information in an increasingly digital age. Indeed, libraries across the country, not just the Internet Archive, have deployed CDL to make intellectual materials more accessible. And while increasing accessibility, these CDL systems abide by libraries’ privacy protective standards. 

Commercial digital lending options, by contrast, fail to protect reader privacy; instead, they threaten it. These options include commercial aggregators—for-profit companies that “aggregate” digital content from publishers and license access to these collections to libraries and their patrons—and commercial e-book platforms, which provide services for reading digital content via e-reading devices, mobile applications (“apps”), or browsers. In sharp contrast to libraries, these commercial actors track readers in intimate detail. Typical surveillance includes what readers browse, what they read, and how they interact with specific content—even details like pages accessed or words highlighted. The fruits of this surveillance may then be shared with or sold to third parties. Beyond profiting from an economy of reader surveillance, these commercial actors leave readers vulnerable to data breaches by collecting and retaining vast amounts of sensitive reader data. Ultimately, surveilling and tracking readers risks chilling their desire to seek information and engage in the intellectual inquiry that is essential to American democracy. 

Readers should not have to choose to either forfeit their privacy or forgo digital access to information; nor should libraries be forced to impose this choice on readers. CDL provides an ecosystem where all people, including those with mobility limitations and print disabilities, can pursue knowledge in a privacy-protective manner. . . . 

An outcome in this case that prevents libraries from relying on fair use to develop and deploy CDL systems would harm readers’ privacy and chill access to information. But an outcome that preserves CDL options will preserve reader privacy and access to information. The district court should have more carefully considered the socially beneficial purposes of library-led CDL, which include protecting patrons’ ability to access digital materials privately, and the harm to copyright’s public benefit of disallowing libraries from using CDL. Accordingly, the district court’s decision should be reversed.

The court below considered CDL copies and licensed ebook copies as essentially equivalent and concluded that the CDL copies IA provided acted as substitutes for licensed copies. Authors Alliance’s amicus brief points out some of the ways that CDL copies actually differ significantly from licensed copies. It seems to me that this additional point about the protection of reader privacy, and the free inquiry that comes with it, is exactly the kind of distinguishing public benefit that the lower court should have considered but did not. 

You can read the full brief from the Center for Democracy and Technology, Library Freedom Project, and Public Knowledge here. 

Licensing research content via agreements that authorize uses of artificial intelligence

Posted January 10, 2024
Photo by Hal Gatewood on Unsplash

This is a guest post by Rachael G. Samberg, Timothy Vollmer, and Samantha Teremi, professionals within the Office of Scholarly Communication Services at UC Berkeley Library. 

On academic and library listservs, there has emerged an increasingly fraught discussion about licensing scholarly content when scholars’ research methodologies rely on artificial intelligence (AI). Scholars and librarians are rightfully concerned that non-profit educational research methodologies like text and data mining (TDM) that can (but do not necessarily) incorporate usage of AI tools are being clamped down upon by publishers. Indeed, libraries are now being presented with content license agreements that prohibit AI tools and training entirely, irrespective of scholarly purpose. 

Conversely, publishers, vendors, and content creators—a group we’ll call “rightsholders” here—have expressed valid concerns about how their copyright-protected content is used in AI training, particularly in a commercial context unrelated to scholarly research. Rightsholders fear that their livelihoods are being threatened when generative AI tools are trained and then used to create new outputs that they believe could infringe upon or undermine the market for their works.

Within the context of non-profit academic research, rightsholders’ fears about allowing AI training, and especially non-generative AI training, are misplaced. Newly-emerging content license agreements that prohibit usage of AI entirely, or charge exorbitant fees for it as a separately-licensed right, will be devastating for scientific research and the advancement of knowledge. Our aim with this post is to empower scholars and academic librarians with legal information about why those licensing outcomes are unnecessary, and to equip them with alternative licensing language that adequately addresses rightsholders’ concerns.

To that end, we will: 

  1. Explain the copyright landscape underpinning the use of AI in research contexts;
  2. Address ways that AI usage can be regulated to protect rightsholders, while outlining opportunities to reform contract law to support scholars; and 
  3. Conclude with practical language that can be incorporated into licensing agreements, so that libraries and scholars can continue to achieve licensing outcomes that satisfy research needs.

Our guidance is based on legal analysis as well as our views as law and policy experts working within scholarly communication. While your mileage or opinions may vary, we hope that the explanations and tools we provide offer a springboard for discussion within your academic institutions or communities about ways to approach licensing scholarly content in the age of AI research.

Copyright and AI training

As we have recently explored in presentations and posts, the copyright law and policy landscape underpinning the use of AI models is complex, and regulatory decision-making in the copyright sphere will have ramifications for global enterprise, innovation, and trade. A much-discussed group of lawsuits and a parallel inquiry from the U.S. Copyright Office raise important and timely legal questions, many of which we are only beginning to understand. But there are two precepts that we believe are clear now, and that bear upon the non-profit education, research, and scholarship undertaken by scholars who rely on AI models. 

First, as the UC Berkeley Library has explained in greater detail to the Copyright Office, training artificial intelligence is a fair use—and particularly so in a non-profit research and educational context. (For other similar comments provided to the Copyright Office, see, e.g., the submissions of Authors Alliance and Project LEND). Maintaining this treatment as fair use is essential to protecting research, including text and data mining (TDM). 

TDM refers generally to a set of research methodologies that rely on computational tools, algorithms, and automated techniques to extract revelatory information from large sets of unstructured or thinly structured digital content. Not all TDM methodologies require AI models. For instance, the words that 20th-century fiction authors use to describe happiness can be found in a corpus of works simply by using algorithms that search for synonyms and variations of words like “happiness” or “mirth,” with no AI involved. But to find examples of happy characters in those books, a researcher would likely need to apply what are called discriminative modeling methodologies, which first train AI on examples of the qualities a happy character demonstrates, so that the AI can then search for occurrences within a larger corpus of works. This latter TDM process involves AI, but not generative AI; scholars have relied non-controversially on this kind of non-generative AI training within TDM for years. 
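The non-AI variety of TDM described above can be made concrete with a short sketch. The mini-corpus, synonym list, and function below are purely hypothetical illustrations of the "search for synonyms of happiness" example, not any particular research toolchain:

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for a collection of 20th-century fiction.
corpus = {
    "novel_a": "She felt such happiness and mirth that the whole room glowed.",
    "novel_b": "His joy was quiet; there was no mirth in the house that winter.",
    "novel_c": "They spoke of duty and of debt, never of delight or happiness.",
}

# A hand-picked synonym list; a real study would use a curated lexicon.
happiness_terms = {"happiness", "mirth", "joy", "delight", "glee"}

def term_frequencies(texts, terms):
    """Count occurrences of target terms across a corpus -- no AI involved."""
    counts = Counter()
    for text in texts.values():
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in terms:
                counts[word] += 1
    return counts

freqs = term_frequencies(corpus, happiness_terms)
print(freqs.most_common())
```

The discriminative-modeling variant would replace the fixed term list with a classifier trained on labeled examples of "happy" passages, but the input-output shape of the research task (corpus in, occurrences out) stays the same.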

Previous court cases like Authors Guild v. HathiTrust, Authors Guild v. Google, and A.V. ex rel. Vanderhye v. iParadigms have addressed fair use in the context of TDM and confirmed that the reproduction of copyrighted works to create and conduct text and data mining on a collection of copyright-protected works is a fair use. These cases further hold that making derived data, results, abstractions, metadata, or analysis from the copyright-protected corpus available to the public is also fair use, as long as the research methodologies or data distribution processes do not re-express the underlying works to the public in a way that could supplant the market for the originals. 

For the same reasons that these TDM processes constitute fair use of copyrighted works, the training of AI tools to do that text and data mining is also fair use. This is in large part because the purpose is similarly transformative (under Fair Use Factor 1) and because, just like “regular” TDM that doesn’t involve AI, AI training does not reproduce or communicate the underlying copyrighted works to the public (a consideration essential to the market-supplantation analysis under Fair Use Factor 4). 

But, while AI training is no different from other TDM methodologies in terms of fair use, there is an important distinction to make between the inputs for AI training and generative AI’s outputs. The overall fair use of generative AI outputs cannot always be predicted in advance: The mechanics of generative AI models’ operations suggest that there are limited instances in which generative AI outputs could indeed be substantially similar to (and potentially infringing of) the underlying works used for training; such substantial similarity is typically possible only when a training corpus is rife with numerous copies of the same work. A recent case filed by the New York Times addresses this potential similarity problem with generative AI outputs. 

Yet, training inputs should not be conflated with outputs: The training of AI models by using copyright-protected inputs falls squarely within what courts have already determined in TDM cases to be a transformative fair use. This is especially true when that AI training is conducted for non-profit educational or research purposes, as this bolsters its status under Fair Use Factor 1, which considers both transformativeness and whether the act is undertaken for non-profit educational purposes. 

Were a court to suddenly determine that training AI was not fair use, and AI training was subsequently permitted only on “safe” materials (like public domain works or works for which training permission has been granted via license), this would curtail freedom of inquiry, exacerbate bias in the nature of research questions able to be studied and the methodologies available to study them, and amplify the views of an unrepresentative set of creators given the limited types of materials available with which to conduct the studies.

The second precept we uphold is that scholars’ ability to access the underlying content to conduct fair use AI training should be preserved with no opt-outs from the perspective of copyright regulation. 

The fair use provision of the Copyright Act does not afford copyright owners a right to opt out of allowing other people to use their works in any other circumstance, for good reason: If content creators were able to opt out of fair use, little content would be available freely to build upon. Uniquely allowing fair use opt-outs only in the context of AI training would be a particular threat for research and education, because fair use in these contexts is already becoming an out-of-reach luxury even for the wealthiest institutions. What do we mean?

In the U.S., the prospect of “contractual override” means that, although fair use is statutorily provided for, private parties like publishers may “contract around” fair use by requiring libraries to negotiate for otherwise lawful activities (such as conducting TDM or training AI for research). Academic libraries are forced to pay significant sums each year to try to preserve fair use rights for campus scholars through the database and electronic content license agreements that they sign. This override landscape is particularly detrimental for TDM research methodologies, because TDM research often requires use of massive datasets with works from many publishers, including copyright owners who cannot be identified or who are unwilling to grant such licenses. 

So, if the Copyright Office or Congress were to enable rightsholders to opt out of having their works fairly used for training AI for scholarship, then academic institutions and scholars would face even greater hurdles in licensing content for research. Rightsholders might opt out of allowing their works to be used for AI training fair uses, and then turn around and charge AI usage fees to scholars (or libraries)—essentially licensing back fair uses for research. 

Fundamentally, this undermines lawmakers’ public interest goals: It creates a risk of rent-seeking or anti-competitive behavior through which a rightsholder can demand additional remuneration or withhold granting licenses for activities generally seen as being good for public knowledge or that rely on exceptions like fair use. And from a practical perspective, allowing opt-outs from fair uses would impede scholarship by or for research teams who lack grant or institutional funds to cover these additional licensing expenses; penalize research in or about underfunded disciplines or geographical regions; and result in bias as to the topics and regions that can be studied. 

“Fair use” does not mean “unregulated” 

Although training AI for non-profit scholarly uses is fair use from a copyright perspective, we are not suggesting AI training should be unregulated. To the contrary, we support guardrails because training AI can carry risk. For example, researchers have been able to use generative AI like ChatGPT to solicit personal information by bypassing platform safeguards.

To address issues of privacy, ethics, and the right of publicity (which governs uses of people’s voices, images, and personas), we support the adoption of best practices, private ordering, and other regulation. 

For instance, as to best practices, scholar Matthew Sag has suggested preliminary guidelines to avoid violations of privacy and the right of publicity. First, he recommends that AI platforms avoid training their large language models on duplicates of the same work. This would reduce the likelihood that the models could produce copyright-infringing outputs (due to memorization concerns), and it would also lessen the likelihood that content containing potentially private or sensitive information would surface in outputs after being fed into the training process multiple times. Second, Sag suggests that AI platforms engage in “reinforcement learning through human feedback” when training large language models. This practice could cut down on privacy or right-of-publicity concerns by involving human feedback at the point of training, instead of relying on filtering at the output stage. 
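Sag's first recommendation—avoiding training on duplicates—has a straightforward engineering analogue. The sketch below (with a hypothetical corpus and a simple normalization step of our own devising) drops exact duplicates by content hash before the texts reach any training pipeline; production deduplication systems also catch near-duplicates via techniques like shingling or MinHash, which this sketch does not attempt:

```python
import hashlib

def dedupe_corpus(documents):
    """Keep one copy of each distinct text, matching on a normalized content hash.

    Exact-duplicate removal only; near-duplicate detection is out of scope
    for this sketch.
    """
    seen = set()
    unique = []
    for doc in documents:
        # Normalize whitespace and case so trivial variants hash identically.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# Hypothetical corpus containing a repeated work.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The  quick brown fox jumps over the lazy dog.",  # duplicate, extra space
    "An entirely different document.",
]
deduped = dedupe_corpus(corpus)
print(len(deduped))  # 2 documents survive
```

The point of the sketch is that the memorization risk Sag identifies is addressable at data-preparation time, well before any model sees the content.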

Private ordering would rely on platforms or communities to implement appropriate policies governing privacy issues, rights of publicity, and ethical concerns. For example, the UC Berkeley Library has created policies and practices (called “Responsible Access Workflows”) to help it make decisions about whether—and how—special collection materials may be digitized and made available online. Our Responsible Access Workflows require review of collection materials across copyright, contracts, privacy, and ethics parameters. Through careful policy development, the Library applies an ethics-of-care approach when making collection content that raises ethical concerns available online. Even when content is not shared openly online, that does not mean it is unavailable to researchers in person; we have simply decided not to make it available in digital formats with lower friction for use. We aim to provide transparent information about our decision-making, and researchers must make informed decisions about how to use the collections, whether or not they do so in service of AI.

And finally, concerning regulations, jurisdictions like the EU have recently introduced an AI training framework that requires, among other things, the disclosure of source content, and that gives content creators the right to opt out of having their works included in training sets, except when the AI training is being done for research purposes by research organizations, cultural heritage institutions, and their members or scholars. United States agencies could consider implementing similar regulations here. 

But from a copyright perspective, and within non-profit academic research, fair use in AI training should be preserved without the opportunity to opt out, for the reasons we discuss above. Such an approach to copyright would also be consistent with the distinction the EU has drawn for AI training in academic settings, as the EU’s Digital Single Market Directive treats AI training outside the context of scholarly research differently.

While we favor regulation that preserves fair use, it is also important to note that merely preserving fair use rights for training AI in scholarly contexts is not the end of the story in protecting scholarly inquiry. So long as the United States permits contractual override of fair uses, libraries and researchers will continue to be at the mercy of publishers aggregating and controlling what may be done with the scholarly record, even if authors dedicate their content to the public domain or apply a Creative Commons license to it. So in our view, the real work to be done is pursuing legislative or regulatory arrangements like those in the approximately 40 other countries that have curtailed the ability of contracts to abrogate fair use and other copyright limitations and exceptions within non-profit scholarly and educational uses. This is a challenging, but important, mission.

Licensing guidance in the meantime 

While the statutory, regulatory, and private governance landscapes are being addressed, libraries and scholars need ways to preserve usage rights for content when training AI as part of their TDM research methodologies. We have developed sample license language intended to address rightsholders’ key concerns while maintaining scholars’ ability to train AI in text and data mining research. We drafted this language to be incorporated into amendments to existing licenses that fail to address TDM, or into stand-alone TDM and AI licenses; however, it is easily adaptable into agreements-in-chief (and we encourage you to adapt it). 

We are certain our terms can continue to be improved upon over time or be tailored for specific research needs as methodologies and AI uses change. But in the meantime, we think they are an important step in the right direction.

With that in mind, it is important to understand that in contracts applying U.S. law, more specific language controls over general language. So even if a license agreement contains a clause preserving fair use, if that clause is later followed by a TDM clause restricting how TDM can be conducted (and whether AI can be used), the more specific language governs TDM and AI usage under the agreement. This means that libraries and scholars must be mindful when negotiating TDM and AI clauses, as they may be contracting away rights they would otherwise have had under fair use. 

So, how can a library or scholar negotiate sufficient AI usage rights while acknowledging the concerns of publishers? We believe publishers have attempted to curb AI usage because they are concerned about: (1) the security of their licensed products, and the fear that researchers will leak or release content from behind their paywall; and (2) AI being used to create a competing product that could substitute for the original licensed product and undermine their share of the market. While these concerns are valid, they reflect longstanding fears about users’ potential misuse of licensed materials in which the users do not hold copyright. But publishers are already able to—and do—impose contractual provisions disallowing the creation of derivative products and the systematic sharing of licensed content with third parties; additionally banning the use of AI is, in our opinion, unwarranted.

We developed our sample licensing language to precisely address these concerns by specifying in the grant of license that research results may be used and shared with others in the course of a user’s academic or non-profit research “except to the extent that doing so would substantially reproduce or redistribute the original Licensed Materials, or create a product for use by third parties that would substitute for the Licensed Materials.” Our language also imposes reasonable security protections in the research and storage process to quell fears of content leakage. 

Perhaps most importantly, our sample licensing language preserves the right to conduct TDM using “machine learning” and “other automated techniques” by expressly including these phrases in the definition for TDM, thereby reserving AI training rights (including as such AI training methodologies evolve), provided that no competing product or release of the underlying materials is made. 

The licensing road ahead

As legislation and standards around AI continue to develop, we hope to see express contractual allowance for AI training become the norm in academic licensing. Though our licensing language will likely need to adapt to and evolve with policy changes and research or technological advancements over time, we hope the sample language can now assist other institutions in their negotiations, and help set a licensing precedent so that publishers understand the importance of allowing AI training in non-profit research contexts. While a different legislative and regulatory approach may be appropriate in the commercial context, we believe that academic research licenses should preserve the right to incorporate AI, especially without additional costs being passed to subscribing institutions or individual users, as a fundamental element of ensuring a diverse and innovative scholarly record.

Authors Alliance Submits Amicus Brief to the Second Circuit in Hachette Books v. Internet Archive

Posted December 21, 2023
Photo by Dylan Dehnert on Unsplash

We are thrilled to announce that we’ve submitted an amicus brief to the Second Circuit Court of Appeals in Hachette Books v. Internet Archive—the case about whether controlled digital lending is a fair use—in support of the Internet Archive. Authored by Authors Alliance Senior Staff Attorney, Rachel Brooke, the brief reprises many of the arguments we made in our amicus brief in the district court proceedings and elaborates on why and how the lower court got it wrong, and why the case matters for our members and other authors who write to be read.

The Case

We’ve been writing about this case for years—since the complaint was first filed back in 2020. But to recap: a group of trade publishers sued the Internet Archive in federal court in the Southern District of New York over (among other things) the legality of its controlled digital lending (CDL) program. The publishers argued that the practice infringed their copyrights, and Internet Archive defended its project on the grounds that it was fair use. We submitted an amicus brief in support of IA and CDL (which we have long supported as a fair use) to the district court, explaining that copyright is about protecting authors, and that many authors strongly support CDL.

The case finally went to oral argument before a judge in March of this year. Unfortunately, the judge ruled against Internet Archive, finding that each of the fair use factors favored the publishers. Internet Archive indicated that it planned to appeal, and we announced that we planned to support them in those efforts. Now, the case is before the Second Circuit Court of Appeals. After Internet Archive filed its opening brief last week, we (and other amici) filed our briefs in support of a reversal of the lower court’s decision.

Our Brief

Our amicus brief argues, in essence, that the district court judge failed to adequately consider the interests of authors. While the commercial publishers in the case did not support CDL, those publishers’ interests do not always align with authors’, and they certainly do not speak for all authors. We conducted outreach to authors, including launching a CDL survey, and uncovered a diversity of views on CDL—most of them extremely positive. We offered up these authors’ perspectives to show the court that many authors do support CDL, contrary to the representations of the publishers. Since copyright is about incentivizing new creation for the benefit of the public and protecting author interests, we felt these views were important for the Second Circuit to hear. 

We also sought to explain how the district court judge got it wrong when it comes to fair use. One of the key findings in the lower court decision was that loans of CDL scans are direct substitutes for loans of licensed ebooks. We explained that this is not the case: a CDL scan is not the same thing as an ebook; the two look different and have different functions and features. And CDL scans can be resources for authors conducting research in key ways that licensed ebooks cannot. Out-of-print books and older editions of books, for example, are often available as CDL scans but not as licensed ebooks.

Another issue from the district court opinion that we addressed was the judge’s finding that IA’s use of the works in question was “commercial.” We strongly disagreed with this conclusion: borrowing a CDL scan from IA’s Open Library is free, and the organization—itself a nonprofit—actually bears substantial expenses related to digitization. Moreover, the publishers had failed to establish any concrete financial harm they suffered as a result of IA’s CDL program. We discussed a recent D.C. Circuit case, ASTM v. PRO, to further push back on the district court’s conclusion on commerciality. 

You can read our brief for yourself here, or find it embedded at the bottom of this post. In the new year, you can expect another post or two with more details about our amicus brief and the other amicus briefs that have been, or soon will be, submitted in this case.

What’s Next?

Earlier this week, the publishers proposed that they file their own brief on March 15, 2024—91 days after Internet Archive filed its opening brief. The court’s rules stipulate that any amici supporting the publishers file their briefs within seven days of the publishers’ filing. Then, the parties can decide to submit reply briefs and will notify the court of their intent to do so. Finally, the parties can request oral argument, though the court may still decide the case “on submission,” i.e., without oral argument. If the case does proceed to oral argument, a three-judge panel will hear from attorneys for each side before rendering its decision. We expect the briefing process to extend into mid-2024, and it can take quite a while after that for an appeals court to hand down its decision. We’ll keep our readers apprised of any updates as the case moves forward.

Authors-Alliance-Second-Circuit-Amicus-Brief_Filed

Authors Alliance Releases New Legal Guide to Writing About Real People

Posted December 5, 2023

We are delighted to announce the publication of our brand new guide, the Authors Alliance Guide to Writing About Real People, a legal guide for authors writing nonfiction works about real people. The guide was written by students in two clinical teams at the UC Berkeley Samuelson Law and Public Policy Clinic—Lily Baggott, Jameson Davis, Tommy Ferdon, Alex Harvey, Emma Lee, and Daniel Todd—as well as clinical supervisors Jennifer Urban and Gabrielle Daley, along with Authors Alliance’s Senior Staff Attorney, Rachel Brooke. The guide was edited by Executive Director Dave Hansen and former Executive Director Brianna Schofield. This long list of names is a testament to the fact that it took a village to create this guide, and we are so excited to finally share it with our members, allies, and any and all authors who need it. You can read and download our guide here.

On Thursday, we are hosting a webinar about our guide, where Authors Alliance staff will share more about what went into producing it, those who partnered with us or supported the guide, and the particulars of the guide’s contents. Sign up here!

The Writing About Real People guide covers several different legal issues that can arise for authors writing about real people in nonfiction books like memoirs, biographies, and other narrative nonfiction projects. The issues it addresses are “causes of action” (or legal theories someone might sue under) based on state law. The requirements and considerations involved vary from state to state, so the guide highlights trends and commonalities among states. Throughout the guide, we emphasize that even though these causes of action might sound scary, the First Amendment to the U.S. Constitution in most cases empowers authors to write freely about topics of their choosing. The causes of action in this guide are exceptions to that rule, and each of them is limited in its reach and scope by the First Amendment’s guarantees. 

False Statements and Portrayals

The first section in the Writing About Real People guide concerns false statements and portrayals. This encompasses two different causes of action: defamation and false light. 

You have probably heard of defamation: it’s one of the most common causes of action related to writing about a real person. Defamation occurs when someone makes a false statement about another person that injures that person’s reputation, and the statement is made with some degree of “fault.” The level of fault required turns on what kind of person the statement is made about. For public people—people with some renown or governmental authority—the speaker must act with “actual malice,” meaning knowledge that the statement was false or reckless disregard as to whether it was true. But for private people, the speaker need only be negligent as to whether the statement was true, meaning that the speaker failed to take an ordinary amount of care in verifying the statement’s veracity. An author might expose themselves to defamation liability if they write something untrue about another person that is held out as factual in their published work, that statement injures the person’s reputation, and the author failed to take the requisite level of care to ensure that the statement was factual. 

False light is similar to defamation; indeed, the two are so similar that many states do not recognize false light as a separate cause of action. Where defamation concerns false statements represented as factual, false light concerns false portrayals: it can occur when a speaker creates a misleading impression about a subject through implication or omission, for example. Like defamation, false light requires fault on the part of the speaker, and the public person/private person standards are the same as for defamation. 

Invasions of Privacy

The second section in the Writing About Real People guide concerns invasions of privacy, or violations of a person’s rights to privacy. This covers two related causes of action: intrusion on seclusion and public disclosure of private facts. 

Intrusion on seclusion occurs when someone intentionally intrudes on another’s private place or affairs in a way that is highly offensive—judged from the perspective of an ordinary, reasonable person. For authors, intrusion on seclusion can arise when an author uses invasive research or information-gathering methods. This could include things like entering someone’s home without permission or digging through personal information like health or banking records without permission. Unlike the other causes of action in this guide, intrusion on seclusion is typically an issue during the research and writing stages of an author’s process, not when the work is actually published.

Public disclosure of private facts occurs when someone makes private facts about a person public, when that disclosure is highly offensive and made with some degree of fault, and when the information disclosed doesn’t relate to a matter of public concern. Essentially, public disclosure of private facts liability exists to address situations where a speaker shares highly private information about a person that the public has no interest in knowing about, and the subject suffers as a result. Like defamation and false light, the level of fault required for a speaker to be liable depends on whether the subject is a public or private person, and these levels are the same as for defamation (actual malice for public people, and negligence for private people). This means that authors have much more leeway to share private information about public people than private people. And the “public concern” piece provides even more protection for speech about public people. 

Right of Publicity and Identity Rights

The third section in the Writing About Real People Guide concerns the right of publicity and unauthorized use of identity. Violations of the right of publicity, or unauthorized uses of identity, can occur when someone uses another person’s identity in a way that is “exploitative” and derives a benefit from that use. Importantly for authors, this excludes merely writing about someone in a book, article, or other piece of writing. The right of publicity is mostly concerned with commercial uses, like using someone’s name or likeness to sell a product without permission, but it can also apply to non-commercial uses that are exploitative, like using someone’s identity to generate attention for a work. In most cases, the right of publicity involves uses of someone’s image or likeness rather than just evoking their identity in text, but this is not necessarily the case. This section might be informative for authors who want to use someone’s image on their book cover or evoke an identity in advertising, but most authors merely writing nonfiction text about a real person do not have to worry too much about the right of publicity. 

Practical Guidance

A final section in our guide covers practical guidance for authors on how to avoid legal liability for the causes of action discussed in the guide in ways that are simple to understand and implement. Using reliable research methods and sources, obtaining consent from subjects where that is practicable, and carefully documenting your research and sources can go a long way towards helping you avoid legal liability while still empowering you to write freely.