Tag Archives: Artificial Intelligence

Artificial Intelligence, Authorship, and the Public Interest

Posted January 9, 2025
Photo by Robert Anasch on Unsplash

Today, we’re pleased to announce a new project generously supported by the John S. and James L. Knight Foundation. The project, “Artificial Intelligence, Authorship, and the Public Interest,” aims to identify, clarify, and offer answers to some of the most challenging copyright questions posed by artificial intelligence (AI) and explain how this new technology can best advance knowledge and serve the public interest.

Artificial intelligence has dominated public conversation about the future of authorship and creativity for several years. Questions abound about how this technology will affect creators’ incentives and influence readership, and about what it might mean for future research and learning.

At the heart of these questions is copyright law. Over two dozen class-action copyright lawsuits have been filed between November 2022 and today against companies such as Microsoft, Google, OpenAI, Meta, and others. Additionally, congressional leadership, state legislatures, and regulatory agencies have held dozens of hearings to reconcile existing intellectual property law with artificial intelligence. As one of the primary legal mechanisms for promoting the “progress of science and useful arts,” copyright law plays a critical role in creating, producing, and disseminating information.

We are convinced that how policymakers shape copyright law in response to AI will have a lasting impact on whether and how the law supports democratic values and serves the common good. That is why Authors Alliance has already devoted considerable effort to these issues, and this project will allow us to expand those efforts at this critical moment. 

AI Legal Fellow
As part of the project, we’re pleased to add an AI Legal Fellow to our team. The position requires a law degree and demonstrated interest in and experience with artificial intelligence, intellectual property, and legal technology issues. We’re particularly interested in someone with a demonstrated interest in how copyright law can serve the public interest. This role will require significant research and writing. Pay is $90,000/yr, and it is a two-year term position. Read more about the position here. We’ll begin reviewing applications immediately and conduct interviews on a rolling basis until the position is filled.

As we get going, we’ll have much more to say about this project. We will have some funds available to support research subgrants, organize several workshops and symposia, and offer numerous opportunities for public engagement. 

About the John S. and James L. Knight Foundation
We are social investors who support democracy by funding free expression and journalism, arts and culture in community, research in areas of media and democracy, and in the success of American cities and towns where the Knight brothers once had newspapers. Learn more at kf.org and follow @knightfdn on social media.

Restricting Innovation: How Publisher Contracts Undermine Scholarly AI Research

Posted December 6, 2024
Photo by Josh Appel on Unsplash

This post is by Rachael Samberg, Director, Scholarly Communication & Information Policy, UC Berkeley Library and Dave Hansen, Executive Director, Authors Alliance

This post is about the research and the advancement of science and knowledge made impossible when publishers use contracts to limit researchers’ ability to use AI tools with scholarly works. 

Within the scholarly publishing community, mixed messages pervade about who gets to say when and how AI tools can be used for research reliant on scholarly works like journal articles or books. Some scholars voiced concern (explained more here) when major scholarly publishers like Wiley or Taylor & Francis entered lucrative contracts with big technology companies to allow for AI training without first seeking permission from authors. We suspect that these publishers have the legal right to do so since most publishers demand that authors hand over extensive rights in exchange for publishing their work. And with the backdrop of dozens of pending AI copyright lawsuits, who can blame the AI companies for paying for licenses, if for no other reason than avoiding the pain of litigation? While it stings to see the same large commercial, academic publishers profit yet again off of the work academic authors submit to them for free, we continue to think there are good ways for authors to retain a say in the matter. 

 Big tech companies are one thing, but what about scholarly research? What about the large and growing number of scholars who are themselves using scholarly copyrighted content with AI tools to conduct their research? We currently face a situation in which publishers are attempting to dictate how and when researchers can do that work, even when authors’ fair use rights to use and derive new understandings from scholarship clearly allow for such uses. 

How vendor contracts disadvantage US researchers

We have written elsewhere (in an explainer and public comment to the Copyright Office) why training AI tools, particularly in the scholarly and research context, constitutes a fair use under U.S. copyright law. Critical for the advancement of knowledge, training AI is based on a statutory right already held by all scholarly authors engaging in computational research and one that lawmakers should preserve.

The problem U.S. scholarly authors presently face with AI training is that publishers restrict their access to these statutory rights through contracts that override them: In the United States, publishers can use private contracts to take away statutory fair use rights that researchers would otherwise hold under Federal law. In this case, the private contracts at issue are the electronic resource (e-resource) license agreements that academic research libraries sign to secure campus access to electronic journal, e-book, data, and other content that scholars need for their computational research.

Contractual override of fair use is a problem that disparately disadvantages U.S. researchers. As we have described elsewhere, more than forty countries, including those in the European Union, expressly reserve text mining and AI training rights for scientific research by research institutions. Not only do scholars in these countries not have to worry whether their computational research with AI is permitted, but they also do not risk having those reserved rights overridden by contract. The European Union’s Directive on Copyright in the Digital Single Market and recent AI Act nullify any attempt to circumscribe the text and data mining and AI training rights reserved for scientific research within research organizations. U.S. scholars are not as fortunate.

In the U.S., most institutional e-resource licenses are negotiated and managed by research libraries, so it is imperative that scholars work closely with their libraries and advocate to preserve their computational research and AI training rights within the e-resource license agreements that universities sign. To that end, we have developed adaptable licensing language to support institutions in doing that nationwide. But while this language is helpful, the onus of advocacy and negotiation for those rights in the contracting process remains. Personally, we have found it helpful to explain to publishers that they must consent to these terms in the European Union, and can do so in the U.S. as well. That, combined with strong faculty and administrative support (such as at the University of California), makes for a strong stance against curtailment of these rights.

But we think there are additional practical ways for libraries to illustrate—both to publishers and scholarly authors—exactly what would happen to the advancement of knowledge if publishers’ licensing efforts to curtail AI training were successful. One way to do that is by “unpacking” or decoding a publisher’s proposed licensing restriction, and then demonstrating the impact that provision would have on research projects that were never objectionable to publishers before, and should not be now. We’ll take that approach below.

Decoding a publisher restriction

A commercial publisher recently proposed the following clause in an e-resource agreement:

Customer [the university] and its Authorized Users [the scholars] may not:

  1. directly or indirectly develop, train, program, improve, and/or enrich any artificial intelligence tool (“AI Tool”) accessible to anyone other than Customer and its Authorized Users, whether developed internally or provided by a third party; or
  2. reproduce or redistribute the Content to any third-party AI Tool, except to the extent limited portions of the Content are used solely for research and academic purposes (including to train an algorithm) and where the third-party AI Tool (a) is used locally in a self-hosted environment or closed hosted environment solely for use by Customer or Authorized Users; (b) is not trained or fine-tuned using the Content or any part thereof; and (c) does not share the Content or any part thereof with a third party.  

What does this mean?

  • The first paragraph forbids the training or improving of any AI tool if it’s accessible or released to third parties. It further forbids any computational outputs or analysis derived from the licensed content from being used to train any tool available to third parties.
  • The second paragraph is perhaps even more concerning. It provides that when using third-party AI tools of any kind, a scholar can use only limited portions of the licensed content with the tools, and is prohibited from doing any training at all of third-party tools, even if the tool is non-generative and the scholar is performing the work in a completely closed and highly secure research environment.

What would the impact of such a restrictive licensing provision be on research? 

It would mean that every single one of the trained tools in the following projects could never be disseminated. In addition, for the projects below that used third-party AI tools, the research would have been prohibited full-stop because the third-party tools in those projects required training which the publisher above is attempting to prevent:

Tools that could not be disseminated

  1. In 2017, chemists created and trained a generative AI tool on 12,000 published research papers regarding synthesis conditions for metal oxides, so that the tool could identify anticipated chemical outputs and reactions for any given set of synthesis conditions entered into the tool. The generative tool they created is not capable of reproducing or redistributing any licensed content from the papers; it has merely learned conditions and outcomes and can predict chemical reactions based on those conditions and outcomes. And this beneficial tool would be prohibited from dissemination under the publisher’s terms identified above.
  2. In 2018, researchers trained an AI tool (that they had originally created in 2014) to understand whether a character is “masculine” or “feminine” by looking at the tacit assumptions expressed in words associated with that character. That tool can then look at other texts and identify masculine or feminine characters based on what it knows from having been trained before. The implications are that scholars can therefore use texts from different time periods with the tool to study representations of masculinity and femininity over time. No licensed content, no licensed or copyrighted books from a publisher can ever be released to the world by sharing the trained tool; the trained tool is merely capable of topic modeling—but the publisher’s above language would prohibit its dissemination nevertheless. 

Tools that could neither be trained nor disseminated 

  1. In 2019, authors used text from millions of books published over 100 years to analyze cultural meaning. They did this by training third-party non-generative AI word-embedding models called Word2Vec and GloVe on multiple textual archives. The tools cannot reproduce content: when shown new text, they merely represent words as numbers, or vectors, to evaluate or predict how similar words in a given space are semantically or linguistically. The similarity of words can reveal cultural shifts in understanding of socioeconomic factors like class over time. But the publisher’s above licensing terms would prohibit the training of the tools to begin with, much less the sharing of them to support further or different inquiry.
  2. In 2023, scholars trained a third-party-created open-source natural language processing (NLP) tool called Chemical Data Extractor (CDE). Among other things, CDE can be used to extract chemical information and properties identified in scholarly papers. In this case, the scholars wanted to teach CDE to parse a specific type of chemical information: metal-organic frameworks, or MoFs. Generally speaking, the CDE tool works by breaking sentences into “tokens” like parts of speech and referenced chemicals. By correlating tokens, one can determine that a particular chemical compound has certain synthetic properties, topologies, reactions with solvents, etc. The scholars trained CDE specifically to parse MoF names, synthesis methods, inorganic precursors, and more—and then exported the results into an open source database that identifies the MoF properties for each compound. Anyone can now use both the trained CDE tool and the database of MoF properties to ask different chemical property questions or identify additional MoF production pathways—thereby improving materials science for all. Neither the CDE tool nor the MoF database reproduces or contains the underlying scholarly papers that the tool learned from. Yet, neither the training of this third-party CDE tool nor its dissemination would be permitted under the publisher’s restrictive licensing language cited above.
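To make concrete what it means for a tool to "represent words as numbers, or vectors," here is a minimal Python sketch. It is not Word2Vec or GloVe: it swaps in simple co-occurrence counts as a simplified stand-in, and the three-sentence corpus, the function names, and the window size are all hypothetical choices made for illustration. The point it shows is that the resulting "model" stores only counts of neighboring words, not the sentences it was built from.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it.

    This is a toy stand-in for a learned word embedding, not Word2Vec or GloVe.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical miniature corpus, invented for this example.
corpus = [
    "the wealthy family owned a large estate",
    "the rich family owned a large estate",
    "the poor family rented a small room",
]
vectors = cooccurrence_vectors(corpus)

# Words used in similar contexts end up with similar vectors.
print(cosine_similarity(vectors["wealthy"], vectors["rich"]))
print(cosine_similarity(vectors["wealthy"], vectors["poor"]))
```

In this toy corpus, "wealthy" and "rich" score as more similar than "wealthy" and "poor" because they appear among the same neighboring words. Real embedding models learn dense vectors from far larger corpora, but the kind of output is the same: numeric relationships between words rather than copies of the source text.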

Indeed, there are hundreds of AI tools that scholars have trained and disseminated—tools that do not reproduce licensed content—and that scholars have created or fine-tuned to extract chemical information, recognize faces, decode conversations, infer character types, and so much more. Restrictive licensing language like that shown above suppresses research inquiries and societal benefits that these tools make possible. It may also disproportionately affect the advancement of knowledge in or about developing countries, which may lack the resources to secure licenses or be forced to rely on open-source or poorly-coded public data—hindering journalism, language translation, and language preservation.

Protecting access to facts

Why are some publishers doing this? Perhaps to reserve the opportunity to develop and license their own scholarship-trained AI tools, which they could then license at additional cost back to research institutions. We could speculate about motivations, but the upshot is that publishers have been pushing hard to foreclose scholars from training and disseminating AI tools that now “know” something based on the licensed content. That is, such publishers wish to prevent tools from learning facts about the licensed content.

However, this is precisely the purpose of licensing content. When institutions license content for their scholars to read, they are doing so for the scholars to learn information from the content. When scholars write about it or teach about the content, they are not regenerating the actual expression from the content—the part that is protected by copyright; rather the scholars are conveying the lessons learned from the content—facts not protected by copyright. Prohibiting the training of AI tools and the dissemination of those tools is functionally equivalent to prohibiting scholars from learning anything about the content that institutions are licensing for that very purpose, and that scholars have written to begin with! Publishers should not be able to monopolize the dissemination of information learned from scholarly content, and especially when that information is used non-commercially.

For these reasons, when we negotiate to preserve AI usage and training rights, we generally try to achieve the following outcomes which would promote—rather than prohibit—all of the research projects described above:

The sample language we’ve disseminated empowers others to negotiate for these outcomes. We hope that, when coupled with the advocacy tools we’ve provided above, scholars and libraries can protect their AI usage and training rights, while also being equipped to consider how they want their own works to be used.

Developing a public-interest training commons of books

Posted December 5, 2024
Photo by Zetong Li on Unsplash

Authors Alliance is pleased to announce a new project, supported by the Mellon Foundation, to develop an actionable plan for a public-interest book training commons for artificial intelligence. Northeastern University Library will be supporting this project and helping to coordinate its progress.

Access to books will play an essential role in how artificial intelligence develops. AI’s large language models (LLMs) have a voracious appetite for text, and there are good reasons to think that these data sets should include books and lots of them. Over the last 500 years, human authors have written over 129 million books. These volumes, preserved for future generations in some of our most treasured research libraries, are perhaps the best and most sophisticated reflection of all human thinking. Their high editorial quality, breadth, and diversity of content, as well as the unique way they employ long-form narratives to communicate sophisticated and nuanced arguments and ideas, make them ideal training data sources for AI.

These collections and the text embedded in them should be made available under ethical and fair rules as the raw material that will enable the computationally intense analysis needed to inform new AI models, algorithms, and applications imagined by a wide range of organizations and individuals for the benefit of humanity. 

Currently, AI development is dominated by a handful of companies that, in their rush to beat other competitors, have paid insufficient attention to the diversity of their inputs, questions of truth and bias in their outputs, and questions about social good and access. Authors Alliance, Northeastern University Library, and our partners seek to correct this tilt through the swift development of a counterbalancing project that will focus on AI development that builds upon the wealth of knowledge in nonprofit libraries and that will be structured to consider the views of all stakeholders, including authors, publishers, researchers, technologists, and stewards of collections. 

The main goal of this project is to develop a plan for either establishing a new organization or identifying the relevant criteria for an existing organization (or partnership of organizations) to take on the work of creating and stewarding a large-scale public interest training commons of books.

We seek to answer several key questions, such as: 

  • What are the right goals and mission for such an effort, taking into account both the long and short term;
  • What are the technical and logistical challenges that might differ from existing library-led efforts to provide access to collections as data;
  • How can we develop a sufficiently large and diverse corpus to offer a reasonable alternative to existing sources;
  • What should a public-interest governance structure look like, given the particular challenges of AI development;
  • How do we, as a collective of stakeholders from authors and publishers to students, scholars, and libraries, sustainably fund such a commons, including a model for long-term sustainability for maintenance, transformation, and growth of the corpus over time;
  • Which combination of legal pathways is acceptable to ensure books are lawfully acquired in a way that minimizes legal challenges;
  • How can we respect the interests of authors and rightsholders by accounting for concerns about consent, credit, and compensation; and
  • How can we distinguish between the different needs and responsibilities of nonprofit researchers, small market entrants, and large commercial actors.

The project will include two meetings during 2025 to discuss these questions and possible ways forward, additional research and conversations with stakeholders, and the development and release of an ambitious yet achievable roadmap.

Book Talk: The Line: AI and the Future of Personhood

Join us for a book talk with legal scholar JAMES BOYLE, discussing his book THE LINE: AI and the Future of Personhood, in conversation with KATE DARLING of the MIT Media Lab.

November 19th at 10am PT / 1pm ET
REGISTER NOW!

Chatbots like ChatGPT have challenged human exceptionalism: we are no longer the only beings capable of generating language and ideas fluently. But is ChatGPT conscious? Or is it merely engaging in sophisticated mimicry? And what happens in the future if the claims to consciousness are more credible? In The Line, James Boyle explores what these changes might do to our concept of personhood, to “the line” we believe separates our species from the rest of the world, but also separates “persons” with legal rights from objects.

The personhood wars—over the rights of corporations, animals, over the question of when life begins and ends—have always been contentious. We’ve even denied the personhood of members of our own species. How will those old fights affect the new ones, and vice versa? Boyle pursues those questions across a dizzying array of fields. He discusses moral philosophy and science fiction, transgenic species, nonhuman animals, the surprising history of corporate personality, and AI itself. Engaging with empathy and anthropomorphism, courtroom battles on behalf of chimps, and doom-laden projections about the threat of AI, The Line offers fascinating and thoughtful answers to questions about our future that are arriving sooner than we think.

You can find a free, open access copy of The Line here, and you can purchase a copy directly from MIT Press here.

ABOUT OUR SPEAKERS

JAMES BOYLE is the William Neal Reynolds Professor of Law at Duke Law School, founder of the Center for the Study of the Public Domain, and former Chair of Creative Commons. He is the author of The Public Domain and Shamans, Software, and Spleens, the coauthor of two comic books, and the winner of the Electronic Frontier Foundation’s Pioneer Award for his work on digital civil liberties.

DR. KATE DARLING is a Research Scientist at the MIT Media Lab and leads the Robotics Ethics & Society research team at the Boston Dynamics AI Institute. She is the author of The New Breed: What Our History with Animals Reveals about Our Future with Robots. Kate’s work explores the intersections of robotic technology and society, with a particular interest in legal, economic, social, and ethical issues.


Copyright Management Information, 1202(b), and AI

Posted October 30, 2024

This post is by Maria Crusey, a third-year law student at Washington University in St. Louis. Maria has been working with Authors Alliance this semester on a project exploring legal claims in the now 30+ pending copyright AI lawsuits. 

In the recent spate of copyright infringement lawsuits against AI developers, many plaintiffs allege that the developers violated 17 U.S.C. § 1202(b) when they used copyrighted works to train and develop AI systems.

Section 1202(b) prohibits the “removal or alteration of copyright management information.” Compared to related provisions in 17 U.S.C. § 1201, which protects against circumvention of copyright protection systems, §1202(b) has seldom been litigated at the appellate level, and there’s a growing divide among district courts about whether §1202(b) should apply to derivative works, particularly those created using AI technology.

At first glance, §1202(b) appears to be a straightforward provision. However, the uptick in §1202(b) claims raises some challenging questions, namely: How does §1202(b) apply to the use of a copyrighted work as part of a dataset that must be cleaned, restructured, and processed in ways that separate copyright management information from the content itself? And how should §1202(b) apply to AI systems that may reproduce small portions of content contained in training data? Answers to these questions may have serious implications in the AI suits because violations of §1202(b) can come with hefty statutory damages awards: between $2,500 and $25,000 for each violation. Spread across millions of works, the damages could be staggering. How the courts resolve this issue could also affect many other reuses of copyrighted works, from analogous uses such as text data mining research to much more routine redistribution of copyrighted works in other contexts.
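To give a sense of scale, the arithmetic can be sketched in a few lines of Python. The per-violation bounds are the $2,500 and $25,000 figures cited above. The count of one million violations is a purely hypothetical assumption, as is the simplification that each work counts as exactly one violation. How violations would actually be counted is a legal question this sketch does not answer.

```python
# Per-violation statutory damages bounds for section 1202 violations, as cited above.
MIN_PER_VIOLATION = 2_500
MAX_PER_VIOLATION = 25_000


def statutory_damages_range(violations):
    """Return the (low, high) total in dollars for a given number of violations."""
    if violations < 0:
        raise ValueError("violations must be non-negative")
    return violations * MIN_PER_VIOLATION, violations * MAX_PER_VIOLATION


# Hypothetical: one million works, counted as one violation each.
low, high = statutory_damages_range(1_000_000)
print(f"${low:,} to ${high:,}")  # prints "$2,500,000,000 to $25,000,000,000"
```

Under those assumptions, the exposure runs from $2.5 billion to $25 billion.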

In one of these AI cases, the Ninth Circuit Court of Appeals has been asked to accept an interlocutory appeal on just this issue, and we are waiting to see whether the court will accept it.

For an introduction to §1202(b) and observations on this question, among others, read on:

What is § 1202(b) and what is it intended to do?

Broadly, 17 U.S.C. § 1202 is a provision of the Digital Millennium Copyright Act (DMCA) that protects the integrity of copyright management information (“CMI”). Per §1202(c), CMI comprises certain information identifying a copyrighted work, often including the title, the name of the author, and terms and conditions for the use of a work.

Section 1202(b) forbids the alteration or removal of copyright management information. The section provides that:

“[n]o person shall, without the authority of the copyright owner or the law – 

(1) intentionally remove or alter any CMI,

(2) distribute or import for distribution CMI knowing that the CMI has been removed or altered without authority of the copyright owner or the law, or 

(3) distribute, import for distribution, or publicly perform works, copies of works or phonorecords, knowing that copyright management information has been removed or altered without authority of the copyright owner or the law, knowing, or with respect to civil remedies under section 1203, having reasonable grounds to know that it will induce, enable, facilitate, or conceal an infringement of any right under this title.”

17 U.S.C. § 1202(b).

In enacting §1202(b), Congress primarily aimed to limit conduct that assists or enables copyright infringement. This purpose is evident in the legislative history of the provision. In an address to a congressional subcommittee prior to the adoption of the DMCA, the then–Register of Copyrights, Marybeth Peters, discussed the aims of §1202(b). First, Peters noted that the requirements of §1202(b) would make CMI more reliable and thus aid in the administrability of copyright law. Second, Peters stated that §1202(b) would help prevent instances of copyright infringement that could come from the removal of CMI. The idea is that if a copyrighted work lacks CMI, there is a greater likelihood of infringement since others may use the work under the pretense that they are the author or copyright holder. In creating a statutory violation for a party’s removal of CMI, regardless of later infringing activity, §1202(b) functions as damage control against potential copyright infringement.

What are the essential elements of a § 1202(b) claim?

To have a claim under §1202(b), a plaintiff must allege particularized facts about the existence and alteration or removal of CMI. Additionally, some courts require a plaintiff to demonstrate that the defendant had knowledge that the CMI was being altered or removed and that the alteration or removal would enable copyright infringement. Finally, some courts have required plaintiffs to show that the work with the altered or removed CMI is an exact copy of the original work–what has become known as the “identicality” requirement. This last “identicality” requirement is one of the main issues in the AI lawsuits raising §1202(b) and is detailed further below.

→ The “Identicality” Requirement

Courts that have imposed “identicality” have required that plaintiffs demonstrate that the work with the removed CMI is an exact copy of the original work and thus is “identical,” except for the missing or altered CMI. 

Suppose, for example, a photographer owns the copyright to a photograph they took. The photographer adds CMI to the photograph and takes care to protect the integrity of the work as it is dispersed online. A third party captures the photograph posted on a website by taking a screenshot and removes the CMI from the copied image while keeping all other aspects of the original photograph the same. The screenshot with the removed CMI is an “exact copy” of the original photograph because the only difference between the copyrighted photograph and the screenshot is the removal of the CMI.
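The difference between an "exact copy" and a mere excerpt can be sketched in Python. This is only an illustration of the concept and not a legal test. It uses plain text rather than a photograph, and the function names, the sample work, and the line-by-line treatment of CMI are all hypothetical assumptions made for the example.

```python
def strip_cmi(text, cmi_lines):
    """Return `text` with every line that matches a known CMI line removed."""
    cmi = {line.strip() for line in cmi_lines}
    kept = [line for line in text.splitlines() if line.strip() not in cmi]
    return "\n".join(kept)


def classify_copy(original, copy, cmi_lines):
    """Label `copy` relative to `original`, ignoring CMI in both.

    "identical": the same work once CMI is set aside.
    "excerpt":   a contiguous portion of the work.
    "different": neither of the above.
    """
    body = strip_cmi(original, cmi_lines)
    candidate = strip_cmi(copy, cmi_lines)
    if candidate == body:
        return "identical"
    if candidate and candidate in body:
        return "excerpt"
    return "different"


# Hypothetical work with two lines of CMI at the top.
cmi_lines = ["# Copyright 2020 Jane Doe", "# License: MIT"]
original = (
    "# Copyright 2020 Jane Doe\n"
    "# License: MIT\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "def sub(a, b):\n"
    "    return a - b"
)
full_copy = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b"
snippet = "def add(a, b):\n    return a + b"

print(classify_copy(original, full_copy, cmi_lines))  # identical
print(classify_copy(original, snippet, cmi_lines))    # excerpt
```

A court applying the identicality requirement would, roughly speaking, recognize a claim only in the first situation. Whether the second situation should also support a claim is the question on which courts disagree.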

Federal courts are divided on whether to impose the identicality requirement for §1202(b) claims, and the circuit courts have not yet addressed the issue. Notably, district courts within the Ninth Circuit have varied in their treatment of the identicality requirement. For example, the court for the District of Nevada in Oracle v. Rimini Street declined to impose the identicality requirement because the requirement may weaken the intended protections for copyright holders under §1202(b). Conversely, in Kirk Kara Corp. v. W. Stone & Metal Corp., a court in the Central District of California applied the identicality requirement, though it provided little explanation for why it adopted it. Application of the identicality requirement is also unsettled in district courts beyond the Ninth Circuit (see, for example, this Southern District of Texas case discussing at length the identicality requirement and rejecting it).

What are the §1202(b) claims at issue in the present suits?

The claims in Doe 1 v. GitHub exemplify the §1202(b) issues common among the present suits, and it is the GitHub suit in which the Ninth Circuit Court of Appeals has been asked to hear an interlocutory appeal.

In GitHub, owners of copyrights in software code brought a suit against GitHub, a software development platform. The plaintiffs alleged that GitHub Copilot, an AI product developed in part by GitHub, illegally removed CMI from their works. The plaintiffs stored their software in GitHub’s publicly accessible software repositories under open-source license agreements. The plaintiffs claimed that GitHub removed CMI from their code and trained the Copilot AI model on the code in violation of the license agreements. Moreover, the plaintiffs claimed that, when prompted to generate software code, Copilot includes unique aspects of the plaintiffs’ code in its outputs. In their complaint, the plaintiffs alleged that all requirements for a valid § 1202(b) claim were met in the present suit. The plaintiffs stressed that, in removing CMI, the defendants failed to prevent users of the product from making infringing use of it. Consequently, they claim, the defendants removed the CMI, knowing that it would “induce, enable, facilitate, and/or conceal infringement” of copyrights in violation of the DMCA.

Regarding the §1202(b) claims, the parties contest the application of the identicality requirement. The plaintiffs first argue that § 1202 contains no such requirement: “The plain language of DMCA § 1202 makes it a violation to remove or alter CMI. It does not require that the output work be original or identical to obtain relief. . . By a plain reading of the statute, there is no need for a copy to be identical—there only needs to be copying, which Plaintiffs have amply alleged.” 

As a backstop, the plaintiffs further argue that Copilot does produce “near-identical reproduction[s]” of their copyrighted code and allege this is sufficient to fulfill the identicality requirement under §1202(b). Specifically, plaintiffs claimed that Copilot generates parts of plaintiffs’ code in extra lines of output code that are not relevant to input prompts. Plaintiffs also claimed Copilot generates their code in output code that produces errors due to a mismatch between the directly copied code and the code that would actually fit the prompt. To make this assertion work, plaintiffs distinguish their version of “identicality” –semantically equivalent lines of code–from a reproduction of the whole work. They argue that the defendant’s position, that “the reproduction of short passages that may be part of [a] larger work, rather than the reproduction of an entire work, is insufficient to violate Section 1202,” would lead to absurd results. “By OpenAI’s logic, a party could copy and distribute a fragment of a copyrighted work—say, a chapter of a book, a stanza of a poem, or a scene from a movie—and face no repercussions for infringement.” 

In their reply, the defendants countered that §1202, which defines CMI as relating to a “copy of a work,” requires a complete and identical copy, not just snippets. Defendants noted that the plaintiffs have conceded that Copilot reproduces only snippets of code rather than complete versions of the code. Therefore, the defendants argue, Copilot does not create “identical copies” of the plaintiffs’ complete copyrighted works. The argument rests on the text of the statute (they note that the statute only provides for liability when distributing copies from which CMI has been stripped, not derivatives, abridgments, or other adaptations). The defendants bolster that textual argument by suggesting that allowing §1202 claims for incomplete copies would create chaos for ordinary uses of copyrighted works: “On Plaintiffs’ reading of § 1202, if someone opened an anthology of poetry and typed up a modified version of a single “stanza of a poem,” . . . without including the anthology’s copyright page, a § 1202(b) claim would lie. Plaintiffs’ reading effectively concedes that they are attempting to turn every garden-variety claim of copyright infringement into a DMCA claim, only without the usual limitations and defenses applicable under copyright law. Congress intended no such thing.” 

The GitHub court has now addressed the issue several times: it initially dismissed the plaintiffs’ §1202(b)(1) and (b)(3) claims, subsequently denied the plaintiffs’ motion for reconsideration of the claims, allowed the plaintiffs to amend their complaint and try again with more specificity, then dismissed the claims again. The reasoning of the court has been consistent, and largely focused on insufficient allegations of identicality. The court agreed with Defendants that the identicality requirement should apply and that the snippets do not satisfy the requirement. Following the dismissal, the plaintiffs sought and received permission from the district court to file an interlocutory appeal (an appeal on a specific issue before the case is fully resolved, something not usually allowed) to the Court of Appeals for the Ninth Circuit to determine whether §1202(b)(1) and (b)(3) impose an identicality requirement. The Ninth Circuit is presently considering whether to hear the appeal.

What would the Ninth Circuit assess in the appeal, and what are the implications of the appeal for future lawsuits?

If the appeal is accepted, the Ninth Circuit will determine whether §1202(b)(1) and (b)(3) actually impose an identicality requirement. Moreover, with regard to the facts of the GitHub case, the court will decide whether the identicality requirement demands exact copying of a complete copyrighted work, or perhaps something less. The Ninth Circuit’s hearing of this appeal would be notable for a number of reasons.

First, as mentioned above, §1202(b) is largely unaddressed by the circuit courts, and explicit appellate guidance has only been provided for the knowledge requirement referenced above. Consequently, determinations of §1202(b) claims are largely informed by varying district court decisions that are binding only on the parties to the suits and provide inconsistent interpretations of the requirements for a claim under the provision. An appellate ruling that accepts or rejects the identicality requirement would create additional binding authority to further clarify courts’ interpretations of §1202(b).

Second, a ruling on the identicality requirement from the Ninth Circuit specifically would be notable because it would be binding on the large number of §1202(b) claims presently being litigated in the Ninth Circuit’s lower courts. And, given the centrality of AI developers operating in California and elsewhere in the Ninth Circuit, the outcome of the appeal would significantly impact future lawsuits that involve §1202(b) claims.

It is hard to predict how the Ninth Circuit might rule, but we can work through some of the implications of the choices the court would have before it: 

If the Ninth Circuit interprets the identicality requirement as requiring a complete and exact copy, it would impose a high standard for the requirement and plaintiffs would likely be constrained in their ability to bring §1202(b) claims. If the court did this, the GitHub plaintiffs’ claims would likely fail, as the allegedly copied snippets of code generated by Copilot are not exact copies and do not comprise the complete copyrighted works. This hypothetical standard would be advantageous for individuals who remove CMI from copyrighted works in the course of processing them using AI as well as those who deploy AI systems that produce small portions of content similar (but not exactly so) to inputs. So long as the works being processed or distributed are not complete, exact copies, individuals would be free to alter the CMI of the works for ease in analyzing the copyrighted information. 

Alternatively, the Ninth Circuit could adopt a loose interpretation of identicality in which incomplete and inexact copying would be sufficient. One approach would be to require identicality but not copying of the entire work (something the plaintiffs in the GitHub suit advocate for). How the parties or the Ninth Circuit would formulate this “less than entire” but still “near identical” standard is hard to say, but presumably, plaintiffs would have an easier time alleging facts sufficient for a §1202(b) claim. Applied to GitHub, it still seems unclear whether the copied snippets of the plaintiffs’ code in the Copilot outputs could pass muster (this is likely a factual question to be determined at later stages of the litigation). But it could allow claims to at least survive an early motion to dismiss. As such, the adoption of this standard could limit how AI developers engage with works but also potentially affect others, such as researchers using similar techniques to process, clean, and distribute small portions of copyrighted works as part of a dataset.

Finally, the Ninth Circuit may decide to do away with the identicality requirement altogether. While this may seem like a potential boon to plaintiffs, who could allege that removal of CMI and distribution of some copied material, no matter how small, is enough to state a claim, plaintiffs would still face substantial challenges. Elimination of the identicality requirement would likely lead to greater weight being placed on the knowledge requirement in courts’ assessments of §1202(b) claims, which requires that defendants know or have reasonable grounds to know that their actions will “induce, enable, facilitate, or conceal an infringement.” In the context of the GitHub case, even without an identicality requirement, the plaintiffs’ §1202(b) claims contain scant factual allegations about the defendants’ CMI removal and knowledge in the court filings to date. For other developers and users of AI, the effects of not having an identicality requirement would likely vary on a case-by-case basis. 

Conclusion

Recent copyright infringement suits and the pending appeal to the Ninth Circuit in Doe 1 v. GitHub demonstrate that §1202(b) is having its day in the sun. Although the provision has been overlooked and infrequently litigated in the past, the scope of protections granted by §1202(b) is important for understanding whether and how AI developers can remove CMI when processing, restructuring, and analyzing copyrighted works for AI development. Thus, as lawsuits against AI developers and users continue to progress, the requirements for a valid §1202(b) claim are sure to become even more contentious.

Who Represents You in the AI Copyright Lawsuits? 

Posted October 16, 2024

Sarah Silverman is the author of The Bedwetter, a comedy memoir. Richard Kadrey wrote Sandman Slim, a fantasy novel series. Christopher Golden wrote a supernatural thriller titled Ararat. 

These authors might not seem to have much in common with an academic author who writes in history, physics, or chemistry. Or a journalist. Or a poet. Or, for that matter, me, writing this blog post.  And yet, these authors may end up representing us all in court. 

A large number of the recent AI copyright lawsuits are class action lawsuits. This means that these lawsuits are brought by a small number of plaintiffs who (subject to judicial approval) are granted the right to represent a much larger class. In many of the AI copyright lawsuits, the proposed classes are extraordinarily broad, including many creators who might be surprised that they are being represented. If you live in the US and wrote something that was published online, there is a good chance that you are included in several of these classes. 

A very brief background on class action lawsuits

Class actions can be an efficient way of resolving disputes that involve lots of people, allowing for a single resolution that binds many parties when there are common interests and facts. As you can imagine, the class action mechanism can also attract misuse, for example, by plaintiffs (and their attorneys) who may seek large settlements on behalf of a large number of people. Those settlements may benefit the named plaintiffs and their attorneys but they aren’t really aligned with the interests of most class members. 

There are rules in place to prevent that kind of abuse. In federal courts (where all copyright lawsuits must be brought), Rule 23 of the Federal Rules of Civil Procedure governs. It provides that:

“One or more members of a class may sue or be sued as representative parties on behalf of all members only if: 
(1) the class is so numerous that joinder of all members is impracticable; [“numerosity”]
(2) there are questions of law or fact common to the class; [“commonality”]
(3) the claims or defenses of the representative parties are typical of the claims or defenses of the class; [“typicality”] and
(4) the representative parties will fairly and adequately protect the interests of the class. [“adequacy”]”

The rest of Rule 23 contains a number of other safeguards to protect both class members and defendants. Among them are requirements that the court certify that the class complies with Rule 23, that any proposed settlements be approved by the court, and that class members receive notice of any proposed settlement and an opportunity to object. Additionally, there are a number of rules to ensure that the law firm bringing the suit can fairly and competently represent the class members. 

Class definition and class representatives in the copyright AI lawsuits

We believe it’s important for creators to pay attention to these suits because if a class is certified and that class includes those creators, the class representatives will have meaningful legal authority to speak on their behalf.  

Rule 23 provides that “at an early practicable time after a person sues,” the court must decide whether to certify the proposed class. Though we are now well over a year into some of the earliest suits filed, this has yet to happen. In the meantime, what we have are proposed class definitions offered by plaintiffs. How broadly or narrowly a class is defined by the plaintiffs will be one of the most important factors in whether the class can be certified, since it will directly affect the commonality of facts among the class, the typicality of claims, and whether the representatives can fairly and adequately represent the interests of the class. Plaintiffs have the burden of proving that they have satisfied Rule 23. 

In these AI lawsuits, we see some themes in terms of class representatives and proposed classes, with many offering very broad class definitions. For example, in the now-consolidated In re OpenAI ChatGPT Litigation, the class representatives are 11 fiction writers of books such as The Cabin at the End of the World, The Brief Wondrous Life of Oscar Wao, What the Dead Know, and others. 

They propose to represent a class defined as follows:  

“All persons or entities domiciled in the United States that own a United States copyright in any work that was used as training data for the OpenAI Language Models during the Class Period [defined as June 28, 2020 to the present].” 

This kind of broad “anyone with a copyright in a work used for training” approach to class definition is repeated in a few other suits. For example, the consolidated Kadrey v. Meta lawsuit has a similar (and overlapping) grouping of fiction author class representatives and an almost identical proposed class definition. Dubus v. NVIDIA is another suit that takes essentially the same approach. 

Other AI lawsuits have more variation in class representatives. Huckabee v. Bloomberg, for example, is another suit with a similar class definition (basically, all copyrighted works owned by someone in the US and used for training Bloomberg’s LLM) but with class representatives that are a bit different: mostly authors of religious books and of course, Mike Huckabee, a politician. 

There is at least one class action that is more precise both in terms of proposed class representatives and their relation to the proposed class definition. The now-consolidated Authors Guild v. OpenAI suit has some 28 proposed class representatives, most of whom are authors of best-selling fiction and non-fiction trade books, 14 of whom are members of the Authors Guild. In this suit, the plaintiffs propose two classes: one for fiction authors and one for non-fiction authors. The complaint also places some restrictions around them: class members for fiction works must be “natural persons” who are “sole authors of, and legal or beneficial owners of Eligible Copyrights in” fictional works that were registered with the U.S. Copyright Office and used for training the defendants’ LLMs (and this includes persons who are beneficiaries of works held by literary estates). For nonfiction authors, class members are “[a]ll natural persons, literary trusts, and literary estates in the United States who are legal or beneficial owners of Eligible Nonfiction Copyrights,” which the complaint defines as works that were used to train defendants’ LLMs and that have an ISBN, with the exception of any books classified as reference works (BISAC code REF). 

Some challenges and dangers

When you consider the scale and scope of materials used to train the AI models in question, you can immediately see some of the challenges that are likely to arise with relatively small groups of authors attempting to represent practically all individual U.S. copyright owners. 

While the exact training materials used for the models at issue remain opaque, it’s definitely true that they were not just trained on modern fiction. There is widespread acknowledgment that these models are trained on a large amount of content scraped from across the internet using data sources such as Common Crawl. This, in effect, means that these suits implicate the rights of millions of rights holders, with interests as diverse as those of YouTube content creators, computer programmers, novelists, academics, and more. 

How these representatives can fairly and adequately represent such a broad and diverse group, especially when many may disagree with the underlying motivations for the suit to begin with, is a tough question. Even the Authors Guild consolidated case, which is much more careful in terms of class definition, includes classes that are breathtakingly broad when one considers the diversity of authorship within them. The fiction author class, for example, could include everyone from NY Times bestselling authors to fan fiction writers. The nonfiction class, which is at least limited to nonfiction book authors of works assigned an ISBN, could similarly include everyone from authors of popular self-help books distributed by the millions to authors of scholarly books with print runs in the low hundreds and distributed online on open-access terms. The interests, financial and otherwise, of those authors can vary significantly. 

Beyond the adequacy of representatives (along with questions about whether their experiences are really typical of others in the proposed class), there are other challenges unique to copyright law. One is the opaque nature of ownership (there is no official public record of who owns what), which makes ascertaining who actually falls within the class an initial challenge. Compounding that, there is a dizzying variety of unique terms under which works are distributed online, some of which may afford AI developers a viable defense for many works. A fair use defense also requires some level of assessment of the nature of the works used, a fact-intensive inquiry that will vary from one work to another. This just scratches the surface of some of the issues that likely mean there really aren’t common questions of law or fact among the class. 

Conclusion

There are good reasons to think that the classes as currently defined in these lawsuits are too broad. For some of the reasons mentioned above, I think it will be difficult for courts to certify them as is. But this doesn’t mean authors and other rightsholders should sit back and assume that their interests won’t be co-opted by others in these suits who seek to represent them. We don’t know when the courts will actually address these class certification issues in these suits. When they do, it will be important for authors to speak up. 

What is “Derivative Work” in the Digital Age?

Posted October 7, 2024
on the top, Seltzer v. Green Day; on the bottom, Kienitz v. Sconnie Nation

Part I: The Problem with “Derivative Work”

The right to prepare derivative works is one of the exclusive rights copyright holders have under §106 of the Copyright Act. Copyright holders’ other exclusive rights include the right to make and distribute copies, and to display or perform a work publicly. 

Lately, we’ve seen a congeries of novel conceptions about “derivative works.” For example, a reader of our blog stated that when looking at AI models and AI outputs, works should be considered infringing “derivatives” even when there is no substantial similarity between the infringing AI model/outputs and the ingested originals. Even in the courts, we’ve seen confusion. For example, Hachette v. Internet Archive presented us with the following statement about derivative works:

Changing the medium of a work is a derivative use rather than a transformative one. . . . In fact, we have characterized this exact use―“the recasting of a novel as an e-book”―as a “paradigmatic” example of a derivative work. [citation omitted; emphasis added]

These statements leave one to wonder—what is a copy, a derivative work, an infringing use, and a transformative fair use in the context of U.S. copyright law? In order to have some clarity on these questions, it’s helpful to juxtapose “derivative works” first with “copies” and then with “transformative uses.” We think the confusion about derivative work and its related concepts arises out of using the phrase to mean “a work that is substantially similar to the original work” as well as “a work that is so in an unauthorized way, not excused from liabilities.”

Confusion over the meaning of “derivative work” has many immediate real-world implications. In privately negotiated agreements, licensees who have a right to make reproductions but not derivative works may be confused as to what medium their use is restricted to. For example, a publisher of a book with a license that allows it to make reproductions but not derivatives might be confused as to whether, under the Hachette court’s reasoning, it is allowed to republish a print book in a digital format such as a simple PDF of a scan. Similarly, for public licenses, such as the CC ND licenses, where a licensor restricts the creation of derivative works, downstream users may be unsure whether, say, converting a PDF into a Word document is allowed. 

This is also an important topic to explore in the recent hot debates over Controlled Digital Lending and generative artificial intelligence, as well as in an author’s everyday work. For instance, would quoting someone else’s work make your article/book a derivative work of the original? 

Part II: “Copies” and “Derivatives”

Our basic understanding of derivative works comes from the 1976 Copyright Act. The §101 definition tells us:

A “derivative work” is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a “derivative work”.

Circular 14, published by the U.S. Copyright Office, gives some further helpful guidance as to what a §106 derivative work would look like:

To be copyrightable, a derivative work must incorporate some or all of a preexisting “work” and add new original copyrightable authorship to that work. The derivative work right is often referred to as the adaptation right. The following are examples of the many different types of derivative works: 

  • A motion picture based on a play or novel 
  • A translation of a novel written in English into another language
  • A revision of a previously published book 
  • A sculpture based on a drawing 
  • A drawing based on a photograph 
  • A lithograph based on a painting 
  • A drama about John Doe based on the letters and journal entries of John Doe 
  • A musical arrangement of a preexisting musical work 
  • A new version of an existing computer program 
  • An adaptation of a dramatic work 
  • A revision of a website

One immediate observation from reading these is that “ebook” or “digitized version of a work” is neither listed among, nor similar to, any of the exemplary derivative works in the Copyright Act or the Copyright Office Circular. By contrast, “ebook” or “digitized version of a work” seems to fit much better under the § 101 definition of “copies”:

“Copies” are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. The term “copies” includes the material object, other than a phonorecord, in which the work is first fixed.

The most crucial difference between a “copy” and a “derivative work” is whether new authorship is added. If no new authorship is added, merely changing the material that the work is fixed on does not create a new copyrightable derivative work. This, in fact, was observed by many courts before Hachette. For example, in Bridgeman Art Library v. Corel Corp., the court unequivocally held that there is no new copyright granted to photos of public domain paintings. 

Additionally, as we know from Feist v. Rural Tel., “[t]he mere fact that a work is copyrighted does not mean that every element of the work may be protected.” Copyright protection is limited to the original elements of a work. We cannot call a work “derivative” of another if it does not incorporate any copyrightable elements from the original copyrighted work. For example, the court found that the “Game Genie” device, which let players change elements of a Nintendo game, was not a derivative work because it didn’t incorporate any part of the Nintendo game. 

It is clear from this examination that sometimes a later-created work is a copy, sometimes a derivative, and sometimes it may not implicate any of the exclusive rights of the original.

Part III: “Derivative” and “Transformative” Works

Let’s quickly recap the context in which courts are confusing “derivative” and “transformative” works—

A prima facie case of copyright infringement requires the copyright holder to prove (1) ownership of a valid copyright, and (2) inappropriate copying of original elements. We will not go into more details here, but essentially, the inappropriate copying prong requires plaintiffs to assert and prove defendant’s access to the plaintiff’s work as well as a level of similarity between the works in question that shows improper appropriation of the plaintiff’s work. If the similarity between the defendant’s work and protectable elements in the plaintiff’s work is minimal, then there is no infringement. As seen in the “Game Genie” example above, courts can rely on substantial similarity analysis to determine whether a work is indeed a potentially-infringing copy or derivative of the plaintiff’s work.

Once the plaintiff establishes a prima facie infringement case—e.g., the defendant’s work is shown to be a derivative or a copy of the plaintiff’s registered work—the defendant may still nevertheless be free to make the use if the use falls outside the ambit of the copyright holder’s §106 rights, such as uses that are fair use. Whether a work is a derivative work under § 106 is no longer a relevant inquiry after establishing a prima facie case: this point is starkly obvious when looking at the many plausible defenses a defendant can raise (including fair use) where even the verbatim copying of a work is authorized by law. 

As the court stated in Authors Guild v. Hathitrust, “there are important limits to an author’s rights to control original and derivative works. One such limit is the doctrine of ‘fair use,’ which allows the public to draw upon copyrighted materials without the permission of the copyright holder in certain circumstances.” When a prima facie infringement case is already established, yet a court still discusses whether the defendant’s work is a “derivative work,” at a minimum, the court adds confusion by going beyond the § 101 definition of a derivative work. 

In fact, a distinct new significance is being given to “derivative work” in recent years in the context of the “purpose and character” factor of fair use, specifically, when analyzing if a use has a transformative purpose. The shift in a word’s meaning or a concept is not per se unimaginable or objectionable. It is misguided to consider the copyright legal landscape static. As law professor Pamela Samuelson pointed out, before the mid-19th century, most courts did not even think copyright holders were entitled to demand compensation from others preparing derivative works. The 1976 Copyright Act finally codified copyright holders’ exclusive right to prepare derivative works. And, now, some rights holders want the courts to say there are categorical derivative uses that can never be considered fair use.

The Hachette court is among those that have unfortunately bought into this novel approach. The court seems not only to misconstrue the salient distinction between a ‘copy of a work’ and a ‘derivative work’; it also appears to give heightened protections to works it now defines as ‘derivative’. If this misconception becomes widespread, we will be living in a world where if a use is new-derivative, then it is never transformative (and, if it is not transformative, it is likely not fair). Ultimately, it is purely circular for a court to say that the reason for denying the fair use defense is that the use is derivative. When we buy into this setup of “derivative vs. transformative,” it is difficult to ever say with confidence that a work is transformative, because a transformative use will often also fit the actual § 101 definition of a derivative work, just like the Green Day rendition of the plaintiff’s art in Seltzer v. Green Day. 

Clearly, if we take “derivative work” at its true § 101 definition, out of all potentially infringing works, “transformative fair use” is not an absolute complement, but a possible subset, of derivative works. We know from Campbell v. Acuff-Rose that “transformativeness is a matter of degree, not a binary;” whereas no such sliding scale is plausible for derivative works. A work is either a derivative or it is not: there’s never a “somewhat derivative” work in copyright. All in all, it makes little sense to frame the issues as “transformative vs. derivative work,” because such discussions inevitably buy into the rhetoric of copyright expansionists. We have already warned the court in Warhol against the danger of speaking heedlessly about derivative works in the context of fair use. We must ensure that the “derivative vs. transformative” dichotomy does not come to dominate future discussions of fair use, so that we conserve the utility and clarity of the fair use doctrine.

The expansion of the relevance of “derivative work” beyond the establishment of a prima facie infringement case not only creates circular reasoning for denying fair use, but also makes it impossible to make sense of the case law we have accumulated on fair use. Take Seltzer v. Green Day, for example: the court held that a work can be transformative even if that work “makes few physical changes to the original.” The Green Day concert background art with a red cross superimposed was found to be a fair use of the original street art, a classic example of how a prima facie infringing derivative work can nevertheless be a transformative, and thus fair, use. Similarly, in Kienitz v. Sconnie Nation, a derivative use of a photo on a T-shirt was found to be a fair use. Ideas and concepts, including “derivative works,” are only important to the extent they elucidate our understanding of the world. When the use of “derivative works” leads to more confusion than clarity, we should be cautious in adopting the new meaning being superimposed on “derivative works.”

The AI Copyright Hype: Legal Claims That Didn’t Hold Up

Posted September 3, 2024

Over the past year, two dozen AI-related lawsuits and their myriad infringement claims have been winding their way through the court system. None have yet reached a jury trial. While we all anxiously await court rulings that can inform our future interaction with generative AI models, in the past few weeks we have suddenly been flooded with news reports bearing titles such as “US Artists Score Victory in Landmark AI Copyright Case,” “Artists Land a Win in Class Action Lawsuit Against A.I. Companies,” “Artists Score Major Win in Copyright Case Against AI Art Generators,” and the list goes on. The exuberant mood in these headlines mirrors the enthusiasm of people actually involved in this particular case (Andersen v. Stability AI). The plaintiffs’ lawyer calls the court’s decision “a significant step forward for the case.” “We won BIG,” writes the plaintiff on X.

In this blog post, we’ll explore the reality behind these headlines and statements. The “BIG” win in fact describes a portion of the plaintiffs’ claims surviving a pretrial motion to dismiss. If you are already familiar with the motion to dismiss per Federal Rules of Civil Procedure Rule 12(b)(6), please refer to Part II to find out what types of claims have been dismissed early on in the AI lawsuits. 

Part I: What is a motion to dismiss?

In the AI lawsuits filed over the last year, the majority of the plaintiffs’ claims have struggled to survive pretrial motions to dismiss. That may lead one to believe that claims made by plaintiffs are scrutinized harshly at this stage. But that is far from the truth. In fact, when looking at the broader legal landscape beyond the AI lawsuits, Rule 12(b)(6) motions are rarely successful.

In order to survive a Rule 12(b)(6) motion to dismiss filed by AI companies, plaintiffs in these lawsuits must make “plausible” claims in their complaint. At this stage, the court will assume that all of the factual allegations made by the plaintiffs are true and interpret everything in a way most favorable to plaintiffs. This allows the court to focus on the key legal questions without getting caught up in disputes about facts. When courts look at plaintiffs’ factual claims in the best possible light, if the defendant AI companies’ liability can plausibly be inferred based on facts stated by plaintiffs, then the claims will survive a motion to dismiss. Notably, the most important issues at the core of these AI lawsuits—namely, whether there has been direct copyright infringement and what may count as a fair use—are rarely decided at this stage, because these claims raise questions about facts as well as the law. 

On the other hand, if the AI companies will prevail as a matter of law even when the plaintiffs’ well-pleaded claims are taken as entirely true, then the plaintiffs’ claims will be dismissed by the court. Merely stating that it is possible that the AI companies have done something unlawful, for instance, will not survive a motion to dismiss; there must be some reasonable expectation that evidence can be found later during discovery to support the plaintiffs’ claims. 

Procedurally, when a claim is dismissed, the court will often allow the plaintiffs to amend their complaint. That is exactly what happened with Andersen v. Stability AI (the case mentioned at the beginning of this blog post): the plaintiffs’ claims were first dismissed in October last year, and the court allowed the plaintiffs to amend their complaint to address the deficiencies in their allegations. The newly amended complaint contains infringement claims that survived new motions to dismiss, as well as other breach of contract, unjust enrichment, and DMCA claims that again were dismissed.

As you may have guessed, including something like the “motion to dismiss” in our court system can help save time and money, so parties don’t waste precious resources on meritless claims at trial. One judge dismissed a case against OpenAI earlier this year, stating that “the plaintiffs need to understand that they are in a court of law, not a town hall meeting.” The takeaway: plaintiffs need to bring claims that can plausibly entitle them to relief.

Part II: What claims have been dismissed so far?

Most of the AI lawsuits are still at an early stage, and most of the court rulings we have seen so far are in response to the defendants’ motions to dismiss. From these rulings, we have learned which claims are viewed as meritless by courts. 

The removal of copyright management information (“CMI,” which includes information such as the title, the copyright holder, and other identifying information in a copyright notice) is a claim included in almost all plaintiffs’ complaints in the AI lawsuits, and this claim has, without exception, failed to survive motions to dismiss. DMCA Section 1202(b) restricts the intentional, unauthorized removal of CMI. Experts initially considered DMCA 1202(b) one of the biggest hurdles for non-licensed AI training. But courts so far have dismissed all DMCA 1202(b) claims, including in J. Doe 1 v. GitHub, Tremblay v. OpenAI, Andersen v. Stability AI, Kadrey v. Meta Platforms, and Silverman v. OpenAI. The plaintiffs’ DMCA Section 1202(b)(1) claims have failed because plaintiffs were not able to offer any evidence showing their CMI has been intentionally removed by the AI companies. For example, in Tremblay v. OpenAI and Silverman v. OpenAI, the courts held that the plaintiffs did not argue plausibly that OpenAI has intentionally removed CMI when ingesting plaintiffs’ works for training. Additionally, plaintiffs’ DMCA Section 1202(b)(3) claims have failed thus far because they did not fulfill the identicality requirement. For example, in J. Doe 1 v. GitHub, the court pointed out that Copilot’s output did not tend to represent verbatim copies of the original ingested code. We now see plaintiffs voluntarily dropping the DMCA claims in their amended complaints, such as in Leovy v. Google (formerly J.L. v. Alphabet). 

Another claim that has been consistently dismissed by courts is that AI models are infringing derivative works of the training materials. The law defines a derivative work as “a work based upon one or more preexisting works, such as a translation, musical arrangement, … art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted.” To most of us, the idea that the model itself (as opposed to, say, outputs generated by the model) can be considered a derivative work seems to be a stretch. The courts have so far agreed. On November 20, 2023, the court in Kadrey v. Meta Platforms said it is “nonsensical” to consider an AI model a derivative work of a book just because the book is used for training. 

Similarly, claims that all AI outputs should be automatically considered infringing derivative works have been dismissed by courts, because the claims cannot point to specific evidence that an instance of output is substantially similar to an ingested work. In Andersen v. Stability AI, plaintiffs tried to argue “that all elements of [] Anderson’s copyrighted works [] were copied wholesale as Training Images and therefore the Output Images are necessarily derivative;” the court dismissed the argument because, besides the fact that plaintiffs are unlikely to be able to show substantial similarity, “it is simply not plausible that every Training Image used to train Stable Diffusion was copyrighted [] or that all [] Output Images rely upon (theoretically) copyrighted Training Images and therefore all Output images are derivative images. … [The argument for dismissing these claims is strong] especially in light of plaintiffs’ admission that Output Images are unlikely to look like the Training Images.”

Several of these AI cases have raised claims of vicarious liability (that is, liability for the service provider based on the actions of others, such as users of the AI models). Because a vicarious infringement claim must be based on a showing of direct infringement, the vicarious infringement claims were also dismissed in Tremblay v. OpenAI and Silverman v. OpenAI, where plaintiffs could not point to any infringing similarity between AI output and the ingested books.

Many plaintiffs have also raised a number of non-copyright, state law claims (such as negligence or unfair competition) that have largely been dismissed based on copyright preemption. Copyright preemption prevents duplicative state law claims when those state law claims are based on an exercise of rights that are equivalent to those provided for under the federal Copyright Act. In Andersen v. Stability AI, for example, the court dismissed the plaintiffs’ unjust enrichment claim because the plaintiffs failed to add any new elements that would distinguish their claim based on California’s Unfair Competition Law or common law from rights under the Copyright Act.

It is interesting to note that many of the dismissed claims in different AI lawsuits closely mimic one another, such as in Kadrey v. Meta Platforms, Andersen v. Stability AI, Tremblay v. OpenAI, and Silverman v. OpenAI. It turns out that the similarities are no coincidence: all these lawsuits were filed by the same law firm. These mass-produced complaints not only contain overbroad claims that are prone to dismissal, they also have overbroad class designations. In the next blog post, we will delve deeper into the class action aspect of the AI lawsuits. 

Clickbait arguments in AI Lawsuits (will number 3 shock you?)

Posted August 15, 2024

Image generated by Canva

The booming AI industry has sparked heated debates over what AI developers are legally allowed to do. So far, we have learned from the US Copyright Office and courts that AI-created works are not protectable unless they are combined with human authorship. 

As we monitor two dozen ongoing lawsuits and regulatory efforts that address various aspects of AI’s legality, we see legitimate legal questions that must be resolved. However, we also see some prominent yet flawed arguments that have been used to inflame discussions, particularly by publisher-plaintiffs and their supporters. For now, let’s focus on some clickbait arguments that sound appealing but are fundamentally baseless. 

Will AI doom human authorship?

Based on current research, AI tools can actually help authors improve their creativity, their productivity, and the longevity of their careers.

When AI tools such as ChatGPT first appeared online, many leading authors and creators publicly endorsed them as useful tools like any other tech innovation that came before. At the same time, many others claimed that authors and creators of lesser caliber would be disproportionately disadvantaged by the advent of AI. 

This intuition-driven hypothesis, that AI will be the bane of average authors, has so far proved to be misguided.

We now know that AI tools can greatly help authors during the ideation stage, especially less creative ones. According to a study published last month, AI tools had minimal impact on the output of highly creative authors, but were able to enhance the works of less imaginative authors. 

AI can also serve as a readily-accessible editor for authors. Research shows that AI enhances the quality of routine communications. Without AI-powered tools, a less-skilled person will often struggle with the cognitive burden of managing data, which limits both the quality and quantity of their potential output. AI helps level the playing field by handling data-intensive tasks, allowing writers to focus more on making creative and other crucial decisions about their works. 

It is true that entirely AI-generated works of abysmal quality are available for purchase on some platforms. Some of these works are using human authors’ names without authorization. These AI-generated works may infringe on authors’ right of publicity, but they do not present commercially-viable alternatives to books authored by humans. Readers prefer higher-quality works produced with human supervision and intervention (provided that digital platforms do not act recklessly towards their human authors despite generating huge profits from human authors).

Are lawsuits against AI companies brought with authors’ best interest in mind? 

In the ongoing debate over AI, publishers and copyright aggregators have suggested that they have brought these lawsuits to defend the interests of human authors. Consider the New York Times, for example: in its complaint against OpenAI, the NY Times describes its operations as “a creative and deeply human endeavor (¶31)” that necessitates “investment of human capital (¶196).” The NY Times argues that OpenAI has built innovation on the stolen hard work and creative output of journalists, editors, photographers, data analysts, and others, an argument contrary to what the NY Times once argued in court in New York Times v. Tasini: that authors’ rights must take a backseat to the NY Times’ financial interests in new digital uses.

It is also hard to believe that many of the publishers and aggregators are on the side of authors when we look at how they have approached licensing deals for AI training. These licensing deals can be extremely profitable for the publishers. For example, Taylor and Francis sold AI training data to OpenAI for $10 million. John Wiley and Sons earned $23 million from a similar deal with an undisclosed tech company. Though we don’t have the details of these agreements, it seems easy to surmise that in return for the money received, the publishers will not harass the AI companies with future lawsuits. (See our previous blog post about these licensing deals and what you can do as an author.) It is ironic how an allegedly unethical and harmful practice quickly becomes acceptable once the publishers are profiting from it.

How much of the millions of dollars changing hands will go to individual authors? Limited data exist. We know that Cambridge University Press, a good-faith outlier, is offering authors 20% royalties if their work is licensed for AI training. Most publishers and aggregators are entirely opaque about how authors are to be compensated in these deals. Take the Copyright Clearance Center (CCC), for example: it offers zero information about how individual authors are consulted or compensated when their works are sold for AI training under the CCC’s AI training license.

This is by no means a new problem for authors. We know that traditionally-published book authors receive royalties of around 10% from their publishers: a little under $2 per copy for most books. On an ebook, authors receive a similar amount for each “copy” sold. This small amount handed to authors only starts to look generous when compared to academic publishing, where authors increasingly pay publishers to have their articles published in journals. The journal authors receive zero royalties, despite the publishers’ growing profits.

Even before the advent of AI technology, most authors were struggling to make a living on writing alone. According to a 2018 Authors Guild survey, the median income for full-time writers was $20,300, and for part-time writers, a mere $6,080. Fair wages and equitable profit sharing are issues that need to be settled between authors and publishers, even if publishers try to scapegoat AI companies. 

It’s worth acknowledging that it’s not just publishers and copyright industry organizations filing these lawsuits. Many of these ongoing lawsuits have been filed as class actions, with the plaintiffs claiming to represent a broad class of people who are similarly situated and (so they allege) hold similar views. Most notably, in Authors Guild v. OpenAI, Authors Guild and its named individual plaintiffs claim to represent all fiction writers in the US who have sold more than 5,000 copies of a work. There’s also another case, supported by Authors Guild, where the plaintiff claims to represent all copyright holders of non-fiction works, including authors of academic journal articles, and several others in which an individual plaintiff asserts the right to represent virtually all copyright holders of any type.

As we (along with many others) have repeatedly pointed out, many authors disagree with the publishers and aggregators’ restrictive view on fair use in these cases, and don’t want or need a self-appointed guardian to “protect” their interests.  We have seen the same over-broad class designation in the Authors Guild v. Google case, which caused many authors to object, including many of our own 200 founding members.

Respect for copyright and human authors’ hard work means no more AI training under US copyright law? 

While we wait for courts to figure out the key questions on infringement and fair use, let’s take a moment to remember what copyright law does not regulate.

Copyright law in the US exists to further the Constitutional goal to “promote the Progress of Science and useful Arts.” In 1991, the Supreme Court held in Feist v. Rural Telephone Service that copyright cannot be granted solely based on how much time or energy authors have expended. “Compensation for hard work” may be a valid ethical discussion, but it is not a relevant topic in the context of copyright law.

Publishers and aggregators preach that people must “respect copyright,” as if copyright is synonymous with the exclusive rights of the copyright holder. This is inaccurate and misleading. In order to safeguard the freedom of expression, copyright is designed to embody not only the rightsholders’ exclusive rights but also many exceptions and limitations to the rightsholders’ exclusive rights. Similarly, there’s no sound legal basis to claim that authors must have absolute control over their own work and its message. Knowledge and culture thrive because authors are permitted to build upon and reinterpret the works of others.

Does this mean I should side with the AI companies in this debate?

Many of the largest AI companies exhibit troubling traits that they have in common with many publishers, copyright aggregators, digital platforms (e.g., Twitter, TikTok, YouTube, Amazon, Netflix, etc.), and many other companies with dominant market power. There’s no transparency or oversight afforded to the authors or the public. The authors and the public have little say in how the AI models are trained, just as we have no influence over how content is moderated on digital platforms, how much in royalties authors receive from the publishers, or how much publishers and copyright aggregators can charge users. None of these crucial systemic flaws will be fixed by granting publishers a share of AI companies’ revenue. 

Copyright also is not the entire story. As we’ve seen recently, there are some significant open questions about the right of publicity and somewhat related concerns about the ability of AI to churn out digital fakes for all sorts of purposes, some of which are innocent, while others are fraudulent, misleading, or exploitative. The US Copyright Office released a report on digital replicas on July 31 addressing the question of digital publicity rights, and on the same day the NO FAKES Act was officially introduced. Will the rights of authors and the public be adequately considered in that debate? Let’s remain vigilant as we wait to see the first-ever AI-generated public figure in a leading role hit theaters in September 2024.

Some Initial Thoughts on the US Copyright Office Report on AI and Digital Replicas

Posted August 1, 2024

On July 31, 2024, the U.S. Copyright Office published Part 1 of its report summarizing the Office’s ongoing initiative on artificial intelligence. This first part of the report addresses digital replicas, that is, how AI is used to realistically but falsely portray people in digital media. The Office in its report recommends new federal legislation that would create a new right to control “digital replicas,” which it defines as “a video, image, or audio recording that has been digitally created or manipulated to realistically but falsely depict an individual.”

We remain somewhat skeptical that such a right would do much to address the most troubling abuses such as deepfakes, revenge porn, and financial fraud. But, as the report points out, a growing number of varied state legislative efforts are already in the works, making a stronger case for unifying such rules at the federal level, with an opportunity to ensure adequate protections are in place for creators.  

The backdrop for the inquiry and report is a fast-developing space of state-led legislation, including legislation with regard to deepfakes. Earlier this year, Tennessee became the first state to enact such a law, the ELVIS Act (TN HB 2091), while other states mostly focused on addressing deepfakes in the context of sexual acts and political campaigns. New state laws are continuing to be introduced, making it harder and harder to navigate the space for creators, AI companies, and consumers alike. A federal right of publicity in the context of AI has already been discussed in Congress, and just yesterday a new bill was formally introduced, titled the “NO FAKES Act.” 

Authors Alliance has watched the development of this US Copyright Office initiative closely. In August 2023, the Office issued a notice of inquiry, asking stakeholders to weigh in on a series of questions about copyright policy and generative AI. Our comment in response to the inquiry was devoted in large part to sharing the ways that authors are using generative AI, explaining how fair use should apply to training AI, and urging the USCO to be cautious in recommending new legislation to Congress.

This report and recommendation from the Copyright Office could have a meaningful impact on authors and other creators, including both those whose personality and images are subject to use with AI systems, and those who are actively using AI in their writing and research. Below are our preliminary thoughts on what the Copyright Office recommends, which it summarizes in the report as follows: 

“We recommend that Congress establish a federal right that protects all individuals during their lifetimes from the knowing distribution of unauthorized digital replicas. The right should be licensable, subject to guardrails, but not assignable, with effective remedies including monetary damages and injunctive relief. Traditional rules of secondary liability should apply, but with an appropriately conditioned safe harbor for OSPs. The law should contain explicit First Amendment accommodations. Finally, in recognition of well-developed state rights of publicity, we recommend against full preemption of state laws.”

Initial Impressions

Overall, this seems like a well-researched and thoughtful report, given that the Office had to navigate a huge number of comments and opinions (over 10,000 comments were submitted). The report also incorporates the many more recent developments that included numerous new state laws and federal legislative proposals.  

Things we like: 

  • In the context of an increasing number of state legislative efforts (some overbroad and more likely to harm creators than help them), we appreciate the Office’s recognition that a patchwork of laws can pose a real problem for users and creators who are trying to understand their legal obligations when using AI that references and implicates real people.
  • The report also recognizes that the concerns motivating digital replica laws (things like control of personality, privacy, fraud, and deception) are not at their core copyright concerns. “Copyright and digital replica rights serve different policy goals; they should not be conflated.” This matters a lot for what the scope of protection and other details for a digital replica right look like. Copy-pasting copyright’s life+70 term of protection, for example, makes little sense (and the Office recognizes this, for example, by rejecting the idea of posthumous digital replica rights). 
  • The Office also suggests limiting the transferability of rights. We think this is a good idea to protect individuals from unanticipated downstream use by companies that may persuade individuals to sign agreements that would lock them into unfavorable long-term deals. “Unlike publicity rights, privacy rights, almost without exception, are waivable or licensable, but cannot be assigned outright. Accordingly, we recommend a ban on outright assignments, and the inclusion of appropriate guardrails for licensing, such as limitations in duration and protection for minors.” 
  • The Office explicitly rejects the idea of a new digital replica right covering “artistic style.” We agree that protection of artistic style is a bad idea. Creators of all types have always used existing styles and methods as a baseline to build upon, and it’s resulted in a rich body of new works. Allowing for control over “style,” however well-defined, would impinge on these new creations. Strong federal protection over “style” would also contradict traditional limitations on rights, such as Section 102(b)’s limits on copyrightable subject matter and the idea/expression dichotomy, which are rooted in the Constitution. 

Some concerns: 

  • The Office’s proposal would apply to the distribution of digital replicas, which are defined as “a video, image, or audio recording that has been digitally created or manipulated to realistically but falsely depict an individual.” This definition is quite broad and could potentially include a large number of relatively common and mostly innocuous uses—e.g., taking a photo with your phone of a person and applying a standard filter on your camera app could conceivably fall within the definition. 
  • First Amendment rights to free expression are critical for protecting uses for news reporting, artistic uses, parody and so on. Expressive uses of digital replicas (e.g., a documentary that uses AI to replicate a recent event involving recognizable people, or reproduction in a comedy show to poke fun at politicians) could be significantly hindered by an expansive digital replica right unless it has robust free expression protections. Of course, the First Amendment applies regardless of the passing of a new law, but it will be important for any proposed legislation to find ways to allow people to exercise those rights effectively. As the report explains, comments were split. Some, like the Motion Picture Association, proposed enumerated exceptions for expressive use, while others, such as the Recording Industry Association of America, took the position that “categorical exclusions for certain speech-oriented uses are not constitutionally required and, in fact, risk overprotection of speech interests at the expense of important publicity interests.” 

We tend to think that most laws should skew toward “overprotection of speech interests,” but the devil is in the details on how to do so. The report leaves much to be desired on how to do this effectively in the context of digital replicas. For its part, “[t]he Office stresses the importance of explicitly addressing First Amendment concerns. While acknowledging the benefits of predictability, we believe that in light of the unique and evolving nature of the threat to an individual’s identity and reputation, a balancing framework is preferable.” One thing to watch in future proposals is what such a balancing framework actually includes, and how easy or difficult it is to assert protection of First Amendment rights under this balancing framework. 

  • The Office rejects the idea that Section 230 should provide protection for online service providers if they host content that runs afoul of the proposed new digital replica rights. Instead, the Office suggests something like a modified version of the Copyright Act’s DMCA section 512 notice and takedown process. This isn’t entirely outlandish: the DMCA process mostly works, and if this new proposed digital replica right is to be effective in practice, asking large service providers that are benefiting from hosting content to be responsive in cases of alleged infringing content may make sense. But the Office says that it doesn’t believe the existing DMCA process should be the model, and points to its own Section 512 report for how a revised version for digital replicas might work. If the Office’s 512 study is a guide to what a notice-and-takedown system could look like for digital replicas, there is reason to be concerned. While the study rejected some of the worst ideas for changing the existing system (e.g., a notice-and-staydown regime), it also repeatedly diminished the importance of ideas that would help protect creators with real First Amendment and fair use interests. 
  • The motivations for the proposed digital replica right are quite varied. For some commenters, it’s an objection to the commercial exploitation of public figures’ images or voices. For others, the need is to protect against invasions of privacy. For yet others, it is to prevent consumer confusion and fraud. The Office acknowledges these different motivating factors in its report and in its recommendations attempts to balance competing interests among them. But there are still real areas of discontinuity. For example, the basic structure of the right the Office proposes is intellectual-property-like, yet it doesn’t really make a lot of sense to try to address some of the most pernicious fraudulent uses, such as deepfakes to manipulate public opinion, revenge porn, or scam phone calls, with a privately enforced property right oriented toward commercialization. Discovering and stopping those uses requires a very different approach and one that this particular proposal seems ill-equipped to deal with. 

Barely a few months ago, we were extremely skeptical that new federal legislation on digital replicas was a good idea. We’re still not entirely convinced, but the rash of new and proposed state laws does give us some pause. While the federal legislative process is fraught, it is also far from ideal for authors and creators to operate under a patchwork of varying state laws, especially those that provide little protection for expressive uses. Overall, we hope certain aspects of this report can positively influence the debate about existing federal proposals in Congress, but remain concerned about the lack of detail about protections for First Amendment rights. 

In the meantime, you can check out our two new resource pages on Generative AI and Personality Rights to get a better understanding of the issues.