Author Archives: Dave Hansen

Authors Alliance 2024 Annual Report

Posted December 17, 2024

Authors Alliance celebrated an important milestone in 2024: our 10th anniversary! 

Quite a lot has changed since 2014, but our mission remains the same. We exist to advance the interests of authors who want to serve the public good by sharing their creations broadly.  I’m pleased to share our 2024 annual report, where you can find highlights of our work this year to promote laws, policies, and practices that enable authors to reach wide audiences.

Our success in 2024 was largely due to the wonderful collaboration and support we have from our members. You’ll see in the report a number of ongoing projects and issues we are working to address: legal questions about open access publishing, rights reversion at scale, supporting text data mining research, addressing contractual override of fair use,  AI and copyright, and more. As we look to 2025, I would love to hear from you if you have a special interest in any of these projects and would like to contribute your ideas, time, or expertise to help us tackle them.

I’m grateful for those of you who contributed financially to make 2024 a success. Authors Alliance is funded almost entirely by gifts and grants, and so we truly rely on you. As we end the year, I hope you will consider giving if you haven’t done so already. You can donate online here.

Thank you,

Dave Hansen
Executive Director 


Restricting Innovation: How Publisher Contracts Undermine Scholarly AI Research

Posted December 6, 2024

This post is by Rachael Samberg, Director, Scholarly Communication & Information Policy, UC Berkeley Library and Dave Hansen, Executive Director, Authors Alliance

This post is about the research, and the advancement of science and knowledge, that becomes impossible when publishers use contracts to limit researchers’ ability to use AI tools with scholarly works.

Within the scholarly publishing community, mixed messages abound about who gets to say when and how AI tools can be used for research reliant on scholarly works like journal articles or books. Some scholars voiced concern (explained more here) when major scholarly publishers like Wiley or Taylor & Francis entered into lucrative contracts with big technology companies to allow for AI training without first seeking permission from authors. We suspect that these publishers have the legal right to do so since most publishers demand that authors hand over extensive rights in exchange for publishing their work. And with the backdrop of dozens of pending AI copyright lawsuits, who can blame the AI companies for paying for licenses, if for no other reason than avoiding the pain of litigation? While it stings to see the same large commercial academic publishers profit yet again off the work academic authors submit to them for free, we continue to think there are good ways for authors to retain a say in the matter.

Big tech companies are one thing, but what about scholarly research? What about the large and growing number of scholars who are themselves using scholarly copyrighted content with AI tools to conduct their research? We currently face a situation in which publishers are attempting to dictate how and when researchers can do that work, even when authors’ fair use rights to use and derive new understandings from scholarship clearly allow for such uses.

How vendor contracts disadvantage US researchers

We have written elsewhere (in an explainer and public comment to the Copyright Office) about why training AI tools, particularly in the scholarly and research context, constitutes a fair use under U.S. copyright law. Critical to the advancement of knowledge, AI training rests on a statutory right already held by all scholarly authors engaging in computational research, and one that lawmakers should preserve.

The problem U.S. scholarly authors presently face with AI training is that publishers restrict their access to these statutory rights through contracts that override them: In the United States, publishers can use private contracts to take away statutory fair use rights that researchers would otherwise hold under Federal law. In this case, the private contracts at issue are the electronic resource (e-resource) license agreements that academic research libraries sign to secure campus access to electronic journal, e-book, data, and other content that scholars need for their computational research.

Contractual override of fair use is a problem that disparately disadvantages U.S. researchers. As we have described elsewhere, more than forty countries, including the European Union, expressly reserve text mining and AI training rights for scientific research by research institutions. Scholars in these countries not only need not worry whether their computational research with AI is permitted; they also do not risk having those reserved rights overridden by contract. The European Union’s Copyright Digital Single Market Directive and recent AI Act nullify any attempt to circumscribe the text and data mining and AI training rights reserved for scientific research within research organizations. U.S. scholars are not as fortunate.

In the U.S., most institutional e-resource licenses are negotiated and managed by research libraries, so it is imperative that scholars work closely with their libraries and advocate to preserve their computational research and AI training rights within the e-resource license agreements that universities sign. To that end, we have developed adaptable licensing language to support institutions in doing that nationwide. But while this language is helpful, the onus of advocacy and negotiation for those rights in the contracting process remains. Personally, we have found it helpful to explain to publishers that they must consent to these terms in the European Union, and can do so in the U.S. as well. That, combined with strong faculty and administrative support (such as at the University of California), makes for a strong stance against curtailment of these rights.

But we think there are additional practical ways for libraries to illustrate—both to publishers and scholarly authors—exactly what would happen to the advancement of knowledge if publishers’ licensing efforts to curtail AI training were successful. One way to do that is by “unpacking” or decoding a publisher’s proposed licensing restriction, and then demonstrating the impact that provision would have on research projects that were never objectionable to publishers before, and should not be now. We’ll take that approach below.

Decoding a publisher restriction

A commercial publisher recently proposed the following clause in an e-resource agreement:

Customer [the university] and its Authorized Users [the scholars] may not:

  1. directly or indirectly develop, train, program, improve, and/or enrich any artificial intelligence tool (“AI Tool”) accessible to anyone other than Customer and its Authorized Users, whether developed internally or provided by a third party; or
  2. reproduce or redistribute the Content to any third-party AI Tool, except to the extent limited portions of the Content are used solely for research and academic purposes (including to train an algorithm) and where the third-party AI Tool (a) is used locally in a self-hosted environment or closed hosted environment solely for use by Customer or Authorized Users; (b) is not trained or fine-tuned using the Content or any part thereof; and (c) does not share the Content or any part thereof with a third party.  

What does this mean?

  • The first paragraph forbids the training or improving of any AI tool that is accessible or released to third parties. It further forbids using any computational outputs or analyses derived from the licensed content to train any tool available to third parties.
  • The second paragraph is perhaps even more concerning. It provides that when using third-party AI tools of any kind, a scholar can use only limited portions of the licensed content with the tools and is prohibited from doing any training at all of third-party tools, even if the tool is non-generative and the scholar is performing the work in a completely closed and highly secure research environment.

What would the impact of such a restrictive licensing provision be on research? 

It would mean that every single one of the trained tools in the following projects could never be disseminated. In addition, for the projects below that used third-party AI tools, the research would have been prohibited full stop, because the third-party tools in those projects required training, which the publisher above is attempting to prevent:

Tools that could not be disseminated

  1. In 2017, chemists created and trained a generative AI tool on 12,000 published research papers regarding synthesis conditions for metal oxides, so that the tool could identify anticipated chemical outputs and reactions for any given set of synthesis conditions entered into the tool. The generative tool they created is not capable of reproducing or redistributing any licensed content from the papers; it has merely learned conditions and outcomes and can predict chemical reactions based on those conditions and outcomes. And this beneficial tool would be prohibited from dissemination under the publisher’s terms identified above.
  2. In 2018, researchers trained an AI tool (that they had originally created in 2014) to understand whether a character is “masculine” or “feminine” by looking at the tacit assumptions expressed in words associated with that character. That tool can then look at other texts and identify masculine or feminine characters based on what it knows from having been trained before. The implication is that scholars can use texts from different time periods with the tool to study representations of masculinity and femininity over time. No licensed content or copyrighted books from a publisher can ever be released to the world by sharing the trained tool; the tool is merely capable of topic modeling—but the publisher’s above language would prohibit its dissemination nevertheless.
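The property these projects rely on (a trained tool that retains learned associations but none of the underlying text) can be sketched in a few lines. The classifier below is a hypothetical toy: the sentences, labels, and scoring scheme are invented for illustration and are not the actual method of the 2018 project. The “model” it produces is just a table of per-word counts.

```python
# Toy sketch: a trained text classifier retains only learned word-label
# associations, never the training text itself.
from collections import Counter

def train(labeled_sentences):
    """Learn per-word label counts from (text, label) pairs."""
    weights = {}
    for text, label in labeled_sentences:
        for word in text.lower().split():
            weights.setdefault(word, Counter())[label] += 1
    return weights  # the "model": word -> label counts, no source text

def classify(weights, text, labels=("A", "B")):
    """Score new text by summing the learned label counts of its words."""
    scores = Counter()
    for word in text.lower().split():
        scores.update(weights.get(word, Counter()))
    return max(labels, key=lambda lab: scores[lab])

corpus = [("the duke rode to battle", "A"),
          ("the duchess held the court", "B")]
model = train(corpus)
print(classify(model, "rode to battle"))  # prints A
```

Sharing `model` shares only word-level statistics; the original passages cannot be regenerated from it, which is exactly why distributing such a tool releases no licensed content.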

Tools that could neither be trained nor disseminated 

  1. In 2019, authors used text from millions of books published over 100 years to analyze cultural meaning. They did this by training third-party non-generative AI word-embedding models called Word2Vec and GloVe on multiple textual archives. The tools cannot reproduce content: when shown new text, they merely represent words as numbers, or vectors, to evaluate or predict how similar words in a given space are semantically or linguistically. The similarity of words can reveal cultural shifts in understanding of socioeconomic factors like class over time. But the publisher’s above licensing terms would prohibit the training of the tools to begin with, much less the sharing of them to support further or different inquiry.
  2. In 2023, scholars trained a third-party-created open-source natural language processing (NLP) tool called Chemical Data Extractor (CDE). Among other things, CDE can be used to extract chemical information and properties identified in scholarly papers. In this case, the scholars wanted to teach CDE to parse a specific type of chemical information: metal-organic frameworks, or MoFs. Generally speaking, the CDE tool works by breaking sentences into “tokens” like parts of speech and referenced chemicals. By correlating tokens, one can determine that a particular chemical compound has certain synthetic properties, topologies, reactions with solvents, etc. The scholars trained CDE specifically to parse MoF names, synthesis methods, inorganic precursors, and more—and then exported the results into an open source database that identifies the MoF properties for each compound. Anyone can now use both the trained CDE tool and the database of MoF properties to ask different chemical property questions or identify additional MoF production pathways—thereby improving materials science for all. Neither the CDE tool nor the MoF database reproduces or contains the underlying scholarly papers that the tool learned from. Yet, neither the training of this third-party CDE tool nor its dissemination would be permitted under the publisher’s restrictive licensing language cited above.
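The word-embedding models in the first project above (Word2Vec and GloVe) reduce words to vectors and compare them geometrically. Here is a minimal sketch of that idea; the three-dimensional vectors and word choices are invented for illustration, whereas real models learn hundreds of dimensions from large corpora.

```python
# Hypothetical sketch of what a word-embedding model produces: each word
# becomes a vector, and semantic similarity is measured geometrically.
# The vectors below are invented for illustration.
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

embeddings = {
    "worker":     [0.9, 0.1, 0.3],
    "labor":      [0.8, 0.2, 0.3],
    "aristocrat": [0.1, 0.9, 0.6],
}

# Nearby vectors signal related meanings; tracking such distances across
# decades of text is one way shifts in the language of class are measured.
print(cosine(embeddings["worker"], embeddings["labor"]))       # high
print(cosine(embeddings["worker"], embeddings["aristocrat"]))  # lower
```

Note that the embedding table holds only numbers: no passage from any licensed work can be reconstructed from it, which is why sharing such a trained model releases no publisher content.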

Indeed, there are hundreds of AI tools that scholars have trained and disseminated—tools that do not reproduce licensed content—and that scholars have created or fine-tuned to extract chemical information, recognize faces, decode conversations, infer character types, and so much more. Restrictive licensing language like that shown above suppresses research inquiries and societal benefits that these tools make possible. It may also disproportionately affect the advancement of knowledge in or about developing countries, which may lack the resources to secure licenses or be forced to rely on open-source or poorly-coded public data—hindering journalism, language translation, and language preservation.

Protecting access to facts

Why are some publishers doing this? Perhaps to reserve the opportunity to develop and license their own scholarship-trained AI tools, which they could then license at additional cost back to research institutions. We could speculate about motivations, but the upshot is that publishers have been pushing hard to foreclose scholars from training and disseminating AI tools that now “know” something based on the licensed content. That is, such publishers wish to prevent tools from learning facts about the licensed content.

However, this is precisely the purpose of licensing content. When institutions license content for their scholars to read, they are doing so for the scholars to learn information from the content. When scholars write about it or teach about the content, they are not regenerating the actual expression from the content—the part that is protected by copyright; rather, the scholars are conveying the lessons learned from the content—facts not protected by copyright. Prohibiting the training of AI tools and the dissemination of those tools is functionally equivalent to prohibiting scholars from learning anything about the content that institutions are licensing for that very purpose, and that scholars have written to begin with! Publishers should not be able to monopolize the dissemination of information learned from scholarly content, especially when that information is used non-commercially.

For these reasons, when we negotiate to preserve AI usage and training rights, we generally try to achieve the following outcomes which would promote—rather than prohibit—all of the research projects described above:

The sample language we’ve disseminated empowers others to negotiate for these outcomes. We hope that, when coupled with the advocacy tools we’ve provided above, scholars and libraries can protect their AI usage and training rights, while also being equipped to consider how they want their own works to be used.

Developing a public-interest training commons of books

Posted December 5, 2024

Authors Alliance is pleased to announce a new project, supported by the Mellon Foundation, to develop an actionable plan for a public-interest book training commons for artificial intelligence. Northeastern University Library will be supporting this project and helping to coordinate its progress.

Access to books will play an essential role in how artificial intelligence develops. AI’s Large Language Models (LLMs) have a voracious appetite for text, and there are good reasons to think that these data sets should include books and lots of them. Over the last 500 years, human authors have written over 129 million books. These volumes, preserved for future generations in some of our most treasured research libraries, are perhaps the best and most sophisticated reflection of all human thinking. Their high editorial quality, breadth, and diversity of content, as well as the unique way they employ long-form narratives to communicate sophisticated and nuanced arguments and ideas make them ideal training data sources for AI.

These collections and the text embedded in them should be made available under ethical and fair rules as the raw material that will enable the computationally intense analysis needed to inform new AI models, algorithms, and applications imagined by a wide range of organizations and individuals for the benefit of humanity. 

Currently, AI development is dominated by a handful of companies that, in their rush to beat other competitors, have paid insufficient attention to the diversity of their inputs, questions of truth and bias in their outputs, and questions about social good and access. Authors Alliance, Northeastern University Library, and our partners seek to correct this tilt through the swift development of a counterbalancing project that will focus on AI development that builds upon the wealth of knowledge in nonprofit libraries and that will be structured to consider the views of all stakeholders, including authors, publishers, researchers, technologists, and stewards of collections. 

The main goal of this project is to develop a plan for either establishing a new organization or identifying the relevant criteria for an existing organization (or partnership of organizations) to take on the work of creating and stewarding a large-scale public interest training commons of books.

We seek to answer several key questions, such as: 

  • What are the right goals and mission for such an effort, taking into account both the long and the short term?
  • What technical and logistical challenges might differ from existing library-led efforts to provide access to collections as data?
  • How can we develop a sufficiently large and diverse corpus to offer a reasonable alternative to existing sources?
  • What should a public-interest governance structure look like, given the particular challenges of AI development?
  • How do we, as a collective of stakeholders from authors and publishers to students, scholars, and libraries, sustainably fund such a commons, including a model for long-term sustainability for maintenance, transformation, and growth of the corpus over time?
  • Which combination of legal pathways can ensure books are lawfully acquired in a way that minimizes legal challenges?
  • How can we respect the interests of authors and rightsholders by accounting for concerns about consent, credit, and compensation?
  • How should we distinguish between the different needs and responsibilities of nonprofit researchers, small market entrants, and large commercial actors?

The project will include two meetings during 2025 to discuss these questions and possible ways forward, additional research and conversations with stakeholders, and the development and release of an ambitious yet achievable roadmap.

Support Authors Alliance!

Posted December 3, 2024

As we end the year, I’m writing to ask for your financial support by giving toward our end-of-year campaign (click here to donate online).

In May, Authors Alliance marked its 10th anniversary. We’ve experienced tremendous support and enthusiasm for our work over the last decade, and your collaboration has been an important part of our success. I hope you’ll help Authors Alliance take on our next decade. 

We’re proud of our work promoting authorship for the public good by supporting authors who write to be read. In the past year, we secured expanded copyright exemptions for text and data mining research, helped defend authors’ fair use rights in court, launched an important initiative to clarify legal pathways for open access to federally funded research, and much more. We’ve also continued to help authors develop a deeper understanding of how complex policy issues can affect their work, drawing over 20,000 attendees for our in-person and online events on topics such as text and data mining, open access, artificial intelligence, and competition law.

For 2025, we have our work cut out for us. As policymakers actively consider changes to how the law accommodates free expression, access to information, and new technology, we continue to find that we are among the only voices defending authors’ rights to research, write, and share their work for the benefit of the public. Your support for Authors Alliance will help us continue to speak out in support of authors who value the public interest.

Donate Online Today

Thank you,
Dave Hansen
Executive Director

New White Paper on Open Access and U.S. Federal Information Policy

Posted November 18, 2024

Authors Alliance and SPARC have released the first of four planned white papers addressing legal issues surrounding open access to scholarly publications under the 2022 OSTP memo (the “Nelson Memo”). The white papers are part of a larger project (described here) to support legal pathways to open access. 

This first paper discusses the “Federal Purpose License,” which is newly relevant to discussions of federal public access policies in light of the Nelson Memo.

The white paper is available here and supporting materials are here.

The FPL, found in 2 C.F.R. § 200.315(b), works like any other copyright licensing agreement between two parties. It is a voluntary agreement between author and agency that, as a condition of federal funding, the agency reserves a nonexclusive license to “reproduce, publish, or otherwise use the work for Federal purposes and to authorize others to do so.” The FPL was updated, effective October 1, to clarify that the reserved license specifically includes the right to deposit copyrighted works produced pursuant to a grant in agency-designated public access repositories.

With the OSTP memos instructing all agencies to make the results of federally-funded projects available to the public immediately upon publication, the FPL provides an elegant legal basis for doing so. Because the FPL is a signed, written, non-exclusive license that springs to life immediately when copyright in the works vests, it survives any future transfers of rights in the work. As a part of Uniform Guidance for all grant-making agencies, it provides consistency across federal grants, simplifying things for grant recipients, who have plenty of other things to worry about (it’s not entirely uniform, though, since some agencies have supplemented the FPL with license text of their own, expanding their rights under the License).

This protects both agencies and authors. Agencies must have permission in order to host and distribute works in their repositories. The FPL ensures that the agency has that authorization and that it continues even after publication rights have been subsequently assigned to a publisher. Meanwhile, authors are—or will be—required under their grant agreements to deposit their federally-funded peer-reviewed articles in the agency’s designated repository. The FPL ensures that, even if an author were to assign exclusive rights in a work to a publisher prior to complying with the deposit mandate, the author could still deposit the article, despite no longer having any rights in the work herself.

The paper analyzes two ambiguous points in the FPL, namely, the scope of what rights agencies have as “Federal purposes” and what rights the agency may subsequently authorize for third parties. As there are no clear answers to these questions, the paper does not draw conclusions; it does, however, attempt to give some context and basis for how to interpret the FPL.

The next papers in this series will explore issues surrounding the legal authority underlying the public access policy, article versioning, and the policy’s interaction with institutional IP policies. Stay tuned for more!

Book Talk: The Line: AI and the Future of Personhood

Join us for a book talk with legal scholar JAMES BOYLE, discussing his book THE LINE: AI and the Future of Personhood, in conversation with KATE DARLING of the MIT Media Lab.

November 19th at 10am PT / 1pm ET
REGISTER NOW!

Chatbots like ChatGPT have challenged human exceptionalism: we are no longer the only beings capable of generating language and ideas fluently. But is ChatGPT conscious? Or is it merely engaging in sophisticated mimicry? And what happens in the future if the claims to consciousness are more credible? In The Line, James Boyle explores what these changes might do to our concept of personhood, to “the line” we believe separates our species from the rest of the world, but also separates “persons” with legal rights from objects.

The personhood wars—over the rights of corporations, animals, over the question of when life begins and ends—have always been contentious. We’ve even denied the personhood of members of our own species. How will those old fights affect the new ones, and vice versa? Boyle pursues those questions across a dizzying array of fields. He discusses moral philosophy and science fiction, transgenic species, nonhuman animals, the surprising history of corporate personality, and AI itself. Engaging with empathy and anthropomorphism, courtroom battles on behalf of chimps, and doom-laden projections about the threat of AI, The Line offers fascinating and thoughtful answers to questions about our future that are arriving sooner than we think.

You can find a free, open access copy of The Line here, and you can purchase a copy directly from MIT Press here.

ABOUT OUR SPEAKERS

JAMES BOYLE is the William Neal Reynolds Professor of Law at Duke Law School, founder of the Center for the Study of the Public Domain, and former Chair of Creative Commons. He is the author of The Public Domain and Shamans, Software, and Spleens, the coauthor of two comic books, and the winner of the Electronic Frontier Foundation’s Pioneer Award for his work on digital civil liberties.

DR. KATE DARLING is a Research Scientist at the MIT Media Lab and leads the Robotics Ethics & Society research team at the Boston Dynamics AI Institute. She is the author of The New Breed: What Our History with Animals Reveals about Our Future with Robots. Kate’s work explores the intersections of robotic technology and society, with a particular interest in legal, economic, social, and ethical issues.

Join Us! November 19th, 10am PT / 1pm ET

REGISTER NOW!

The DMCA 1201 Rulemaking: Summary, Key Takeaways, and Other Items of Interest

Posted November 8, 2024

Last month, we blogged about the key takeaways from the 2024 TDM exemptions recently put in place by the Librarian of Congress, including how the 2024 exemptions (1) expand researchers’ access to existing corpora, (2) definitively allow the viewing and annotation of copyrighted materials for TDM research purposes, and (3) create new obligations for researchers to disclose security protocols to trade associations. Beyond these key changes, the TDM exemptions remain largely the same: researchers affiliated with universities are allowed to circumvent TPMs to compile corpora for TDM research, provided that those copies of copyrighted materials are legally obtained and adequate security protocols are put in place.

We have since updated our resources page on Text and Data Mining and have incorporated the new developments into our TDM report: Text and Data Mining Under U.S. Copyright Law: Landscape, Flaws & Recommendations.

In this blog post, we share some further reflections on the newly expanded TDM exemptions—including (1) the use of AI tools in TDM research, (2) outside researchers’ access to existing corpora, (3) the disclosure requirement, and (4) a potential TDM licensing market—as well as other insights that emerged during the 9th triennial rulemaking.

The TDM Exemption

In other jurisdictions, such as the EU, Singapore, and Japan, legal provisions that permit “text data mining” also allow a broad array of uses, such as general machine learning and generative AI model training. In the US, exemptions allowing TDM so far have not explicitly addressed whether AI could be used as a tool for conducting TDM research. In this round of rulemaking, we were able to gain clarity on how AI tools are allowed to aid TDM research. Advocates for the TDM exemptions provided ample examples of how machine learning and AI are key to conducting TDM research and asked that “generative AI” not be deemed categorically impermissible as a tool for TDM research. The Copyright Office agreed that a wide array of tools could be utilized for TDM research under the exemptions, including AI tools, as long as the purpose is to conduct “scholarly text and data mining research and teaching.” The Office was careful to limit its analysis to those uses and not address other applications such as compiling data—or reusing existing TDM corpora—for training generative AI models; those are an entirely separate issue from facilitating non-commercial TDM research.

Besides clarifying that AI tools are allowed for TDM research and that viewing and annotation are permitted for copyrighted materials, the new exemptions offer meaningful improvement to TDM researchers’ access to corpora. The previous 2021 exemptions allowed access for purposes of “collaboration,” but many researchers interpreted that narrowly, and the Office confirmed that “collaboration” was not meant to encompass outside research projects entirely unrelated to the original research for which the corpus was created. Under the 2021 exemptions, a TDM corpus could only be accessed by outside researchers if they are working on the same research project as the original compiler of the corpus. The 2024 exemptions’ expansion of access to existing corpora has two main components and advantages. 

The expansion now allows for new research projects to be conducted on existing corpora, permitting institutions that have created a corpus to provide access “to researchers affiliated with other nonprofit institutions of higher education, with all access provided only through secure connections and on the condition of authenticated credentials, solely for purposes of text and data mining research or teaching.” At the same time, it also opens up new possibilities for researchers at institutions who otherwise would not have access, as the new exemption does not require a precondition that the outside researchers’ institutions otherwise own copies of works in the corpora. The new exemptions impose some important limitations: only researchers at institutions of higher education are allowed this access, and nothing more than “access” is allowed—it does not, for example, allow the transfer of a corpus for local use.

The Office emphasized the need for adequate security protections, pointing back to cases such as Authors Guild v. Google and Authors Guild v. HathiTrust, which emphasized how careful both organizations were, respectively, to prevent their digitized corpora from being misused. To take advantage of this newly expanded TDM exemption, it will be crucial for universities to provide adequate IT support to ensure that technical barriers do not impede TDM researchers. That said, the record for the exemption shows that existing users are exceedingly conscientious when it comes to security. There have been zero reported instances of security breaches or lapses related to TDM corpora being compiled and used under the exemptions. 

As we previously explained, the security requirements are changed in a few ways. The new rule clarifies that trade associations can send inquiries on behalf of rightsholders. However, inquiries must be supported by a “reasonable belief” that the sender’s works are in a corpus being used for TDM research. It remains to be seen how the new obligation to disclose security measures to trade associations would impact TDM researchers and their institutions. The Register circuitously called out demands by trade associations sent to digital humanities researchers in the middle of the exemption process with a two-week response deadline as unreasonable and quoted NTIA (which provides input on the exemptions) in agreement that  “[t]he timing, targeting, and tenor of these requests [for institutions to disclose their security protocols] are disturbing.”  We are hopeful that this discouragement from the Copyright Office will prevent any future large-scale harassment towards TDM researchers and their institutions, but we will also remain vigilant in case trade associations were to abuse this new power. 

Alongside the concerns over disclosure requirements, we have some questions about the Copyright Office’s treatment of fair use as a rationale for circumventing TPMs for TDM research. The Register restated her 2021 conclusion that “under Authors Guild, Inc. v. HathiTrust, lost licensing revenue should only be considered ‘when the use serves as a substitute for the original.’” The Office, in its recommendations, placed considerable weight on the lack of a viable licensing market for TDM, which raises a concern that, in the Office’s view, a use that once was fair and legal might lose that status when the rightsholder starts to offer an adequate licensing option. While this may never become a real issue for the existing TDM exemptions (because no sufficient licensing options exist for TDM researchers, and for the breadth and depth of content needed, it seems unlikely to ever develop), it nonetheless contributes to the growing confusion surrounding the stability of a fair use defense in the face of new licensing markets. 

These concerns highlight the need for ongoing advocacy in the realm of TDM research. Overall, the Register of Copyrights recognizes TDM as “a relatively new field that is quickly evolving.” This means that we could ask the Library of Congress to relax the limitations placed on TDM if we can point to legitimate research-related purposes. But, due to the nature of this process, it also means TDM researchers do not have a permanent and stable right to circumvent TPMs. Because the exemptions remain subject to review every three years, many large trade associations continue to advocate for the TDM exemptions to be greatly limited or even eliminated, wishing to stifle independent TDM research. We will continue to advocate for TDM researchers, as we did during the 8th and 9th triennial rulemakings. 

Looking beyond the TDM exemption, we noted a few other developments: 

Warhol has not fundamentally changed fair use

First, the Opponents of the renewal of the existing exemptions repeatedly pointed to Warhol Foundation v. Goldsmith—the Supreme Court’s most recent fair use opinion—to argue that it has changed the fair use analysis such that the existing exemptions should not be renewed. For example, the Opponents argued that the fair use analysis for repairing medical devices changed under Warhol because, according to them, commercial nontransformative uses were less likely to be fair. The Copyright Office did not agree. The Register said that the same fair use analysis as in 2021 applied and that the Opponents failed “to show that the Warhol decision constitutes intervening legal precedent rendering the Office’s prior fair use analysis invalid.” In another instance where the Opponents tried to argue that commerciality must be given more weight under Warhol, the Register pointed out that under Warhol commerciality is not dispositive and must be weighed against the purpose of the new use.  The arguments for revisiting the 2021 fair use analyses were uniformly rejected, which we think is good news for those of us who believe Warhol should be read as making a modest adjustment to fair use and not a wholesale reworking of the fair use doctrine. 

Does ownership and control of copies matter for access? 

One of the requests before the Office was an expansion of an exemption that allows for access to preservation copies of computer programs and video games. The Office rejected the main thrust of the request but, in doing so, also provided an interesting clarification that may reveal some of the Office’s thinking about the relationship between fair use and access to copies owned by the user: 

“The Register concludes that proponents did not show that removing the single user limitation for preserved computer programs or permitting off-premises access to video games are likely to be noninfringing. She also notes the greater risk of market harm with removing the video game exemption’s premises limitation, given the market for legacy video games. She recommends clarifying the single copy restriction language to reflect that preservation institutions can allow a copy of a computer program to be accessed by as many individuals as there are circumvented copies legally owned.”

That sounds a lot like an endorsement of the idea that the owned-to-loaned ratio, a key concept in the controlled digital lending analysis, should matter in the fair use analysis (something the court in Hachette v. Internet Archive, the controlled digital lending case, gave zero weight to). For future 1201 exemptions, we will have to wait and see whether the Office will use this framework in other contexts. 

Addressing other non-copyright and AI questions in the 1201 process

The Librarian of Congress’s final rule included a number of notes on issues not addressed by the rulemaking: 

“The Librarian is aware that the Register and her legal staff have invested a great deal of time over the past two years in analyzing the many issues underlying the 1201 process and proposed exemptions. 

Through this work, the Register has come to believe that the issue of research on artificial intelligence security and trustworthiness warrants more general Congressional and regulatory attention. The Librarian agrees with the Register in this assessment. As a regulatory process focused on technological protection measures for copyrighted content, section 1201 is ill-suited to address fundamental policy issues with new technologies.” 

Proponents tried to argue that the software platforms’ restrictions and barriers to conducting AI research, such as their account requirements, rate limits, and algorithmic safeguards, are circumventable TPMs under 1201, but the Register disagreed. The Register maintained that the challenges Proponents described arose not out of circumventable TPMs but out of third-party controlled Software as a Service platforms. This decision can be illuminating for TDM researchers seeking to conduct TDM research on online streaming media or social media posts.

The Librarian’s note went on to say: “The Librarian is further aware of the policy and legal issues involving a generalized ‘right to repair’ equipment with embedded software. These issues have now occupied the White House, Congress, state legislatures, federal agencies, the Copyright Office, and the general public through multiple rounds of 1201 rulemaking. 

Copyright is but one piece in a national framework for ensuring the security, trustworthiness, and reliability of embedded software, as well as other copyright-protected technology that affects our daily lives. Issues such as these extend beyond the reach of 1201 and may require a broader solution, as noted by the NTIA.”

These notes give an interesting, though a bit confusing, insight into how the Librarian of Congress and the Copyright Office think about the role of 1201 rulemaking when they address issues that go beyond copyright’s core concerns. While we agree that 1201 is ill-suited to address fundamental policy issues with new technology, it is also somewhat concerning that the Office and the Librarian view copyright more generally as part of a broader “national framework for ensuring the security, trustworthiness, and reliability of embedded software.” While copyright is, of course, sometimes used to further ends outside of its intended purpose, these issues are far from the core constitutional purpose of copyright law, and we think they are best addressed through other means. 

Copyright Management Information, 1202(b), and AI

Posted October 30, 2024

This post is by Maria Crusey, a third-year law student at Washington University in St. Louis. Maria has been working with Authors Alliance this semester on a project exploring legal claims in the now 30+ pending copyright AI lawsuits. 

In the recent spate of copyright infringement lawsuits against AI developers, many plaintiffs allege violations of 17 U.S.C. § 1202(b) in their use of copyrighted works for training and development of AI systems.  

Section 1202(b) prohibits the “removal or alteration of copyright management information.” Compared to the related provisions of 17 U.S.C. § 1201, which protect against circumvention of copyright protection systems, §1202(b) has seldom been litigated at the appellate level, and there’s a growing divide among district courts about whether §1202(b) should apply to derivative works, particularly those created using AI technology.

At first glance, §1202(b) appears to be a straightforward provision. However, the uptick in §1202(b) claims raises some challenging questions, namely: How does §1202(b) apply to the use of a copyrighted work as part of a dataset that must be cleaned, restructured, and processed in ways that separate copyright management information from the content itself? And how should 1202(b) apply to AI systems that may reproduce small portions of content contained in training data? Answers to these questions may have serious implications in the AI suits because violations of 1202(b) can come with hefty statutory damage awards – between $2,500 and $25,000 for each violation. Spread across millions of works, the damages could be staggering. How the courts resolve this issue could also impact many other reuses of copyrighted works–from analogous uses such as text data mining research to much more routine re-distribution of copyrighted works in other contexts. 
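To get a sense of the scale involved, here is a rough back-of-envelope calculation. The per-violation dollar figures come from the statute; the work counts are purely hypothetical, and real awards would depend on how courts count violations and exercise their discretion within the statutory range:

```python
# Hypothetical illustration of how Section 1202 statutory damages scale
# with the number of works involved. The statutory range ($2,500-$25,000
# per violation) is set by 17 U.S.C. § 1203(c)(3)(B); the work counts
# below are invented for illustration only.

STATUTORY_MIN = 2_500   # dollars per violation
STATUTORY_MAX = 25_000  # dollars per violation

def damage_range(num_violations: int) -> tuple[int, int]:
    """Return the (minimum, maximum) statutory damages in dollars,
    treating each affected work as a single violation."""
    return num_violations * STATUTORY_MIN, num_violations * STATUTORY_MAX

for works in (1_000, 1_000_000):
    low, high = damage_range(works)
    print(f"{works:>9,} works: ${low:,} to ${high:,}")
```

Even at the statutory minimum, a training set of a million works would imply $2.5 billion in damages under this (again, simplified) one-violation-per-work assumption.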

One of these AI cases has requested that the Ninth Circuit Court of Appeals accept an interlocutory appeal on just this issue, and we are waiting to see whether the court will accept it.

For an introduction to §1202(b) and observations on this question, among others, read on:

What is § 1202(b) and what is it intended to do?

Broadly, 17 U.S.C. § 1202 is a provision of the Digital Millennium Copyright Act (DMCA) that protects the integrity of copyright management information (“CMI”). Per §1202(c), CMI comprises certain information identifying a copyrighted work, often including the title, the name of the author, and terms and conditions for the use of a work.

Section 1202(b) forbids the alteration or removal of copyright management information. The section provides that:

“[n]o person shall, without the authority of the copyright owner or the law – 

(1) intentionally remove or alter any CMI,

(2) distribute or import for distribution CMI knowing that the CMI has been removed or altered without authority of the copyright owner or the law, or 

(3) distribute, import for distribution, or publicly perform works, copies of works or phonorecords, knowing that copyright management information has been removed or altered without authority of the copyright owner or the law, knowing, or with respect to civil remedies under section 1203, having reasonable grounds to know that it will induce, enable, facilitate, or conceal an infringement of any right under this title.”

17 U.S.C. § 1202(b).

In enacting §1202(b), Congress primarily aimed to limit the facilitation and enablement of copyright infringement. This purpose is evident in the provision’s legislative history. In an address to a congressional subcommittee prior to the adoption of the DMCA, then–Register of Copyrights Marybeth Peters discussed the aims of §1202(b). First, Peters noted that the requirements of §1202(b) would make CMI more reliable and thus aid in the administrability of copyright law. Second, Peters stated that §1202(b) would help prevent instances of copyright infringement that could follow from the removal of CMI. The idea is that if a copyrighted work lacks CMI, there is a greater likelihood of infringement, since others may use the work under the pretense that they are the author or copyright holder. In creating a statutory violation for a party’s removal of CMI, regardless of later infringing activity, §1202(b) functions as damage control against potential copyright infringement.

What are the essential elements of a § 1202(b) claim?

To have a claim under §1202(b), a plaintiff must allege particularized facts about the existence and alteration or removal of CMI. Additionally, some courts require a plaintiff to demonstrate that the defendant had knowledge that the CMI was being altered or removed and that the alteration or removal would enable copyright infringement. Finally, some courts have required plaintiffs to show that the work with the altered or removed CMI is an exact copy of the original work–what has become known as the “identicality” requirement. This last “identicality” requirement is one of the main issues in the AI lawsuits raising §1202(b) and is detailed further below.

→ The “Identicality” Requirement

Courts that have imposed “identicality” have required that plaintiffs demonstrate that the work with the removed CMI is an exact copy of the original work and thus is “identical,” except for the missing or altered CMI. 

Suppose, for example, a photographer owns the copyright to a photograph they took. The photographer adds CMI to the photograph and takes care to protect the integrity of the work as it is dispersed online. A third party captures the photograph posted on a website by taking a screenshot and removes the CMI from the copied image while keeping all other aspects of the original photograph the same. The screenshot with the removed CMI is an “exact copy” of the original photograph because the only difference between the copyrighted photograph and the screenshot is the removal of the CMI.

Federal courts are divided on the identicality requirement for §1202(b) claims, and the circuit courts have not yet addressed the issue. Notably, district courts within the Ninth Circuit have varied in their treatment of the requirement. For example, in Oracle v. Rimini Street, the court for the District of Nevada declined to impose the identicality requirement, reasoning that it could weaken the protections §1202(b) is intended to provide copyright holders. Conversely, in Kirk Kara Corp. v. W. Stone & Metal Corp., a court in the Central District of California applied the identicality requirement, though it provided little explanation for doing so. Application of the identicality requirement is also unsettled in district courts beyond the Ninth Circuit (see, for example, this Southern District of Texas case discussing the identicality requirement at length and rejecting it). 

What are the §1202(b) claims at issue in the present suits?

The claims in Doe 1 v. Github exemplify the §1202(b) issues common among the present suits, and it is the Github suit whose interlocutory appeal the Ninth Circuit is now deciding whether to hear. 

In Github, owners of copyrights in software code brought a suit against GitHub, a software developer platform. The plaintiffs alleged that Microsoft Copilot, an AI product developed in part by GitHub, illegally removed CMI from their works. The plaintiffs stored their software in GitHub’s publicly accessible software repositories under open-source license agreements. The plaintiffs claimed that GitHub removed CMI from their code and trained the Copilot AI model on the code in violation of the license agreements. Moreover, the plaintiffs claimed that, when prompted to generate software code, Copilot includes unique aspects of the plaintiffs’ code in its outputs. In their complaint, the plaintiffs alleged that all requirements for a valid § 1202(b) claim were met in the present suit. The plaintiffs stressed that, in removing CMI, the defendants prevented users of the product from making non-infringing use of the plaintiffs’ works. Consequently, they claim, the defendants removed the CMI knowing that it would “induce, enable, facilitate, and/or conceal infringement” of copyrights in violation of the DMCA.

Regarding the §1202(b) claims, the parties contest the application of the identicality requirement. The plaintiffs first argue that § 1202 contains no such requirement: “The plain language of DMCA § 1202 makes it a violation to remove or alter CMI. It does not require that the output work be original or identical to obtain relief. . . By a plain reading of the statute, there is no need for a copy to be identical—there only needs to be copying, which Plaintiffs have amply alleged.” 

As a backstop, the plaintiffs further argue that Copilot does produce “near-identical reproduction[s]” of their copyrighted code and allege this is sufficient to fulfill the identicality requirement under §1202(b). Specifically, plaintiffs claimed that Copilot generates parts of plaintiffs’ code in extra lines of output code that are not relevant to input prompts. Plaintiffs also claimed Copilot generates their code in output code that produces errors due to a mismatch between the directly copied code and the code that would actually fit the prompt. To make this assertion work, plaintiffs distinguish their version of “identicality” –semantically equivalent lines of code–from a reproduction of the whole work. They argue that the defendant’s position, that “the reproduction of short passages that may be part of [a] larger work, rather than the reproduction of an entire work, is insufficient to violate Section 1202,” would lead to absurd results. “By OpenAI’s logic, a party could copy and distribute a fragment of a copyrighted work—say, a chapter of a book, a stanza of a poem, or a scene from a movie—and face no repercussions for infringement.” 

In their reply, the defendants countered that §1202, which defines CMI as relating to a “copy of a work,” requires a complete and identical copy, not just snippets. Defendants noted that the plaintiffs have conceded that Copilot reproduces only snippets of code rather than complete versions of the code. Therefore, the defendants argue, Copilot does not create “identical copies” of the plaintiffs’ complete copyrighted works. The argument rests on the text of the statute (they note that the statute only provides for liability when distributing copies that CMI has been stripped from, not derivatives, abridgments, or other adaptations), bolstered by the suggestion that allowing 1202 claims for incomplete copies would create chaos for ordinary uses of copyrighted works: “On Plaintiffs’ reading of § 1202, if someone opened an anthology of poetry and typed up a modified version of a single “stanza of a poem,” . . . without including the anthology’s copyright page, a § 1202(b) claim would lie. Plaintiffs’ reading effectively concedes that they are attempting to turn every garden-variety claim of copyright infringement into a DMCA claim, only without the usual limitations and defenses applicable under copyright law. Congress intended no such thing.” 

The GitHub court has addressed the issue now several times: it initially dismissed the plaintiffs’ §1202(b)(1) and (b)(3) claims, subsequently denied the plaintiffs’ motion for reconsideration of the claims, allowed the plaintiffs to amend their complaint and try again with more specificity, then dismissed the claims again. The reasoning of the court has been consistent and largely focused on insufficient allegations of identicality. The court agreed with Defendants that the identicality requirement should apply and that the snippets do not satisfy the requirement. Following the dismissal, the plaintiffs sought and received permission from the district court to file an interlocutory appeal (an appeal on a specific issue before the case is fully resolved, something not usually allowed) to the Court of Appeals for the Ninth Circuit to determine whether § 1202(b)(1) and (b)(3) impose an identicality requirement. The Ninth Circuit is presently considering whether to hear the appeal.

What would the Ninth Circuit assess in the appeal, and what are the implications of the appeal for future lawsuits?

If the appeal is accepted, the Ninth Circuit will determine whether §1202(b)(1) and (b)(3) actually impose an identicality requirement. Moreover, with regard to the facts of the Github case, the court will decide whether the identicality requirement requires exact copying of a complete copyrighted work, or perhaps something less. The Ninth Circuit’s hearing of this appeal would be notable for a number of reasons.

First, as mentioned above, §1202(b) is largely unaddressed by the circuit courts, and explicit appellate guidance has only been provided for the knowledge requirement referenced above. Consequently, determinations of §1202(b) claims are largely informed by varying district court decisions that are binding only on the parties to the suits and provide inconsistent interpretations of the requirements for a claim under the provision. An appellate ruling that accepts or rejects the identicality requirement would create additional binding authority to further clarify courts’ interpretations of §1202(b).

Second, a ruling on the identicality requirement from the Ninth Circuit specifically would be notable because it would be binding on the large number of §1202(b) claims presently being litigated in the Ninth Circuit’s lower courts. And, given the concentration of AI developers in California and elsewhere within the Ninth Circuit, the outcome of the appeal would significantly impact future lawsuits that involve §1202(b) claims.

It is hard to predict how the Ninth Circuit might rule, but we can work through some of the implications of the choices the court would have before it: 

If the Ninth Circuit interprets the identicality requirement as requiring a complete and exact copy, it would set a high bar, and plaintiffs would likely be constrained in their ability to bring §1202(b) claims. If the court did this, the Github plaintiffs’ claims would likely fail, as the alleged copied snippets of code generated by Copilot are not exact copies and do not comprise the complete copyrighted works. This hypothetical standard would be advantageous for individuals who remove CMI from copyrighted works in the course of processing them using AI, as well as those who deploy AI systems that produce small portions of content similar (but not identical) to inputs. So long as the works being processed or distributed are not complete, exact copies, individuals would be free to alter the CMI of the works for ease in analyzing the copyrighted information. 

Alternatively, the Ninth Circuit could adopt a loose interpretation of identicality in which incomplete and inexact copying would be sufficient. One approach would be to require identicality but not copying of the entire work (something the plaintiffs in the Github suit advocate for). How the parties or the Ninth Circuit would formulate what standard would apply to this “less than entire” but still “near identical” standard is hard to say, but presumably, plaintiffs would have an easier time alleging facts sufficient for a §1202(b) claim. Applied to Github, it still seems unclear that the copied snippets of the plaintiffs’ code in the Copilot outputs could pass muster (this is likely a factual question to be determined at later stages of the litigation). But it could allow claims to at least survive an early motion to dismiss. As such, the adoption of this standard could limit how AI developers engage with works but also potentially affect others, such as researchers using similar techniques to process, clean, and distribute small portions of copyrighted works as part of a dataset.

Finally, the Ninth Circuit may decide to do away with the identicality requirement altogether. While this may seem like a boon to plaintiffs, who could then premise a claim on the removal of CMI and the distribution of some copied material, no matter how small, plaintiffs would still face substantial challenges. Elimination of the identicality requirement would likely lead to greater weight being placed on the knowledge requirement in courts’ assessments of §1202(b) claims, which requires that defendants know or have reasonable grounds to know that their actions will “induce, enable, facilitate, or conceal an infringement.” In the context of the Github case, even without an identicality requirement, the plaintiffs’ §1202(b) claims contain scant factual allegations about the defendants’ CMI removal and knowledge in the court filings to date. For other developers and users of AI, the effects of eliminating the identicality requirement would likely vary on a case-by-case basis. 

Conclusion

Recent copyright infringement suits and the pending appeal to the Ninth Circuit in Doe 1 v. Github demonstrate that §1202(b) is having its day in the sun. Although the provision has been overlooked and infrequently litigated in the past, the scope of protections granted by §1202(b) is important for understanding whether and how AI developers can remove CMI when processing, restructuring, and analyzing copyrighted works for AI development. Thus, as lawsuits against AI developers and users continue to progress, the requirements for a valid §1202(b) claim are sure to become even more contentious.

Text Data Mining Research DMCA Exemption Renewed and Expanded

Posted October 25, 2024
U.S. Copyright Office 1201 Rulemaking Process, taken from https://www.copyright.gov/1201/

Earlier today, the Library of Congress, following recommendations from the U.S. Copyright Office, released its final rule adopting exemptions to the Digital Millennium Copyright Act’s prohibition on circumvention of technological protection measures (e.g., DRM).  

As many of you know, we’ve been working closely with members of the text and data mining community as well as our co-petitioners, the Library Copyright Alliance (LCA) and the American Association of University Professors (AAUP), to petition for renewal of the existing TDM research exemption and to expand it to allow researchers to share their research corpora with other researchers outside of their university (something not previously allowed). The process began over a year ago and involved an in-depth review by the U.S. Copyright Office, and we’re incredibly grateful for the expert legal representation before the Office over this past year by UC Berkeley Law’s Samuelson Law, Technology & Public Policy Clinic, and in particular clinic faculty Erik Stallman and Jennifer Urban and Berkeley Law students Christian Howard-Sukhil, Zhudi Huang, and Matthew Cha.

We are very pleased that the Librarian of Congress both approved the renewal of the existing exemption and approved an expansion that allows research universities to provide access to TDM corpora for use by researchers at other universities. 

The expanded rule is poised to make an immediate impact in helping TDM researchers collaborate and build upon each other’s work. As Allison Cooper, director of Kinolab and Associate Professor of Romance Languages and Literatures and Cinema Studies at Bowdoin College, explains:

“This decision will have an immediate impact on the ongoing close-up project that Joel Burges, Emily Sherwood, and I are working on by allowing us to collaborate with researchers like David Bamman, whose expertise in machine learning will be valuable in answering many of the ‘big picture’ questions about the close-up that have come up in our work so far.”

These are the main takeaways from the new rule: 

  • The exemption has been expanded to allow “access” to corpora by researchers at other institutions “solely for purposes of text and data mining research or teaching.” There is no more requirement that access be granted as part of a “collaboration,” so new researchers can ask new and different questions of a corpus. Access must be credentialed and authenticated.
  • The issue of whether a researcher can engage in “close viewing” of a copyrighted work has been resolved—as the explanation for the revised rule puts it, researchers can “view the contents of copyrighted works as part of their research, provided that any viewing that takes place is in furtherance of research objectives (e.g., processing or annotating works to prepare them for analysis) and not for the works’ expressive value.” This is a very helpful clarification!
  • The new rule also modified the existing security requirements, which provide that researchers must put in place adequate security protocols to protect TDM corpora from unauthorized reuse and must share information about those security protocols with rightsholders upon request. That rule has been limited in some ways and expanded in others. The new rule clarifies that trade associations can send inquiries on behalf of rightsholders. However, inquiries must be supported by a “reasonable belief” that the sender’s works are in a corpus being used for TDM research.

Later on, we will post a more in-depth analysis of the new rules–both TDM and others that apply to authors. The Librarian of Congress also authorized the renewal of a number of other rules that support research, teaching, and library preservation. Among them is a renewal of another exemption that Authors Alliance and AAUP petitioned for, allowing for the circumvention of digital locks when using motion picture excerpts in multi-media ebooks. 

Thank you to all of the many, many TDM researchers and librarians we’ve worked with over the last several years to help support this petition. 

You can learn more about TDM and our work on this issue through our TDM resources page, here.

Who Represents You in the AI Copyright Lawsuits? 

Posted October 16, 2024

Sarah Silverman is the author of The Bedwetter, a comedy memoir. Richard Kadrey wrote Sandman Slim, a fantasy novel series. And Christopher Golden, a supernatural thriller titled Ararat. 

These authors might not seem to have much in common with an academic author who writes in history, physics, or chemistry. Or a journalist. Or a poet. Or, for that matter, me, writing this blog post.  And yet, these authors may end up representing us all in court. 

A large number of the recent AI copyright lawsuits are class action lawsuits. This means that these lawsuits are brought by a small number of plaintiffs who (subject to judicial approval) are granted the right to represent a much larger class. In many of the AI copyright lawsuits, the proposed classes are extraordinarily broad, including many creators who might be surprised that they are being represented. If you live in the US and wrote something that was published online, there is a good chance that you are included in more than one of these classes. 

A very brief background on class action lawsuits

Class actions can be an efficient way of resolving disputes that involve lots of people, allowing for a single resolution that binds many parties when there are common interests and facts. As you can imagine, the class action mechanism can also attract misuse, for example, by plaintiffs (and their attorneys) who may seek large settlements on behalf of a large number of people. Those settlements may benefit the named plaintiffs and their attorneys but they aren’t really aligned with the interests of most class members. 

There are rules in place to prevent that kind of abuse. In federal courts (where all copyright lawsuits must be brought), Rule 23 of the Federal Rules of Civil Procedure governs. It provides that:

“One or more members of a class may sue or be sued as representative parties on behalf of all members only if: 
(1) the class is so numerous that joinder of all members is impracticable; [“numerosity”]
(2) there are questions of law or fact common to the class; [“commonality”]
(3) the claims or defenses of the representative parties are typical of the claims or defenses of the class; [“typicality”] and
(4) the representative parties will fairly and adequately protect the interests of the class. [“adequacy”]”

The rest of Rule 23 contains a number of other safeguards to protect both class members and defendants. Among them are requirements that the court certify that the class complies with Rule 23, that any proposed settlement be approved by the court, and that class members receive notice of any proposed settlement and an opportunity to object. Additionally, there are a number of rules to ensure that the law firm bringing the suit can fairly and competently represent the class members. 

Class definition and class representatives in the copyright AI lawsuits

We believe it’s important for creators to pay attention to these suits because, if a class is certified and includes those creators, the class representatives will have meaningful legal authority to speak on their behalf.

Rule 23 provides that “at an early practicable time after a person sues,” the court must decide whether to certify the proposed class. Though we are now well over a year into some of the earliest suits, this has yet to happen. In the meantime, what we have are proposed class definitions offered by the plaintiffs. How broadly or narrowly the plaintiffs define a class will be one of the most important factors in whether it can be certified, since the definition directly affects the commonality of facts among the class, the typicality of claims, and whether the representatives can fairly and adequately represent the interests of the class. Plaintiffs bear the burden of proving that they have satisfied Rule 23.

In these AI lawsuits, we see some themes in the class representatives and proposed classes, with many suits offering very broad class definitions. For example, in the now-consolidated In re OpenAI ChatGPT Litigation, the class representatives are 11 fiction writers, authors of books such as The Cabin at the End of the World, The Brief Wondrous Life of Oscar Wao, and What the Dead Know, among others.

They propose to represent a class defined as follows:  

“All persons or entities domiciled in the United States that own a United States copyright in any work that was used as training data for the OpenAI Language Models during the Class Period [defined as June 28, 2020 to the present].” 

This kind of broad “anyone with a copyright in a work used for training” approach to class definition is repeated in a few other suits. For example, the consolidated Kadrey v. Meta lawsuit has a similar (and overlapping) grouping of fiction author class representatives and an almost identical proposed class definition. Dubus v. NVIDIA is another suit that takes essentially the same approach. 

Other AI lawsuits show more variation in class representatives. Huckabee v. Bloomberg, for example, is another suit with a similar class definition (basically, all copyrighted works owned by someone in the U.S. and used for training Bloomberg’s LLM) but with somewhat different class representatives: mostly authors of religious books and, of course, Mike Huckabee, a politician.

There is at least one class action that is more precise, both in its proposed class representatives and in their relation to the proposed class definition. The now-consolidated Authors Guild v. OpenAI suit has some 28 proposed class representatives, most of whom are authors of best-selling fiction and nonfiction trade books; 14 of them are members of the Authors Guild. In this suit, the plaintiffs propose two classes, one for fiction authors and one for nonfiction authors, and place some restrictions around them. Class members for fiction works must be “natural persons” who are “sole authors of, and legal or beneficial owners of Eligible Copyrights in” fictional works that were registered with the U.S. Copyright Office and used to train the defendants’ LLMs (including persons who are beneficiaries of works held by literary estates). For nonfiction authors, class members are “[a]ll natural persons, literary trusts, and literary estates in the United States who are legal or beneficial owners of Eligible Nonfiction Copyrights,” which the complaint defines as works that were used to train the defendants’ LLMs and that have an ISBN, with the exception of any books classified as reference works (BISAC code REF).

Some challenges and dangers
When you consider the scale and scope of materials used to train the AI models in question, you can immediately see some of the challenges that are likely to arise with relatively small groups of authors attempting to represent practically all individual U.S. copyright owners. 

While the exact training materials used for the models at issue remain opaque, it is clear that they were not trained solely on modern fiction. There is widespread acknowledgment that these models are trained on large amounts of content scraped from across the internet using data sources such as Common Crawl. In effect, this means that these suits implicate the rights of millions of rightsholders, with interests as diverse as those of YouTube content creators, computer programmers, novelists, academics, and more.

How these representatives can fairly and adequately represent such a broad and diverse group, especially when many members may disagree with the underlying motivations for the suit to begin with, is a tough question. Even the consolidated Authors Guild case, which is much more careful in its class definitions, includes classes that are breathtakingly broad once one considers the diversity of authorship within them. The fiction author class, for example, could include everyone from New York Times bestselling authors to fan fiction writers. The nonfiction class, though at least limited to authors of nonfiction books assigned an ISBN, could similarly include everyone from authors of popular self-help books distributed by the millions to authors of scholarly books with print runs in the low hundreds, distributed online on open-access terms. The interests, financial and otherwise, of those authors can vary significantly.

Beyond the adequacy of the representatives (and questions about whether their experiences are really typical of others in the proposed class), there are challenges unique to copyright law. Ownership is opaque: there is no official public record of who owns what, so even ascertaining who actually falls within the class is an initial challenge. Compounding that, works are distributed online under a dizzying variety of terms, some of which may afford AI developers a viable defense for many works. A fair use defense also requires some assessment of the nature of the works used, a fact-intensive inquiry that will vary from one work to another. This just scratches the surface of the issues that likely mean there really aren’t common questions of law or fact among the class.

Conclusion
There are good reasons to think that the classes as currently defined in these lawsuits are too broad. For the reasons mentioned above, we think it will be difficult for courts to certify them as is. But that doesn’t mean authors and other rightsholders should sit back and assume that their interests won’t be co-opted by others who seek to represent them in these suits. We don’t know when the courts will actually address these class certification issues. When they do, it will be important for authors to speak up.