The DMCA 1201 Rulemaking: Summary, Key Takeaways, and Other Items of Interest

Posted November 8, 2024

Last month, we blogged about the key takeaways from the 2024 TDM exemptions recently put in place by the Librarian of Congress, including how the 2024 exemptions (1) expand researchers’ access to existing corpora, (2) definitively allow the viewing and annotation of copyrighted materials for TDM research purposes, and (3) create new obligations for researchers to disclose security protocols to trade associations. Beyond these key changes, the TDM exemptions remain largely the same: researchers affiliated with universities are allowed to circumvent TPMs to compile corpora for TDM research, provided that those copies of copyrighted materials are legally obtained and adequate security protocols are put in place.

We have since updated our resources page on Text and Data Mining and have incorporated the new developments into our TDM report: Text and Data Mining Under U.S. Copyright Law: Landscape, Flaws & Recommendations.

In this blog post, we share some further reflections on the newly expanded TDM exemptions—including (1) the use of AI tools in TDM research, (2) outside researchers’ access to existing corpora, (3) the disclosure requirement, and (4) a potential TDM licensing market—as well as other insights that emerged during the 9th triennial rulemaking.

The TDM Exemption

In other jurisdictions, such as the EU, Singapore, and Japan, legal provisions that permit “text data mining” also allow a broad array of uses, such as general machine learning and generative AI model training. In the US, exemptions allowing TDM so far have not explicitly addressed whether AI could be used as a tool for conducting TDM research. In this round of rulemaking, we were able to gain clarity on how AI tools are allowed to aid TDM research. Advocates for the TDM exemptions provided ample examples of how machine learning and AI are key to conducting TDM research and asked that “generative AI” not be deemed categorically impermissible as a tool for TDM research. The Copyright Office agreed that a wide array of tools could be utilized for TDM research under the exemptions, including AI tools, as long as the purpose is to conduct “scholarly text and data mining research and teaching.” The Office was careful to limit its analysis to those uses and not address other applications such as compiling data, or reusing existing TDM corpora, for training generative AI models; those are an entirely separate issue from facilitating non-commercial TDM research.

Besides clarifying that AI tools are allowed for TDM research and that viewing and annotation are permitted for copyrighted materials, the new exemptions offer meaningful improvement to TDM researchers’ access to corpora. The previous 2021 exemptions allowed access for purposes of “collaboration,” but many researchers interpreted that narrowly, and the Office confirmed that “collaboration” was not meant to encompass outside research projects entirely unrelated to the original research for which the corpus was created. Under the 2021 exemptions, a TDM corpus could only be accessed by outside researchers if they were working on the same research project as the original compiler of the corpus. The 2024 exemptions’ expansion of access to existing corpora has two main components and advantages.

The expansion now allows for new research projects to be conducted on existing corpora, permitting institutions that have created a corpus to provide access “to researchers affiliated with other nonprofit institutions of higher education, with all access provided only through secure connections and on the condition of authenticated credentials, solely for purposes of text and data mining research or teaching.” At the same time, it also opens up new possibilities for researchers at institutions who otherwise would not have access, as the new exemption does not require a precondition that the outside researchers’ institutions otherwise own copies of works in the corpora. The new exemptions impose some important limitations: only researchers at institutions of higher education are allowed this access, and nothing more than “access” is allowed; the exemption does not, for example, allow the transfer of a corpus for local use.

The Office emphasized the need for adequate security protections, pointing back to cases such as Authors Guild v. Google and Authors Guild v. HathiTrust, which emphasized how careful both organizations were, respectively, to prevent their digitized corpora from being misused. To take advantage of this newly expanded TDM exemption, it will be crucial for universities to provide adequate IT support to ensure that technical barriers do not impede TDM researchers. That said, the record for the exemption shows that existing users are exceedingly conscientious when it comes to security. There have been zero reported instances of security breaches or lapses related to TDM corpora being compiled and used under the exemptions. 

As we previously explained, the security requirements have changed in a few ways. The new rule clarifies that trade associations can send inquiries on behalf of rightsholders. However, inquiries must be supported by a “reasonable belief” that the sender’s works are in a corpus being used for TDM research. It remains to be seen how the new obligation to disclose security measures to trade associations will affect TDM researchers and their institutions. The Register circuitously called out demands by trade associations sent to digital humanities researchers in the middle of the exemption process with a two-week response deadline as unreasonable and quoted NTIA (which provides input on the exemptions) in agreement that “[t]he timing, targeting, and tenor of these requests [for institutions to disclose their security protocols] are disturbing.” We are hopeful that this discouragement from the Copyright Office will prevent any future large-scale harassment of TDM researchers and their institutions, but we will also remain vigilant in case trade associations abuse this new power.

Alongside the concerns over disclosure requirements, we have some questions about the Copyright Office’s treatment of fair use as a rationale for circumventing TPMs for TDM research. The Register restated her 2021 conclusion that “under Authors Guild, Inc. v. HathiTrust, lost licensing revenue should only be considered ‘when the use serves as a substitute for the original.’” The Office, in its recommendations, placed considerable weight on the lack of a viable licensing market for TDM, which raises a concern that, in the Office’s view, a use that once was fair and legal might lose that status when the rightsholder starts to offer an adequate licensing option. While this may never become a real issue for the existing TDM exemptions (because no sufficient licensing options exist for TDM researchers, and, given the breadth and depth of content needed, such options seem unlikely ever to develop), it nonetheless contributes to the growing confusion surrounding the stability of a fair use defense in the face of new licensing markets.

These concerns highlight the need for ongoing advocacy in the realm of TDM research. Overall, the Register of Copyrights recognizes TDM as “a relatively new field that is quickly evolving.” This means that we could ask the Library of Congress to relax the limitations placed on TDM if we can point to legitimate research-related purposes. But, due to the nature of this process, it also means TDM researchers do not have a permanent and stable right to circumvent TPMs. Because the exemptions remain subject to review every three years, many large trade associations use each review to advocate for the TDM exemptions to be greatly limited or even canceled, wishing to stifle independent TDM research. We will continue to advocate for TDM researchers, as we did during the 8th and 9th triennial rulemakings.

Looking beyond the TDM exemption, we noted a few other developments: 

Warhol has not fundamentally changed fair use

First, the Opponents of the renewal of the existing exemptions repeatedly pointed to Warhol Foundation v. Goldsmith—the Supreme Court’s most recent fair use opinion—to argue that it has changed the fair use analysis such that the existing exemptions should not be renewed. For example, the Opponents argued that the fair use analysis for repairing medical devices changed under Warhol because, according to them, commercial nontransformative uses were less likely to be fair. The Copyright Office did not agree. The Register said that the same fair use analysis as in 2021 applied and that the Opponents failed “to show that the Warhol decision constitutes intervening legal precedent rendering the Office’s prior fair use analysis invalid.” In another instance where the Opponents tried to argue that commerciality must be given more weight under Warhol, the Register pointed out that under Warhol commerciality is not dispositive and must be weighed against the purpose of the new use.  The arguments for revisiting the 2021 fair use analyses were uniformly rejected, which we think is good news for those of us who believe Warhol should be read as making a modest adjustment to fair use and not a wholesale reworking of the fair use doctrine. 

Does ownership and control of copies matter for access? 

One of the requests before the Office was an expansion of an exemption that allows for access to preservation copies of computer programs and video games. The Office rejected the main thrust of the request but, in doing so, also provided an interesting clarification that may reveal some of the Office’s thinking about the relationship between fair use and access to copies owned by the user: 

“The Register concludes that proponents did not show that removing the single user limitation for preserved computer programs or permitting off-premises access to video games are likely to be noninfringing. She also notes the greater risk of market harm with removing the video game exemption’s premises limitation, given the market for legacy video games. She recommends clarifying the single copy restriction language to reflect that preservation institutions can allow a copy of a computer program to be accessed by as many individuals as there are circumvented copies legally owned.”

That sounds a lot like an endorsement of the idea that the owned-to-loaned ratio, a key concept in the controlled digital lending analysis, should matter in the fair use analysis (which is something the Hachette v. Internet Archive controlled digital lending court gave zero weight to). For future 1201 exemptions, we will have to wait and see whether the Office will use this framework in other contexts. 

Addressing other non-copyright and AI questions in the 1201 process

The Librarian of Congress’s final rule included a number of notes on issues not addressed by the rulemaking: 

“The Librarian is aware that the Register and her legal staff have invested a great deal of time over the past two years in analyzing the many issues underlying the 1201 process and proposed exemptions. 

Through this work, the Register has come to believe that the issue of research on artificial intelligence security and trustworthiness warrants more general Congressional and regulatory attention. The Librarian agrees with the Register in this assessment. As a regulatory process focused on technological protection measures for copyrighted content, section 1201 is ill-suited to address fundamental policy issues with new technologies.” 

Proponents tried to argue that the software platforms’ restrictions and barriers to conducting AI research, such as their account requirements, rate limits, and algorithmic safeguards, are circumventable TPMs under 1201, but the Register disagreed. The Register maintained that the challenges Proponents described arose not out of circumventable TPMs but out of third-party controlled Software as a Service platforms. This decision can be illuminating for TDM researchers seeking to conduct TDM research on online streaming media or social media posts.

The Librarian’s note went on to say: “The Librarian is further aware of the policy and legal issues involving a generalized ‘right to repair’ equipment with embedded software. These issues have now occupied the White House, Congress, state legislatures, federal agencies, the Copyright Office, and the general public through multiple rounds of 1201 rulemaking.

Copyright is but one piece in a national framework for ensuring the security, trustworthiness, and reliability of embedded software, as well as other copyright-protected technology that affects our daily lives. Issues such as these extend beyond the reach of 1201 and may require a broader solution, as noted by the NTIA.”

These notes give an interesting, though somewhat confusing, insight into how the Librarian of Congress and the Copyright Office think about the role of 1201 rulemaking when they address issues that go beyond copyright’s core concerns. While we can agree that 1201 is ill-suited to address fundamental policy issues with new technology, it is also somewhat concerning that the Office and the Librarian view copyright more generally as part of a broader “national framework for ensuring the security, trustworthiness, and reliability of embedded software.” While, of course, copyright is sometimes used to further ends outside of its intended purpose, these issues are far from the core constitutional purpose of copyright law, and we think they are best addressed through other means.

Copyright Management Information, 1202(b), and AI

Posted October 30, 2024

This post is by Maria Crusey, a third-year law student at Washington University in St. Louis. Maria has been working with Authors Alliance this semester on a project exploring legal claims in the now 30+ pending copyright AI lawsuits. 

In the recent spate of copyright infringement lawsuits against AI developers, many plaintiffs allege that the developers violated 17 U.S.C. § 1202(b) in using copyrighted works to train and develop AI systems.

Section 1202(b) prohibits the “removal or alteration of copyright management information.” Compared to related provisions in 17 U.S.C. § 1201, which protects against circumvention of copyright protection systems, §1202(b) has seldom been litigated at the appellate level, and there’s a growing divide among district courts about whether §1202(b) should apply to derivative works, particularly those created using AI technology.

At first glance, §1202(b) appears to be a straightforward provision. However, the uptick in §1202(b) claims raises some challenging questions, namely: How does §1202(b) apply to the use of a copyrighted work as part of a dataset that must be cleaned, restructured, and processed in ways that separate copyright management information from the content itself? And how should §1202(b) apply to AI systems that may reproduce small portions of content contained in training data? Answers to these questions may have serious implications in the AI suits because violations of §1202(b) can come with hefty statutory damage awards: between $2,500 and $25,000 for each violation. Spread across millions of works, the damages could be staggering. How the courts resolve this issue could also impact many other reuses of copyrighted works, from analogous uses such as text data mining research to much more routine re-distribution of copyrighted works in other contexts.
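
To give a rough sense of scale, the statutory range above can simply be multiplied out. The following is a minimal sketch, not a damages model: the violation counts are hypothetical, the function name is ours, and it assumes one violation per work at a flat per-violation amount, which a court would not necessarily accept.

```python
# Statutory range per violation, as stated in the paragraph above.
MIN_PER_VIOLATION = 2_500
MAX_PER_VIOLATION = 25_000


def exposure_range(violations: int) -> tuple[int, int]:
    """Return (low, high) total statutory damages in dollars.

    Assumes every violation is awarded the same per-violation amount,
    which is a simplification made only for illustration.
    """
    if violations < 0:
        raise ValueError("violations must be non-negative")
    return violations * MIN_PER_VIOLATION, violations * MAX_PER_VIOLATION


# Hypothetical counts, chosen only to illustrate the scale.
for count in (1_000, 1_000_000):
    low, high = exposure_range(count)
    print(f"{count:>9,} violations: ${low:,} to ${high:,}")
```

On these assumptions, one million violations would put the range at $2.5 billion to $25 billion.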

The plaintiffs in one of these AI cases have asked the Ninth Circuit Court of Appeals to accept an interlocutory appeal on just this issue, and we are waiting to see whether the court will accept it.

For an introduction to §1202(b) and observations on this question, among others, read on:

What is § 1202(b) and what is it intended to do?

Broadly, 17 U.S.C. § 1202 is a provision of the Digital Millennium Copyright Act (DMCA) that protects the integrity of copyright management information (“CMI”). Per §1202(c), CMI comprises certain information identifying a copyrighted work, often including the title, the name of the author, and terms and conditions for the use of a work.

Section 1202(b) forbids the alteration or removal of copyright management information. The section provides that:

“[n]o person shall, without the authority of the copyright owner or the law – 

(1) intentionally remove or alter any CMI,

(2) distribute or import for distribution CMI knowing that the CMI has been removed or altered without authority of the copyright owner or the law, or 

(3) distribute, import for distribution, or publicly perform works, copies of works or phonorecords, knowing that copyright management information has been removed or altered without authority of the copyright owner or the law, knowing, or with respect to civil remedies under section 1203, having reasonable grounds to know that it will induce, enable, facilitate, or conceal an infringement of any right under this title.”

17 U.S.C. § 1202(b).

Congress primarily aimed to limit the assistance and enablement of copyright infringement in its enactment of §1202(b). This purpose is evident in the legislative history of the provision. In an address to a congressional subcommittee prior to the adoption of the DMCA, the then–Register of Copyrights, Marybeth Peters, discussed the aims of §1202(b). First, Peters noted that the requirements of §1202(b) would make CMI more reliable and thus aid in the administrability of copyright law. Second, Peters stated that §1202(b) would help prevent instances of copyright infringement that could come from the removal of CMI. The idea is if a copyrighted work lacks CMI, there is a greater likelihood of infringement since others may use the work under the pretense that they are the author or copyright holder. In creating a statutory violation for a party’s removal of CMI, regardless of later infringing activity, §1202(b) functions as damage control against potential copyright infringement.

What are the essential elements of a § 1202(b) claim?

To have a claim under §1202(b), a plaintiff must allege particularized facts about the existence and alteration or removal of CMI. Additionally, some courts require a plaintiff to demonstrate that the defendant had knowledge that the CMI was being altered or removed and that the alteration or removal would enable copyright infringement. Finally, some courts have required plaintiffs to show that the work with the altered or removed CMI is an exact copy of the original work–what has become known as the “identicality” requirement. This last “identicality” requirement is one of the main issues in the AI lawsuits raising §1202(b) and is detailed further below.

→ The “Identicality” Requirement

Courts that have imposed “identicality” have required that plaintiffs demonstrate that the work with the removed CMI is an exact copy of the original work and thus is “identical,” except for the missing or altered CMI. 

Suppose, for example, a photographer owns the copyright to a photograph they took. The photographer adds CMI to the photograph and takes care to protect the integrity of the work as it is dispersed online. A third party captures the photograph posted on a website by taking a screenshot and removes the CMI from the copied image while keeping all other aspects of the original photograph the same. The screenshot with the removed CMI is an “exact copy” of the original photograph because the only difference between the copyrighted photograph and the screenshot is the removal of the CMI.

Federal courts are divided on whether to impose the identicality requirement for §1202(b) claims, and the circuit courts have not yet addressed the issue. Notably, district courts within the Ninth Circuit have varied in their treatment of the identicality requirement. For example, the court for the District of Nevada in Oracle v. Rimini Street declined to impose the identicality requirement because the requirement may weaken the intended protections for copyright holders under §1202(b). Conversely, in Kirk Kara Corp. v. W. Stone & Metal Corp., a court in the Central District of California applied the identicality requirement, though it provided little explanation for why it adopted it. Application of the identicality requirement is also unsettled in district courts beyond the Ninth Circuit (see, for example, this Southern District of Texas case discussing at length the identicality requirement and rejecting it).

What are the §1202(b) claims at issue in the present suits?

The claims in Doe 1 v. GitHub exemplify the §1202(b) issues common among the present suits, and it is the GitHub suit that the Ninth Circuit Court of Appeals has been asked to take up on appeal, if it wishes.

In GitHub, owners of copyrights in software code brought a suit against GitHub, a software developer platform. The plaintiffs alleged that GitHub Copilot, an AI product developed in part by GitHub, illegally removed CMI from their works. The plaintiffs stored their software in GitHub’s publicly accessible software repositories under open-source license agreements. The plaintiffs claimed that GitHub removed CMI from their code and trained the Copilot AI model on the code in violation of the license agreements. Moreover, the plaintiffs claimed that, when prompted to generate software code, Copilot includes unique aspects of the plaintiffs’ code in its outputs. In their complaint, the plaintiffs alleged that all requirements for a valid § 1202(b) claim were met in the present suit. The plaintiffs stressed that, in removing CMI, the defendants failed to prevent users of the product from making infringing use of it. Consequently, they claim, the defendants removed the CMI, knowing that it would “induce, enable, facilitate, and/or conceal infringement” of copyrights in violation of the DMCA.

Regarding the §1202(b) claims, the parties contest the application of the identicality requirement. The plaintiffs first argue that § 1202 contains no such requirement: “The plain language of DMCA § 1202 makes it a violation to remove or alter CMI. It does not require that the output work be original or identical to obtain relief. . . By a plain reading of the statute, there is no need for a copy to be identical—there only needs to be copying, which Plaintiffs have amply alleged.” 

As a backstop, the plaintiffs further argue that Copilot does produce “near-identical reproduction[s]” of their copyrighted code and allege this is sufficient to fulfill the identicality requirement under §1202(b). Specifically, plaintiffs claimed that Copilot generates parts of plaintiffs’ code in extra lines of output code that are not relevant to input prompts. Plaintiffs also claimed Copilot generates their code in output code that produces errors due to a mismatch between the directly copied code and the code that would actually fit the prompt. To make this assertion work, plaintiffs distinguish their version of “identicality” –semantically equivalent lines of code–from a reproduction of the whole work. They argue that the defendant’s position, that “the reproduction of short passages that may be part of [a] larger work, rather than the reproduction of an entire work, is insufficient to violate Section 1202,” would lead to absurd results. “By OpenAI’s logic, a party could copy and distribute a fragment of a copyrighted work—say, a chapter of a book, a stanza of a poem, or a scene from a movie—and face no repercussions for infringement.” 

In their reply, the defendants countered that §1202, which defines CMI as relating to a “copy of a work,” requires a complete and identical copy, not just snippets. Defendants noted that the plaintiffs have conceded that Copilot reproduces only snippets of code rather than complete versions of the code. Therefore, the defendants argue, Copilot does not create “identical copies” of the plaintiffs’ complete copyrighted works. The argument rests on the text of the statute (the defendants note that the statute only provides for liability when distributing copies that CMI has been stripped from, not derivatives, abridgments, or other adaptations). The defendants bolster it by suggesting that allowing 1202 claims for incomplete copies would create chaos for ordinary uses of copyrighted works: “On Plaintiffs’ reading of § 1202, if someone opened an anthology of poetry and typed up a modified version of a single “stanza of a poem,” . . . without including the anthology’s copyright page, a § 1202(b) claim would lie. Plaintiffs’ reading effectively concedes that they are attempting to turn every garden-variety claim of copyright infringement into a DMCA claim, only without the usual limitations and defenses applicable under copyright law. Congress intended no such thing.”

The GitHub court has now addressed the issue several times: it initially dismissed the plaintiffs’ §1202(b)(1) and (b)(3) claims, subsequently denied the plaintiffs’ motion for reconsideration of the claims, allowed the plaintiffs to amend their complaint and try again with more specificity, then dismissed the claims again. The reasoning of the court has been consistent, and largely focused on insufficient allegations of identicality. The court agreed with Defendants that the identicality requirement should apply and that the snippets do not satisfy the requirement. Following the dismissal, the plaintiffs sought and received permission from the district court to file an interlocutory appeal (an appeal on a specific issue before the case is fully resolved, something not usually allowed) to the Court of Appeals for the Ninth Circuit to determine whether § 1202(b)(1) and (b)(3) impose an identicality requirement. The Ninth Circuit is presently considering whether to hear the appeal.

What would the Ninth Circuit assess in the appeal, and what are the implications of the appeal for future lawsuits?

If the appeal is accepted, the Ninth Circuit will determine whether §1202(b)(1) and (b)(3) actually impose an identicality requirement. Moreover, with regard to the facts of the GitHub case, the court will decide whether the identicality requirement requires exact copying of a complete copyrighted work, or perhaps something less. The Ninth Circuit’s hearing of this appeal would be notable for a number of reasons.

First, as mentioned above, §1202(b) is largely unaddressed by the circuit courts, and explicit appellate guidance has only been provided for the knowledge requirement referenced above. Consequently, determinations of §1202(b) claims are largely informed by varying district court decisions that are binding only on the parties to the suits and provide inconsistent interpretations of the requirements for a claim under the provision. An appellate ruling that accepts or rejects the identicality requirement would create additional binding authority to further clarify courts’ interpretations of §1202(b).

Second, a ruling on the identicality requirement from the Ninth Circuit specifically would be notable because it would be binding on the large number of §1202(b) claims presently being litigated in the Ninth Circuit’s lower courts. And, given the concentration of AI developers operating in California and elsewhere in the Ninth Circuit, the outcome of the appeal would significantly impact future lawsuits that involve §1202(b) claims.

It is hard to predict how the Ninth Circuit might rule, but we can work through some of the implications of the choices the court would have before it: 

If the Ninth Circuit interprets the identicality requirement as requiring a complete and exact copy, it would impose a high standard for the requirement and plaintiffs would likely be constrained in their ability to bring §1202(b) claims. If the court did this, the GitHub plaintiffs’ claims would likely fail, as the allegedly copied snippets of code generated by Copilot are not exact copies and do not comprise the complete copyrighted works. This hypothetical standard would be advantageous for individuals who remove CMI from copyrighted works in the course of processing them using AI as well as those who deploy AI systems that produce small portions of content similar (but not exactly so) to inputs. So long as the works being processed or distributed are not complete exact copies, individuals would be free to alter the CMI of the works for ease in analyzing the copyrighted information.

Alternatively, the Ninth Circuit could adopt a loose interpretation of identicality in which incomplete and inexact copying would be sufficient. One approach would be to require identicality but not copying of the entire work (something the plaintiffs in the GitHub suit advocate for). How the parties or the Ninth Circuit would formulate this “less than entire” but still “near identical” standard is hard to say, but presumably, plaintiffs would have an easier time alleging facts sufficient for a §1202(b) claim. Applied to GitHub, it remains unclear whether the copied snippets of the plaintiffs’ code in the Copilot outputs could pass muster (this is likely a factual question to be determined at later stages of the litigation). But it could allow claims to at least survive an early motion to dismiss. As such, the adoption of this standard could limit how AI developers engage with works but also potentially affect others, such as researchers using similar techniques to process, clean, and distribute small portions of copyrighted works as part of a dataset.

Finally, the Ninth Circuit may decide to do away with the identicality requirement altogether. While this may seem like a potential boon to plaintiffs, who could allege removal of CMI and distribution of some copied material, no matter how small, plaintiffs would still face substantial challenges. Elimination of the identicality requirement would likely lead to greater weight being placed on the knowledge requirement in courts’ assessments of §1202(b) claims, which requires that defendants know or have reasonable grounds to know that their actions will “induce, enable, facilitate, or conceal an infringement.” In the context of the GitHub case, even without an identicality requirement, the plaintiffs’ §1202(b) claims contain scant factual allegations about the defendants’ CMI removal and knowledge in the court filings to date. For other developers and users of AI, the effects of not having an identicality requirement would likely vary on a case-by-case basis.

Conclusion

Recent copyright infringement suits and the pending appeal to the Ninth Circuit in Doe 1 v. GitHub demonstrate that §1202(b) is having its day in the sun. Although the provision has been overlooked and infrequently litigated in the past, the scope of protections granted by §1202(b) is important for understanding whether and how AI developers can remove CMI when processing, restructuring, and analyzing copyrighted works for AI development. Thus, as lawsuits against AI developers and users continue to progress, the requirements for a valid §1202(b) claim are sure to become even more contentious.

Text Data Mining Research DMCA Exemption Renewed and Expanded

Posted October 25, 2024
U.S. Copyright Office 1201 Rulemaking Process, taken from https://www.copyright.gov/1201/

Earlier today, the Library of Congress, following recommendations from the U.S. Copyright Office, released its final rule adopting exemptions to the Digital Millennium Copyright Act’s prohibition on circumvention of technological protection measures (e.g., DRM).

As many of you know, we’ve been working closely with members of the text and data-mining community as well as our co-petitioners, the Library Copyright Alliance (LCA) and the American Association of University Professors (AAUP), to petition for renewal of the existing TDM research exemption and to expand it to allow researchers to share their research corpora with other researchers outside of their university (something not previously allowed). The process began over a year ago and followed an in-depth review process by the U.S. Copyright Office. 

We are very pleased to see that the Librarian of Congress both approved the renewal of the existing exemption and approved an expansion that allows for research universities to provide access to TDM corpora for use by researchers at other universities. 

The expanded rule is poised to make an immediate impact in helping TDM researchers collaborate and build upon each other’s work. As Allison Cooper, director of Kinolab and Associate Professor of Romance Languages and Literatures and Cinema Studies at Bowdoin College, explains:

“This decision will have an immediate impact on the ongoing close-up project that Joel Burges, Emily Sherwood, and I are working on by allowing us to collaborate with researchers like David Bamman, whose expertise in machine learning will be valuable in answering many of the ‘big picture’ questions about the close-up that have come up in our work so far.”

These are the main takeaways from the new rule: 

  • The exemption has been expanded to allow “access” to corpora by researchers at other institutions “solely for purposes of text and data mining research or teaching.” There is no more requirement that access be granted as part of a “collaboration,” so new researchers can ask new and different questions of a corpus. Access must be credentialed and authenticated.
  • The issue of whether a researcher can engage in “close viewing” of a copyrighted work has been resolved—as the explanation for the revised rule puts it, researchers can “view the contents of copyrighted works as part of their research, provided that any viewing that takes place is in furtherance of research objectives (e.g., processing or annotating works to prepare them for analysis) and not for the works’ expressive value.” This is a very helpful clarification!
  • The new rule also modified the existing security requirements, which provide that researchers must put in place adequate security protocols to protect TDM corpora from unauthorized reuse and must share information about those security protocols with rightsholders upon request. That rule has been limited in some ways and expanded in others. The new rule clarifies that trade associations can send inquiries on behalf of rightsholders. However, inquiries must be supported by a “reasonable belief” that the sender’s works are in a corpus being used for TDM research.

Later on, we will post a more in-depth analysis of the new rules–both TDM and others that apply to authors. The Librarian of Congress also authorized the renewal of a number of other rules that support research, teaching, and library preservation. Among them is a renewal of another exemption that Authors Alliance and AAUP petitioned for, allowing for the circumvention of digital locks when using motion picture excerpts in multi-media ebooks. 

Thank you to all of the many, many TDM researchers and librarians we’ve worked with over the last several years to help support this petition. 

You can learn more about TDM and our work on this issue through our TDM resources page, here.

Who Represents You in the AI Copyright Lawsuits? 

Posted October 16, 2024

Sarah Silverman is the author of The Bedwetter, a comedy memoir. Richard Kadrey wrote Sandman Slim, a fantasy novel series. Christopher Golden, a supernatural thriller titled Ararat.

These authors might not seem to have much in common with an academic author who writes in history, physics, or chemistry. Or a journalist. Or a poet. Or, for that matter, me, writing this blog post.  And yet, these authors may end up representing us all in court. 

A large number of the recent AI copyright lawsuits are class action lawsuits. This means that these lawsuits are brought by a small number of plaintiffs who (subject to judicial approval) are granted the right to represent a much larger class. In many of the AI copyright lawsuits, the proposed classes are extraordinarily broad, including many creators who might be surprised that they are being represented. If you live in the US and wrote something that was published online, there is a good chance that you are included in several of these classes.

A very brief background on class action lawsuits

Class actions can be an efficient way of resolving disputes that involve lots of people, allowing for a single resolution that binds many parties when there are common interests and facts. As you can imagine, the class action mechanism can also attract misuse, for example, by plaintiffs (and their attorneys) who may seek large settlements on behalf of a large number of people. Those settlements may benefit the named plaintiffs and their attorneys but they aren’t really aligned with the interests of most class members. 

There are rules in place to prevent that kind of abuse. In federal courts (where all copyright lawsuits must be brought), Rule 23 of the Federal Rules of Civil Procedure governs. It provides that:

“One or more members of a class may sue or be sued as representative parties on behalf of all members only if: 
(1) the class is so numerous that joinder of all members is impracticable; [“numerosity”]
(2) there are questions of law or fact common to the class; [“commonality”]
(3) the claims or defenses of the representative parties are typical of the claims or defenses of the class; [“typicality”] and
(4) the representative parties will fairly and adequately protect the interests of the class. [“adequacy”]”

The rest of Rule 23 contains a number of other safeguards to protect both class members and defendants. Among them are requirements that the court certify that the class complies with Rule 23, that any proposed settlements be approved by the court, and that class members receive notice of any proposed settlement and an opportunity to object. Additionally, there are a number of rules to ensure that the law firm bringing the suit can fairly and competently represent the class members.

Class definition and class representatives in the copyright AI lawsuits

We believe it’s important for creators to pay attention to these suits because if a class is certified and that class includes those creators, the class representatives will have meaningful legal authority to speak on their behalf.  

Rule 23  provides that “at an early practicable time after a person sues,” the court must decide whether to certify the proposed class. Though we are now well over a year into some of the earliest suits filed, this has yet to happen. In the meantime what we have are proposed class definitions offered by plaintiffs. How broadly or narrowly a class is defined by the plaintiffs will be one of the most important factors in whether the class can be certified since it will directly affect the commonality of facts among the class, the typicality of claims, and whether the representatives can fairly and adequately represent the interests of the class. Plaintiffs have the burden of proving that they have satisfied Rule 23. 

In these AI lawsuits, we see some themes in terms of class representatives and proposed classes, with many offering very broad class definitions. For example, in the now-consolidated In re OpenAI ChatGPT Litigation, the class representatives are 11 fiction writers of books such as The Cabin at the End of the World, The Brief Wondrous Life of Oscar Wao, What the Dead Know and others.

They propose to represent a class defined as follows:  

“All persons or entities domiciled in the United States that own a United States copyright in any work that was used as training data for the OpenAI Language Models during the Class Period [defined as June 28, 2020 to the present].” 

This kind of broad “anyone with a copyright in a work used for training” approach to class definition is repeated in a few other suits. For example, the consolidated Kadrey v. Meta lawsuit has a similar (and overlapping) grouping of fiction author class representatives and an almost identical proposed class definition. Dubus v. NVIDIA is another suit that takes essentially the same approach. 

Other AI lawsuits have more variation in class representatives. Huckabee v. Bloomberg, for example, is another suit with a similar class definition (basically, all copyrighted works owned by someone in the US and used for training Bloomberg’s LLM) but with class representatives that are a bit different: mostly authors of religious books and of course, Mike Huckabee, a politician. 

There is at least one class action that is more precise both in terms of proposed class representatives and their relation to the proposed class definition. The now-consolidated Authors Guild v. OpenAI suit has some 28 proposed class representatives, most of whom are authors of best-selling fiction and non-fiction trade books, 14 of whom are members of the Authors Guild. In this suit, the plaintiffs propose two classes: one for fiction authors and one for non-fiction authors. The complaint also places some restrictions on them: class members for fiction works must be “natural persons” who are “sole authors of, and legal or beneficial owners of Eligible Copyrights in” fictional works that were registered with the U.S. Copyright Office and used for training the defendants’ LLMs (and this includes persons who are beneficiaries of works held by literary estates). For nonfiction authors, class members are “[a]ll natural persons, literary trusts, and literary estates in the United States who are legal or beneficial owners of Eligible Nonfiction Copyrights,” which the complaint defines as works that were used to train defendants’ LLMs and that have an ISBN, with the exception of any books classified as reference works (BISAC code REF).

Some challenges and dangers

When you consider the scale and scope of materials used to train the AI models in question, you can immediately see some of the challenges that are likely to arise with relatively small groups of authors attempting to represent practically all individual U.S. copyright owners. 

While the exact training materials used for the models at issue remain opaque, it’s definitely true that they were not just trained on modern fiction. There is widespread acknowledgment that these models are trained on a large amount of content scraped from across the internet using data sources such as Common Crawl. This, in effect, means that these suits implicate the rights of millions of rights holders, with interests as diverse as those of YouTube content creators, computer programmers, novelists, academics, and more. 

How these representatives can fairly and adequately represent such a broad and diverse group (especially when many may disagree with the underlying motivations for the suit to begin with) is a tough question. Even the Authors Guild consolidated case, which is much more careful in terms of class definition, includes classes that are breathtakingly broad when one considers the diversity of authorship within them. The fiction author class, for example, could include everyone from NY Times bestselling authors to fan fiction writers. The nonfiction class, which is at least limited to nonfiction book authors of works assigned an ISBN, could similarly include everyone from authors of popular self-help books distributed by the millions to authors of scholarly books with print runs in the low hundreds and distributed online on open-access terms. The interests, financial and otherwise, of those authors can vary significantly.

Beyond the adequacy of representatives (along with questions about whether their experiences are really typical of others in the proposed class), there are other challenges unique to copyright law. One is the opaque nature of ownership: there is no official public record of who owns what, which makes ascertaining who actually falls within the class an initial challenge. Compounding that, there is a dizzying variety of unique terms under which works are distributed online, some of which may afford AI developers a viable defense for many works. A fair use defense also requires some level of assessment of the nature of the works used, a fact-intensive inquiry that will vary from one work to another. This just scratches the surface of some of the issues that likely mean there really aren’t common questions of law or fact among the class.

Conclusion

There are good reasons to think that the classes as currently defined in these lawsuits are too broad. For some of the reasons mentioned above, I think it will be difficult for courts to certify them as is. But this doesn’t mean authors and other rightsholders should sit back and assume that their interests won’t be co-opted by others in these suits who seek to represent them. We don’t know when the courts will actually address these class certification issues in these suits. When they do, it will be important for authors to speak up. 

Antitrust Lawsuit Filed Against Large Academic Publishers

Posted September 17, 2024

On September 12, a San Francisco-based law firm filed an antitrust lawsuit on behalf of UCLA professor Lucina Uddin against six prominent academic publishers and the trade association that represents them: Elsevier, John Wiley & Sons, Sage Publications, Springer Nature, Taylor & Francis, Wolters Kluwer, and the International Association of Scientific, Technical, and Medical Publishers (“STM”). The suit is brought on behalf of a class that it defines as “All natural persons residing in the United States who performed peer review services for, or submitted a manuscript for publication to, any of the Publisher Defendants’ peer-reviewed journals from September 12, 2020 to the present.” The complaint lists just one claim for relief: that “Publisher Defendants and their co-conspirators entered into and engaged in unlawful agreements in restraint of the trade and commerce described above in violation of Section 1 of the Sherman Act, 15 U.S.C. § 1.” 

To support this claim, the plaintiff makes three key allegations, namely that the publishers have illegally agreed among themselves to abide by:

  1. a “Single Submission Rule,” where researchers are only allowed to submit a manuscript to one journal for consideration unless the journal rejects it;
  2. a “Unpaid Peer Review Rule,” where journals implement policies to not compensate peer reviewers for their labor; and
  3. a “Gag Rule,” where researchers are not allowed to share or discuss their manuscript once they have submitted it to a journal for consideration before the journal publishes it.

Why would any of these actions constitute an antitrust violation? We thought a little background could be helpful: 

To understand this lawsuit, we must first consider the purpose of U.S. antitrust law. The fundamental goal of antitrust law is to encourage competition and ultimately to promote consumer welfare. The Supreme Court explains that: “Congress designed the Sherman Act as a consumer welfare prescription.” 

Section 1 of the Sherman Antitrust Act does this by prohibiting “[e]very contract, combination in the form of trust or otherwise, or conspiracy, in restraint of trade.” This generally requires proving two things: (1) some sort of agreement or business arrangement, and (2) that this agreement is “in restraint of trade,” i.e., unreasonably harmful to competition.  

Proving an agreement can sometimes be a complicated factual question, though often there are good clues, especially when joint activity is coordinated through a trade association (antitrust lawyers love to quote Adam Smith on trade associations: “People of the same trade seldom meet together, even for merriment and diversion, but the conversation ends in a conspiracy against the public, or in some contrivance to raise prices.”). In this case, the plaintiff says that the agreements are so obvious that they are in fact published openly in several portions of the STM’s “International Ethical Principles for Scholarly Publication,” which are then implemented and enforced by each publisher.

Proving the second part, that the agreement is in restraint of trade or unreasonably harmful to competition, can be more complicated. The courts have developed three different analytical frameworks for evaluating whether conduct harms competition in this context:

  1. A “per se” rule for agreements that are “nakedly” anticompetitive. Examples include agreements to fix prices (what’s alleged here, at least with respect to payment for peer review), bid-rigging, agreements to divide markets, and a few other kinds of less common agreements. 
  2. A “rule of reason” test—which applies in most cases—that weighs the pro-competitive effects of the agreement against anti-competitive effects and prohibits agreements only when the anti-competitive harms outweigh the benefits.
  3. “Intermediate” or “quick-look” scrutiny, which covers a small number of cases in which agreements look suspicious on their face but are not so obviously anticompetitive as to fall under the “per se” rule. 

The plaintiff’s complaint claims that the publishers’ agreements violate the Sherman Act regardless of which of these three tests applies. But often, some of the most significant battles in antitrust lawsuits are about which of these standards applies, since the costs associated with litigating a suit can change dramatically depending on which is used. If the court accepts that the “per se” standard applies, the plaintiff likely wins. If the “quick-look” rule applies, the burden is on the defendant to show that its conduct is not anticompetitive. But if the “rule of reason” standard applies, the suit will likely involve extensive discovery, expert witnesses, and other factual evidence about what exactly the market is, whether the defendants had sufficient market power to negatively affect competition, and whether the agreement would negatively or positively affect competition.

In these cases, defining the relevant market is often crucial. In most antitrust cases, defining the relevant market involves identifying substitutes for the product under review. On this point, the plaintiff argues:  

 “Publication by peer-reviewed journals is a relevant antitrust market. For scholars who seek to communicate their scientific research, there is no adequate substitute for publication in a peer-reviewed journal. Peer-reviewed journals establish the validity of scientific research through the peer review process, communicate that research to the scientific community, and avoid competing claims to the same scientific discovery.”

For market definition, it is important to include only close substitutes and exclude those that are distant. For academic publishing, even though there are many ways authors can share their manuscripts—from emailing their colleagues, to posting on their personal websites, to publishing with upstart new journals—the fact remains that the journals in question are the primary means of dissemination and are the most heavily read and heavily cited. So at first glance, the complaint seems realistic in its formulation of the market at issue. 

The complaint then goes on to argue that the publishers hold significant power in this market and misuse that power, touching on familiar themes such as how these academic publishers extract significant profits, charge high rates for access and increasingly high fees for authors to publish openly, and so on. The plaintiff alleges that this market power has allowed the publishers to make agreements among themselves (the three allegations noted above) in ways that allow them to maximize profits while also maintaining their market power. We note that there may be other explanations for these practices; in fact, some authors may be proponents of some of them, for example, the single-submission rule. But even if there are other explanations for these rules, the allegation is that under the current arrangement agreed through STM, no member publisher will even try to compete or develop other approaches that may drive up the price paid to peer reviewers, compete for placement of papers, etc.

How this lawsuit will turn out is hard to predict. This lawsuit flags some very problematic practices enforced on the academic publishing industry by prominent publishers. But there are many other problems with the academic publishing industry not discussed in the complaint. We’ve long thought that the public-interest nature of academic research and publishing is complicated when paired with commercial publishers who have strong incentives to maximize profits. Of course, even for-profit firms are expected to operate within the law; profit-maximizing to the point of adopting anti-competitive practices is fundamentally at odds with their essential social responsibilities.

Hachette v. Internet Archive Update: Second Circuit Court of Appeals Rules Against Internet Archive

Posted September 5, 2024

We got a disappointing decision yesterday from the Second Circuit Court of Appeals in the long-running Hachette v. Internet Archive (IA) copyright lawsuit about IA’s digitization and lending of books. The Court affirmed the district court’s decision that IA cannot circulate digital copies of books it has legitimately acquired in physical form, even when it circulates no more copies than it has legitimately acquired and lends each to a single user at a time, just as a physical book would be loaned.

The Court, focusing on IA’s lending of digitized books that were available for license as ebooks from the publishers, concluded that IA’s fair use defense fails. We think this decision will result in a meaningful reduction in access to knowledge. This is sad news for many authors who have relied on IA’s Open Library for research and discovery, and for readers who have used Open Library to find authors’ works. However, we also view it as a decision limited to its facts, that is, to IA’s particular implementation of controlled digital lending (CDL) and, more specifically, its lending of books that are already available in licensed digital formats.

We plan to do a more in-depth analysis of the Court’s decision later, but for now, we offer some initial thoughts. First, there are a couple of bright spots in the opinion: 

1) The Court rejected the district court’s conclusion that IA was engaged in commercial use when looking at the first factor of fair use. The publishers argued IA’s lending of digitized books was commercial in nature because IA received a few thousand dollars from a for-profit used-bookseller and also solicited donations on its website. The Court rightly pointed out that if that were the standard, virtually every nonprofit that solicits donations would by default only be able to engage in commercial use. This was an issue we and others strongly urged the Court to address, and we’re glad it did.

2)  For the most part, the Court focused its analysis on the facts of the case, which was really about IA lending digitized copies of books that were already available in ebook form and licensable from the publishers. The legal analysis in several places turned on this fact, which we think leaves room to make fair use arguments regarding programs to digitize and make available other books, such as print books for which there is no licensed ebook available, out-of-print books, or orphan works. CDL will remain an important framework, especially considering the lack of an existing digital first-sale doctrine.  

We are also disappointed by several key points in the decision: 

One was the Court’s assessment of the first fair use factor, “purpose and character of the use.” The Court’s analysis of this factor was in some ways unsurprising but nevertheless disappointing. The Court did little more than conclude that the use was not transformative and, therefore, not fair use. Though we think there are strong arguments that CDL is transformative, whether CDL is “transformative” is just one of the supporting rationales for the argument that CDL is fair use. The other justifications (that CDL supports teaching, scholarship, and research, complements the first sale doctrine, and supports the public-interest mission of libraries) are at the heart of CDL. The Court didn’t engage with those other arguments at all and also offered no meaningful discussion of cases where non-transformative copying supported a fair use finding because of the public benefits.

A second key issue is about whether IA’s digital lending negatively impacts the market for the original works. This issue probably deserves a whole blog post to itself, but in short the analysis came down to who shoulders the burden of proving or disproving market harm, and what default assumptions the court has about market harm.  The following quotes from the decision will give you a sense of how the Court analyzed the issue: 

[a]lthough they do not provide empirical data of their own, Publishers assert that they (1) have suffered market harm due to lost eBook licensing fees and (2) will suffer market harm in the future if IA’s practices were to become widespread.  IA argues that Publishers cannot rely on the “common-sense inference” of market harm without data to back that up, citing American Society for Testing & Materials v. Public.Resource.Org, Inc. [citations omitted]. . . . We agree with Publishers’ assessment of market harm. 

Despite IA’s experts having offered meaningful data and analysis indicating a lack of market harm on sales of publishers’ books, the Court went on to say: 

We are likewise convinced that “unrestricted and widespread conduct of the sort engaged in by [IA] would result in a substantially adverse impact on the potential market for [the Works in Suit].” . . . Though Publishers have not provided empirical data to support this observation, we routinely rely on such logical inferences where appropriate in assessing the fourth fair use factor. . . . Thus, we conclude it is “self-evident” that if IA’s use were to become widespread, it would adversely affect Publishers’ markets for the Works in Suit.

We are also disappointed by how the Court portrayed the overall public benefit of IA’s lending and its long-term effect: “while IA claims that prohibiting its practices would harm consumers and researchers, allowing its practices would―and does―harm authors.” We think this is a gross generalization and mischaracterization of how IA’s digital lending affects most authors. Authors are researchers. Authors are readers. IA’s digital library helps authors create new works and supports their interests in having their works read. This ruling may benefit the largest publishers and most prominent authors, but for most, it will end up harming more than it will help. 

Authors Alliance and SPARC Supporting Legal Pathways to Open Access for Scholarly Works

Posted August 27, 2024

Authors Alliance and SPARC are excited to announce a new collaboration to address critical legal issues surrounding open access to scholarly publications. 

One of our goals with this project is to clarify legal pathways to open access in support of federal agencies working to comply with the Memorandum on “Ensuring Free, Immediate, and Equitable Access to Federally Funded Research,” (the “Nelson Memo”) which was issued by the White House’s Office of Science and Technology Policy in 2022. For more than a decade, federal open access policy was based on an earlier memo instructing federal agencies with research and development budgets over $100 million to make their grant-funded research publicly accessible for free online. The Nelson Memo, drawing from lessons learned during the COVID-19 Pandemic, provides important updates to the prior policy. Among the key changes are extending the requirements to all agencies, regardless of budget, and eliminating the 12-month post-publication embargo period on articles. 

The Nelson Memo raises important legal questions for agencies, universities, and individual researchers to consider. To help ensure smooth implementation of the Nelson Memo, we plan to produce a series of white papers addressing these questions. For example, a central issue is the nature and extent of the pre-existing license, known as the “Federal Purpose License,” which all federal grant-making agencies have in works produced using federal funds.  The white papers will outline the background and history of the License, and also address commonly raised questions, including whether the License would support the application of Creative Commons or other public licenses; possible constitutional or statutory obstacles to the use of the License for public access; whether the License may apply to all versions of a work; and whether the use of the License for public access would require modification of university intellectual property policies. 

In addition to the white paper series, we plan to convene a group of experts to update the SPARC Author Addendum. The Addendum was created in 2007 and has been an extremely useful tool in educating authors on how to retain their rights, both to provide open access to their scholarship and to allow for wide use of their work. However, in the nearly two decades since its creation, models for open access and scholarly publishing have changed dramatically. We aim to update the Addendum to more closely reflect the present open access landscape and to help authors to better achieve their scholarship goals.

A final piece of the project is to develop a framework for universities looking to recover rights for faculty in their works, particularly backlist and out-of-print books that are unavailable in electronic form. Though the open access movement has made significant strides in advancing free availability and reuse of scholarly articles, that progress has generally not extended to books and other monographic works, in part because of the non-standard and often complicated nature of book publishing licenses. It has also not done as much to open backfile access to older journal articles. We think a framework for identifying opportunities to recover rights and relicense them under an open access license will help advance open access of these works.

Eric Harbeson

The project will be spearheaded by Eric Harbeson, who joined the Authors Alliance this week as Scholarly Publications Legal Fellow. Eric is a recent graduate of the University of Oregon School of Law. Prior to law school, Eric had a dual career as a librarian/archivist and a musicologist. Eric did extensive work advocating for libraries’ and archives’ copyright interests, especially with respect to preservation of music and sound recordings. Eric’s publications include a well-regarded report on the Music Modernization Act, as well as two scholarly music editions. Eric can be reached at eric@authorsalliance.org.

Clickbait arguments in AI Lawsuits (will number 3 shock you?)

Posted August 15, 2024

Image generated by Canva

The booming AI industry has sparked heated debates over what AI developers are legally allowed to do. So far, we have learned from the US Copyright Office and courts that AI-created works are not protectable unless they are combined with human authorship.

As we monitor two dozen ongoing lawsuits and regulatory efforts that address various aspects of AI’s legality, we see legitimate legal questions that must be resolved. However, we also see some prominent yet flawed arguments that have been used to inflame discussions, particularly by publisher-plaintiffs and their supporters. For now, let’s focus on some clickbait arguments that sound appealing but are fundamentally baseless.

Will AI doom human authorship?

Based on current research, AI tools can actually help authors improve their creativity, their productivity, and the longevity of their careers.

When AI tools such as ChatGPT first appeared online, many leading authors and creators publicly endorsed them as useful tools, like any other tech innovation that came before them. At the same time, many others claimed that authors and creators of lesser caliber would be disproportionately disadvantaged by the advent of AI.

This intuition-driven hypothesis, that AI will be the bane of average authors, has so far proved to be misguided.

We now know that AI tools can greatly help authors during the ideation stage, especially for less creative authors. According to a study published last month, AI tools had minimal impact on the output of highly creative authors, but were able to enhance the works of less imaginative authors. 

AI can also serve as a readily-accessible editor for authors. Research shows that AI enhances the quality of routine communications. Without AI-powered tools, a less-skilled person will often struggle with the cognitive burden of managing data, which limits both the quality and quantity of their potential output. AI helps level the playing field by handling data-intensive tasks, allowing writers to focus more on making creative and other crucial decisions about their works. 

It is true that entirely AI-generated works of abysmal quality are available for purchase on some platforms. Some of these works use human authors’ names without authorization. These AI-generated works may infringe on authors’ right of publicity, but they do not present commercially viable alternatives to books authored by humans. Readers prefer higher-quality works produced with human supervision and intervention (provided that digital platforms do not act recklessly towards their human authors despite generating huge profits from them).

Are lawsuits against AI companies brought with authors’ best interest in mind? 

In the ongoing debate over AI, publishers and copyright aggregators have suggested that they have brought these lawsuits to defend the interests of human authors. Consider the New York Times, for example. In its complaint against OpenAI, the NY Times describes its operations as “a creative and deeply human endeavor (¶31)” that necessitates “investment of human capital (¶196).” The NY Times argues that OpenAI has built innovation on the stolen hard work and creative output of journalists, editors, photographers, data analysts, and others. That argument is contrary to what the NY Times once argued in court in New York Times v. Tasini: that authors’ rights must take a backseat to the NY Times’ financial interests in new digital uses.

It is also hard to believe that many of the publishers and aggregators are on the side of authors when we look at how they have approached licensing deals for AI training. These licensing deals can be extremely profitable for the publishers. For example, Taylor and Francis sold AI training data to Microsoft for $10 million. John Wiley and Sons earned $23 million from a similar deal with an undisclosed tech company. Though we don’t have the details of these agreements, it seems easy to surmise that in return for the money received, the publishers will not harass the AI companies with future lawsuits. (See our previous blog post about these licensing deals and what you can do as an author.) It is ironic how an allegedly unethical and harmful practice quickly becomes acceptable once the publishers are profiting from it.

How much of the millions of dollars changing hands will go to individual authors? Limited data exist. We know that Cambridge University Press, a good-faith outlier, is offering authors 20% royalties if their work is licensed for AI training. Most publishers and aggregators are entirely opaque about how authors are to be compensated in these deals. Take the Copyright Clearance Center (CCC), for example: it offers zero information about how individual authors are consulted or compensated when their works are sold for AI training under the CCC’s AI training license.

This is by no means a new problem for authors. We know that traditionally published book authors receive royalties of around 10% from their publishers: a little under $2 per copy for most books. On an ebook, authors receive a similar amount for each “copy” sold. This small amount handed to authors only starts to look generous when compared to academic publishing, where authors increasingly pay publishers to have their articles published in journals. The journal authors receive zero royalties, despite the publishers’ growing profits.

Even before the advent of AI technology, most authors were struggling to make a living on writing alone. According to a 2018 Authors Guild survey, the median income for full-time writers was $20,300, and for part-time writers, a mere $6,080. Fair wages and equitable profit sharing are issues that need to be settled between authors and publishers, even if publishers try to scapegoat AI companies.

It’s worth acknowledging that it’s not just publishers and copyright industry organizations filing these lawsuits. Many of these ongoing lawsuits have been filed as class actions, with the plaintiffs claiming to represent a broad class of people who are similarly situated and (so they allege) hold similar views. Most notably, in Authors Guild v. OpenAI, the Authors Guild and its named individual plaintiffs claim to represent all fiction writers in the US who have sold more than 5,000 copies of a work. There is also another case, supported by the Authors Guild, in which the plaintiff claims to represent all copyright holders of non-fiction works, including authors of academic journal articles, and several others in which an individual plaintiff asserts the right to represent virtually all copyright holders of any type.

As we (along with many others) have repeatedly pointed out, many authors disagree with the publishers’ and aggregators’ restrictive view of fair use in these cases, and don’t want or need a self-appointed guardian to “protect” their interests. We saw the same overbroad class designation in the Authors Guild v. Google case, which caused many authors to object, including many of our own 200 founding members.

Respect for copyright and human authors’ hard work means no more AI training under US copyright law? 

While we wait for courts to figure out the key questions on infringement and fair use, let’s take a moment to remember what copyright law does not regulate.

Copyright law in the US exists to further the Constitutional goal to “promote the Progress of Science and useful Arts.” In 1991, the Supreme Court held in Feist v. Rural Telephone Service that copyright cannot be granted solely based on how much time or energy authors have expended. “Compensation for hard work” may be a valid ethical discussion, but it is not a relevant topic in the context of copyright law.

Publishers and aggregators preach that people must “respect copyright,” as if copyright were synonymous with the exclusive rights of the copyright holder. This is inaccurate and misleading. In order to safeguard the freedom of expression, copyright is designed to embody not only the rightsholders’ exclusive rights but also many exceptions and limitations to those rights. Similarly, there’s no sound legal basis to claim that authors must have absolute control over their own work and its message. Knowledge and culture thrive because authors are permitted to build upon and reinterpret the works of others.

Does this mean I should side with the AI companies in this debate?

Many of the largest AI companies exhibit troubling traits that they have in common with many publishers, copyright aggregators, digital platforms (e.g., Twitter, TikTok, YouTube, Amazon, Netflix, etc.), and many other companies with dominant market power. There’s no transparency or oversight afforded to the authors or the public. The authors and the public have little say in how the AI models are trained, just as we have no influence over how content is moderated on digital platforms, how much authors receive in royalties from the publishers, or how much publishers and copyright aggregators can charge users. None of these crucial systemic flaws will be fixed by granting publishers a share of AI companies’ revenue.

Copyright also is not the entire story. As we’ve seen recently, there are some significant open questions about the right of publicity and somewhat related concerns about the ability of AI to churn out digital fakes for all sorts of purposes, some of which are innocent, while others are fraudulent, misleading, or exploitative. The US Copyright Office released a report on digital replicas on July 31 addressing the question of digital publicity rights, and on the same day the NO FAKES Act was officially introduced. Will the rights of authors and the public be adequately considered in that debate? Let’s remain vigilant as we wait to see the first-ever AI-generated public figure in a leading role hit theaters in September 2024.

Some Initial Thoughts on the US Copyright Office Report on AI and Digital Replicas

Posted August 1, 2024

On July 31, 2024, the U.S. Copyright Office published Part 1 of its report summarizing the Office’s ongoing initiative on artificial intelligence. This first part of the report addresses digital replicas: in other words, how AI is used to realistically but falsely portray people in digital media. In its report, the Office recommends new federal legislation that would create a new right to control “digital replicas,” which it defines as “a video, image, or audio recording that has been digitally created or manipulated to realistically but falsely depict an individual.”

We remain somewhat skeptical that such a right would do much to address the most troubling abuses such as deepfakes, revenge porn, and financial fraud. But, as the report points out, a growing number of varied state legislative efforts are already in the works, making a stronger case for unifying such rules at the federal level, with an opportunity to ensure adequate protections are in place for creators.  

The backdrop for the inquiry and report is a fast-developing space of state-led legislation, including legislation with regard to deepfakes. Earlier this year, Tennessee became the first state to enact such a law, the ELVIS Act (TN HB 2091), while other states mostly focused on addressing deepfakes in the context of sexual acts and political campaigns. New state laws are continuing to be introduced, making it harder and harder for creators, AI companies, and consumers alike to navigate the space. A federal right of publicity in the context of AI has already been discussed in Congress, and just yesterday a new bill was formally introduced, titled the “NO FAKES Act.”

Authors Alliance has watched the development of this US Copyright Office initiative closely. In August 2023, the Office issued a notice of inquiry, asking stakeholders to weigh in on a series of questions about copyright policy and generative AI. Our comment in response to the inquiry was devoted in large part to explaining how authors are using generative AI, how fair use should apply to training AI, and why the USCO should be cautious in recommending new legislation to Congress.

This report and recommendation from the Copyright Office could have a meaningful impact on authors and other creators, including both those whose personality and images are subject to use with AI systems, and those who are actively using AI in their writing and research. Below are our preliminary thoughts on what the Copyright Office recommends, which it summarizes in the report as follows:

“We recommend that Congress establish a federal right that protects all individuals during their lifetimes from the knowing distribution of unauthorized digital replicas. The right should be licensable, subject to guardrails, but not assignable, with effective remedies including monetary damages and injunctive relief. Traditional rules of secondary liability should apply, but with an appropriately conditioned safe harbor for OSPs. The law should contain explicit First Amendment accommodations. Finally, in recognition of well-developed state rights of publicity, we recommend against full preemption of state laws.”

Initial Impressions

Overall, this seems like a well-researched and thoughtful report, given that the Office had to navigate a huge number of comments and opinions (over 10,000 comments were submitted). The report also incorporates many more recent developments, including numerous new state laws and federal legislative proposals.

Things we like: 

  • In the context of an increasing number of state legislative efforts (some overbroad and more likely to harm creators than help them), we appreciate the Office’s recognition that a patchwork of laws can pose a real problem for users and creators who are trying to understand their legal obligations when using AI that references and implicates real people.
  • The report also recognizes that the concerns motivating digital replica laws (things like control of personality, privacy, fraud, and deception) are not at their core copyright concerns. “Copyright and digital replica rights serve different policy goals; they should not be conflated.” This matters a lot for what the scope of protection and other details of a digital replica right look like. Copy-pasting copyright’s life+70 term of protection, for example, makes little sense (and the Office recognizes this, for example, by rejecting the idea of posthumous digital replica rights).
  • The Office also suggests limiting the transferability of rights. We think this is a good idea to protect individuals from unanticipated downstream use by companies that may persuade individuals to sign agreements that would lock them into unfavorable long-term deals. “Unlike publicity rights, privacy rights, almost without exception, are waivable or licensable, but cannot be assigned outright. Accordingly, we recommend a ban on outright assignments, and the inclusion of appropriate guardrails for licensing, such as limitations in duration and protection for minors.”
  • The Office explicitly rejects the idea of a new digital replica right covering “artistic style.” We agree that protection of artistic style is a bad idea. Creators of all types have always used existing styles and methods as a baseline to build upon, and it’s resulted in a rich body of new works. Allowing for control over “style” however well-defined, would impinge on these new creations. Strong federal protection over “style” would also contradict traditional limitations on rights, such as Section 102(b)’s limits on copyrightable subject matter and the idea/expression dichotomy, which are rooted in the Constitution. 

Some concerns: 

  • The Office’s proposal would apply to the distribution of digital replicas, which are defined as “a video, image, or audio recording that has been digitally created or manipulated to realistically but falsely depict an individual.” This definition is quite broad and could potentially include a large number of relatively common and mostly innocuous uses—e.g., taking a photo with your phone of a person and applying a standard filter on your camera app could conceivably fall within the definition. 
  • First Amendment rights to free expression are critical for protecting uses for news reporting, artistic uses, parody and so on. Expressive uses of digital replicas (e.g., a documentary that uses AI to replicate a recent event involving recognizable people, or reproduction in a comedy show to poke fun at politicians) could be significantly hindered by an expansive digital replica right unless it has robust free expression protections. Of course, the First Amendment applies regardless of the passing of a new law, but it will be important for any proposed legislation to find ways to allow people to exercise those rights effectively. As the report explains, comments were split. Some, like the Motion Picture Association, proposed enumerated exceptions for expressive use, while others, such as the Recording Industry Association of America, took the position that “categorical exclusions for certain speech-oriented uses are not constitutionally required and, in fact, risk overprotection of speech interests at the expense of important publicity interests.”

We tend to think that most laws should skew toward “overprotection of speech interests,” but the devil is in the details on how to do so. The report leaves much to be desired on how to do this effectively in the context of digital replicas. For its part, “[t]he Office stresses the importance of explicitly addressing First Amendment concerns. While acknowledging the benefits of predictability, we believe that in light of the unique and evolving nature of the threat to an individual’s identity and reputation, a balancing framework is preferable.” One thing to watch in future proposals is what such a balancing framework actually includes, and how easy or difficult it is to assert protection of First Amendment rights under this balancing framework. 

  • The Office rejects the idea that Section 230 should provide protection for online service providers if they host content that runs afoul of the proposed new digital replica rights. Instead, the Office suggests something like a modified version of the Copyright Act’s DMCA section 512 notice and takedown process. This isn’t entirely outlandish—the DMCA process mostly works, and if this new proposed digital replica right is to be effective in practice, asking large service providers that are benefiting from hosting content to be responsive in cases of alleged infringing content may make sense. But, the Office says that it doesn’t believe the existing DMCA process should be the model, and points to its own Section 512 report for how a revised version for digital replicas might work. If the Office’s 512 study is a guide to what a notice-and-takedown system could look like for digital replicas, there is reason to be concerned.  While the study rejected some of the worst ideas for changing the existing system (e.g., a notice-and-staydown regime), it also repeatedly diminished the importance of ideas that would help protect creators with real First Amendment and fair use interests. 
  • The motivations for the proposed digital replica right are quite varied. For some commenters, it’s an objection to the commercial exploitation of public figures’ images or voices. For others, the need is to protect against invasions of privacy. For yet others, it is to prevent consumer confusion and fraud. The Office acknowledges these different motivating factors in its report and in its recommendations attempts to balance competing interests among them. But, there are still real areas of discontinuity—e.g., the basic structure of the right the Office proposes is intellectual-property-like. But it doesn’t really make a lot of sense to try to address some of the most pernicious fraudulent uses, such as deepfakes to manipulate public opinion, revenge porn, or scam phone calls, with a privately enforced property right oriented toward commercialization. Discovering and stopping those uses requires a very different approach and one that this particular proposal seems ill-equipped to deal with. 

Barely a few months ago, we were extremely skeptical that new federal legislation on digital replicas was a good idea. We’re still not entirely convinced, but the rash of new and proposed state laws does give us some pause. While the federal legislative process is fraught, it is also far from ideal for authors and creators to operate under a patchwork of varying state laws, especially those that provide little protection for expressive uses. Overall, we hope certain aspects of this report can positively influence the debate about existing federal proposals in Congress, but remain concerned about the lack of detail about protections for First Amendment rights. 

In the meantime, you can check out our two new resource pages on Generative AI and Personality Rights to get a better understanding of the issues.

What happens when your publisher licenses your work for AI training? 

Posted July 30, 2024
Photo by Andrew Neel on Unsplash

Over the last year, we’ve seen a number of major deals inked between companies like OpenAI and news publishers. In July 2023, OpenAI entered into a two-year deal with The Associated Press for ChatGPT to ingest the publisher’s news stories. In December 2023, OpenAI announced its first non-US partnership to train ChatGPT on German publisher Axel Springer’s content, including Business Insider. This was then followed by a similar deal in March 2024 with Le Monde and Prisa Media, news publishers from France and Spain. These partnerships are likely sought in an effort to avoid litigation like the case OpenAI and Microsoft are currently defending against the New York Times.

As it turns out, such deals are not limited to OpenAI or newsrooms. Book publishers have also gotten into the mix. Numerous reports recently pointed out that, based on a market update from Taylor and Francis’s parent company, the British academic publishing giant has agreed to a $10 million AI training deal with Microsoft. Earlier this year, another major academic publisher, John Wiley and Sons, recorded $23 million in one-time revenue from a similar deal with an undisclosed tech company. Meta even considered buying Simon & Schuster or paying $10 per book to acquire its rights portfolio for AI training.

With few exceptions (a notable one being Cambridge University Press), publishers have not bothered to ask their authors whether they approve of these agreements. 

Does AI training require licensing to begin with? 

First, it’s worth appreciating that these deals are made against a backdrop of some legal uncertainty. There are more than two dozen AI copyright lawsuits in the United States alone, most of them turning on one key question: whether AI developers should have to obtain permission to scrape content to train AI models or whether fair use already allows this kind of training use even without permission.

The arguments for and against fair use for AI training data are well explained elsewhere. We think there are strong arguments, based on cases like Authors Guild v. Google, Authors Guild v. HathiTrust, and AV ex rel Vanderhye v. iParadigms, that the courts will conclude that copying to train AI models is fair use. We also think there are really good policy reasons to think this could be a good outcome if we want to encourage future AI development that isn’t dominated by only the biggest tech giants and that results in systems that produce less biased outputs. But we won’t know for sure whether fair use covers any and all AI training until some of these cases are resolved.

Even if you are firmly convinced that fair use protects this kind of use (and AI model developers have strong incentives to hold this belief), there are lots of other reasons why AI developers might seek licenses in order to navigate around the issue. This includes very practical reasons, like securing access to content in formats that make training easier, or content accompanied by structured, enhanced metadata. Given the pending litigation, licenses are also a good way to avoid costly litigation (copyright lawsuits are expensive, even if you win). 

Although one can hardly blame these tech companies for making a smart business decision to avoid potential litigation, this could have a larger systemic impact on other players in the field, including the academic researchers who would like to rely on fair use to train AI. As IP scholar James Gibson explains, when risk-averse users create new licensing markets in gray areas of copyright law, copyright holders’ exclusive rights expand, and the public interest diminishes. The less we rely on fair use, the weaker it becomes.

Finally, it’s worth noting that fair use is only available in the US and a few other jurisdictions. In other jurisdictions, such as within the EU, using copyrighted materials for AI training (especially for commercial purposes) may require a license. 

To sum up: even though it may not be legally necessary to acquire copyright licenses for AI training, it seems that licensing deals between publishers and AI companies are highly likely to continue. 

So, can publishers just do this without asking authors? 

In a lot of cases, yes, publishers can license AI training rights without asking authors first. Many publishing contracts include a full and broad grant of rights, sometimes even a full transfer of copyright to the publisher, allowing it to exploit those rights and to license them to third parties. For example, this typical academic publishing agreement provides that “The undersigned authors transfer all copyright ownership in and relating to the Work, in all forms and media, to the Proprietor in the event that the Work is published.” In such cases, when the publisher becomes the copyright holder of a work, it’s difficult for authors to stake a copyright claim when their works are being sold to train AI.

Not all publishing contracts are so broad, however. For example, in the Model Publishing Contract for Digital Scholarship (which we have endorsed), the publisher’s sublicensing rights are limited and specifically defined, and profits resulting from any exploitation of a work must be shared with authors.  

There are lots of variations, and specific terms matter. Some publisher agreements are far more limited–transferring only limited publishing and subsidiary rights. These limitations in the past have prompted litigation over whether the publisher or the author gets to control rights for new technological uses. Results have been highly dependent on the specific contract language used. 

There are also instances where publishers aren’t even sure of what they own. For example, in the drawn-out copyright lawsuit brought by Cambridge University Press, Oxford University Press and Sage against Georgia State University, the court dropped 16 of the 74 claimed instances of infringement because the publishers couldn’t produce documentation that they actually owned rights in the works they were suing over. This same lack of clarity contributed to the litigation and proposed settlement in the Google Books case, which is probably our closest analogy in terms of mass digitization and reuse of books (for a good discussion of these issues, see page 479 of this law review article by Pamela Samuelson about the Google Books settlement).

This is further complicated by the fact that authors sometimes are entitled to reclaim their rights, such as through a rights reversion clause or copyright termination. The fact that a publisher can produce documentation of a copyright assignment does not necessarily mean that the publisher is still the current copyright holder of a work.

We think it is certainly reasonable to be skeptical about the validity of blanket licensing schemes between large corporate rights holders and AI companies, at least when they are done at very large scale. Even though in some instances publishers do hold rights to license AI training, it is dubious whether they actually hold, and sufficiently document, all of the purported rights of all works being licensed for AI training.

Can authors at least insist on a cut of the profit? 

It can feel pretty bad to discover that massively profitable publishers are raking in yet more money by selling licensing rights to your work, while you’re cut out of the picture. If they’re making money, why not the author? 

It’s worth pointing out that, at least for academic authors, this isn’t exactly a novel situation: most academic authors make very little in royalties on their books, and nothing at all on their articles, while commercial publishers like Elsevier, Wiley, and SpringerNature sell subscription access at a healthy profit. Unless you have retained sublicensing rights, or your publishing contract has a profit-sharing clause, you are, unfortunately, not likely to profit from the budding licensing market for AI training.

So what are authors to do? 

We could probably start most posts like this with a big red banner that says “READ YOUR PUBLISHING CONTRACT!! (and negotiate it too).” Be on the lookout for what you are authorizing your publisher to do with your rights, and for any language in the contract about reuse or the sublicensing of subsidiary rights.

You might also want to look for terms in your contract that speak to royalties and shares of licensing revenue. Some contracts have language that will allow you to demand an accounting of royalties; this may be an effective means of learning more about licensing deals associated with your work. 

You can also take a closer look at clauses that allow you to revert rights–many contracts will include a clause under which authors can regain rights when their book falls below a certain sales threshold or otherwise becomes “out of print.” Even without such clauses, it is reasonable for authors to negotiate a reversion of rights when their books are no longer generating revenue. Our resources on reversion will give you a more in-depth look at this issue.

Finally, you can voice your support for fair use in the context of licensing copyrighted works for AI training. We think fair use is especially important to preserve for non-commercial uses. For example, academic uses could be substantially stifled if paid-for licensing for permission to engage in AI research or related uses becomes the norm. And in those cases, the windfall publishers hope to pocket isn’t coming from some tech giant, but ultimately is at the expense of researchers, their libraries and universities, and the public funding that goes to support them.