Tag Archives: text and data mining

Restricting Innovation: How Publisher Contracts Undermine Scholarly AI Research

Posted December 6, 2024

This post is by Rachael Samberg, Director, Scholarly Communication & Information Policy, UC Berkeley Library, and Dave Hansen, Executive Director, Authors Alliance

This post is about the research, and the advancement of science and knowledge, that becomes impossible when publishers use contracts to limit researchers’ ability to use AI tools with scholarly works.

Within the scholarly publishing community, mixed messages abound about who gets to say when and how AI tools can be used for research reliant on scholarly works like journal articles or books. Some scholars voiced concern (explained more here) when major scholarly publishers like Wiley and Taylor & Francis entered lucrative contracts with big technology companies to allow AI training without first seeking permission from authors. We suspect that these publishers have the legal right to do so, since most publishers demand that authors hand over extensive rights in exchange for publishing their work. And against the backdrop of dozens of pending AI copyright lawsuits, who can blame the AI companies for paying for licenses, if for no other reason than to avoid the pain of litigation? While it stings to see the same large commercial academic publishers profit yet again off of the work academic authors submit to them for free, we continue to think there are good ways for authors to retain a say in the matter.

Big tech companies are one thing, but what about scholarly research? What about the large and growing number of scholars who are themselves using copyrighted scholarly content with AI tools to conduct their research? We currently face a situation in which publishers attempt to dictate how and when researchers can do that work, even though authors’ fair use rights to use and derive new understandings from scholarship clearly allow for such uses.

How vendor contracts disadvantage US researchers

We have written elsewhere (in an explainer and a public comment to the Copyright Office) about why training AI tools, particularly in the scholarly and research context, constitutes a fair use under U.S. copyright law. Training AI rests on a statutory right already held by all scholarly authors engaging in computational research, one that is critical for the advancement of knowledge and that lawmakers should preserve.

The problem U.S. scholarly authors presently face with AI training is that publishers restrict these statutory rights through contracts that override them: in the United States, publishers can use private contracts to take away statutory fair use rights that researchers would otherwise hold under federal law. Here, the private contracts at issue are the electronic resource (e-resource) license agreements that academic research libraries sign to secure campus access to the electronic journal, e-book, data, and other content that scholars need for their computational research.

Contractual override of fair use is a problem that disparately disadvantages U.S. researchers. As we have described elsewhere, more than forty countries, including the European Union, expressly reserve text mining and AI training rights for scientific research by research institutions. Scholars in these countries not only do not have to worry about whether their computational research with AI is permitted; they also do not risk having those reserved rights overridden by contract. The European Union’s Copyright in the Digital Single Market Directive and recent AI Act nullify any attempt to circumscribe the text and data mining and AI training rights reserved for scientific research within research organizations. U.S. scholars are not as fortunate.

In the U.S., most institutional e-resource licenses are negotiated and managed by research libraries, so it is imperative that scholars work closely with their libraries and advocate to preserve their computational research and AI training rights within the e-resource license agreements that universities sign. To that end, we have developed adaptable licensing language to support institutions nationwide in doing just that. But while this language is helpful, the onus remains on scholars and libraries to advocate and negotiate for those rights in the contracting process. In our experience, it helps to explain to publishers that they must consent to these terms in the European Union and can do so in the U.S. as well. That, combined with strong faculty and administrative support (such as at the University of California), makes for a strong stance against curtailment of these rights.

But we think there are additional practical ways for libraries to illustrate—both to publishers and scholarly authors—exactly what would happen to the advancement of knowledge if publishers’ licensing efforts to curtail AI training were successful. One way to do that is by “unpacking” or decoding a publisher’s proposed licensing restriction, and then demonstrating the impact that provision would have on research projects that were never objectionable to publishers before, and should not be now. We’ll take that approach below.

Decoding a publisher restriction

A commercial publisher recently proposed the following clause in an e-resource agreement:

Customer [the university] and its Authorized Users [the scholars] may not:

  1. directly or indirectly develop, train, program, improve, and/or enrich any artificial intelligence tool (“AI Tool”) accessible to anyone other than Customer and its Authorized Users, whether developed internally or provided by a third party; or
  2. reproduce or redistribute the Content to any third-party AI Tool, except to the extent limited portions of the Content are used solely for research and academic purposes (including to train an algorithm) and where the third-party AI Tool (a) is used locally in a self-hosted environment or closed hosted environment solely for use by Customer or Authorized Users; (b) is not trained or fine-tuned using the Content or any part thereof; and (c) does not share the Content or any part thereof with a third party.  

What does this mean?

  • The first paragraph forbids the training or improving of any AI tool that is accessible or released to third parties. It further forbids using any computational outputs or analysis derived from the licensed content to train any tool available to third parties.
  • The second paragraph is perhaps even more concerning. It provides that, when using third-party AI tools of any kind, scholars can use only limited portions of the licensed content with the tools, and it prohibits any training at all of third-party tools, even if the tool is non-generative and the scholar is performing the work in a completely closed and highly secure research environment.

What would the impact of such a restrictive licensing provision be on research? 

It would mean that every single one of the trained tools in the following projects could never be disseminated. In addition, for the projects below that used third-party AI tools, the research would have been prohibited full-stop, because the third-party tools in those projects required training, which the publisher above is attempting to prevent:

Tools that could not be disseminated

  1. In 2017, chemists created and trained a generative AI tool on 12,000 published research papers about synthesis conditions for metal oxides, so that the tool could identify anticipated chemical outputs and reactions for any given set of synthesis conditions entered into it. The generative tool they created is not capable of reproducing or redistributing any licensed content from the papers; it has merely learned conditions and outcomes, and it can predict chemical reactions based on those conditions and outcomes. Yet this beneficial tool would be prohibited from dissemination under the publisher’s terms identified above.
  2. In 2018, researchers trained an AI tool (which they had originally created in 2014) to assess whether a character is “masculine” or “feminine” by looking at the tacit assumptions expressed in words associated with that character. The trained tool can then examine other texts and identify masculine or feminine characters based on what it learned during training, which means scholars can use texts from different time periods with the tool to study representations of masculinity and femininity over time. No licensed content, and no licensed or copyrighted books from a publisher, can ever be released to the world by sharing the trained tool; the trained tool is merely capable of topic modeling—but the publisher’s above language would prohibit its dissemination nevertheless (a minimal illustration follows this list).
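To see why sharing such a tool releases no licensed text, it helps to look at what a trained model actually contains. The sketch below is a hypothetical illustration in Python using scikit-learn, not the researchers’ actual tool: the fitted model stores a vocabulary and numeric weights, nothing from which the training books could be read back.

```python
# Hypothetical illustration (not the 2014/2018 researchers' tool): a text
# classifier in the same spirit, trained on labeled character descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the thousands of passages a real project would draw
# from licensed books.
texts = [
    "he strode boldly into battle",
    "she wept quietly by the window",
    "he commanded the troops at dawn",
    "she tended the garden alone",
]
labels = ["masculine", "feminine", "masculine", "feminine"]

# Fit a TF-IDF + logistic regression pipeline. The shareable artifact is
# a vocabulary plus a weight per vocabulary term -- not the training text.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["he charged across the field"]))  # e.g. ['masculine']
```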

Tools that could neither be trained nor disseminated 

  1. In 2019, authors used text from millions of books published over 100 years to analyze cultural meaning. They did this by training third-party, non-generative AI word-embedding models called Word2Vec and GloVe on multiple textual archives. The tools cannot reproduce content: when shown new text, they merely represent words as numbers, or vectors, to evaluate or predict how semantically or linguistically similar words in a given space are (see the short sketch after this list). The similarity of words can reveal cultural shifts in the understanding of socioeconomic factors like class over time. But the publisher’s above licensing terms would prohibit the training of the tools to begin with, much less the sharing of them to support further or different inquiry.
  2. In 2023, scholars trained a third-party-created open-source natural language processing (NLP) tool called Chemical Data Extractor (CDE). Among other things, CDE can be used to extract chemical information and properties identified in scholarly papers. In this case, the scholars wanted to teach CDE to parse a specific type of chemical information: metal-organic frameworks, or MoFs. Generally speaking, the CDE tool works by breaking sentences into “tokens” like parts of speech and referenced chemicals. By correlating tokens, one can determine that a particular chemical compound has certain synthetic properties, topologies, reactions with solvents, etc. The scholars trained CDE specifically to parse MoF names, synthesis methods, inorganic precursors, and more—and then exported the results into an open source database that identifies the MoF properties for each compound. Anyone can now use both the trained CDE tool and the database of MoF properties to ask different chemical property questions or identify additional MoF production pathways—thereby improving materials science for all. Neither the CDE tool nor the MoF database reproduces or contains the underlying scholarly papers that the tool learned from. Yet, neither the training of this third-party CDE tool nor its dissemination would be permitted under the publisher’s restrictive licensing language cited above.
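For readers unfamiliar with word embeddings, here is a minimal sketch of the technique described in the first project above, written against gensim’s real Word2Vec implementation; the corpus and the word pair are toy stand-ins for the millions of digitized books in the actual study.

```python
# Minimal word-embedding sketch using gensim's Word2Vec. The trained model
# holds one dense vector per vocabulary word, plus word counts -- it does
# not retain or reproduce the sentences it was trained on.
from gensim.models import Word2Vec

corpus = [  # toy stand-in for a century-spanning book archive
    ["the", "merchant", "class", "rose", "in", "wealth"],
    ["the", "working", "class", "labored", "in", "the", "mills"],
    ["wealth", "conferred", "status", "and", "education"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, seed=1)

# Cosine similarity between vectors is the basic measurement researchers
# use to track cultural associations (e.g., class and wealth) over time.
print(model.wv.similarity("class", "wealth"))
```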

Indeed, there are hundreds of AI tools that scholars have trained and disseminated—tools that do not reproduce licensed content—and that scholars have created or fine-tuned to extract chemical information, recognize faces, decode conversations, infer character types, and so much more. Restrictive licensing language like that shown above suppresses the research inquiries and societal benefits that these tools make possible. It may also disproportionately affect the advancement of knowledge in or about developing countries, which may lack the resources to secure licenses or be forced to rely on open-source or poorly coded public data, hindering journalism, language translation, and language preservation.

Protecting access to facts

Why are some publishers doing this? Perhaps to reserve the opportunity to develop their own scholarship-trained AI tools, which they could then license at additional cost back to research institutions. We could speculate about motivations, but the upshot is that publishers have been pushing hard to foreclose scholars from training and disseminating AI tools that now “know” something based on the licensed content. That is, such publishers wish to prevent tools from learning facts about the licensed content.

However, this is precisely the purpose of licensing content. When institutions license content for their scholars to read, they are doing so for the scholars to learn information from the content. When scholars write or teach about the content, they are not regenerating the actual expression from the content—the part that is protected by copyright; rather, the scholars are conveying the lessons learned from the content—facts not protected by copyright. Prohibiting the training of AI tools and the dissemination of those tools is functionally equivalent to prohibiting scholars from learning anything from the content that institutions are licensing for that very purpose, and that scholars wrote to begin with! Publishers should not be able to monopolize the dissemination of information learned from scholarly content, especially when that information is used non-commercially.

For these reasons, when we negotiate to preserve AI usage and training rights, we generally try to achieve outcomes that would promote, rather than prohibit, all of the research projects described above.

The sample language we’ve disseminated empowers others to negotiate for these outcomes. We hope that, when coupled with the advocacy tools we’ve provided above, scholars and libraries can protect their AI usage and training rights, while also being equipped to consider how they want their own works to be used.

The DMCA 1201 Rulemaking: Summary, Key Takeaways, and Other Items of Interest

Posted November 8, 2024

Last month, we blogged about the key takeaways from the 2024 TDM exemptions recently put in place by the Librarian of Congress, including how the 2024 exemptions (1) expand researchers’ access to existing corpora, (2) definitively allow the viewing and annotation of copyrighted materials for TDM research purposes, and (3) create new obligations for researchers to disclose security protocols to trade associations. Beyond these key changes, the TDM exemptions remain largely the same: researchers affiliated with universities are allowed to circumvent TPMs to compile corpora for TDM research, provided that those copies of copyrighted materials are legally obtained and adequate security protocols are put in place.

We have since updated our resources page on Text and Data Mining and have incorporated the new developments into our TDM report: Text and Data Mining Under U.S. Copyright Law: Landscape, Flaws & Recommendations.

In this blog post, we share some further reflections on the newly expanded TDM exemptions—including (1) the use of AI tools in TDM research, (2) outside researchers’ access to existing corpora, (3) the disclosure requirement, and (4) a potential TDM licensing market—as well as other insights that emerged during the 9th triennial rulemaking.

The TDM Exemption

In other jurisdictions, such as the EU, Singapore, and Japan, legal provisions that permit “text and data mining” also allow a broad array of uses, such as general machine learning and generative AI model training. In the US, the exemptions allowing TDM had so far not explicitly addressed whether AI could be used as a tool for conducting TDM research. In this round of rulemaking, we were able to gain clarity on how AI tools may be used to aid TDM research. Advocates for the TDM exemptions provided ample examples of how machine learning and AI are key to conducting TDM research and asked that “generative AI” not be deemed categorically impermissible as a tool for TDM research. The Copyright Office agreed that a wide array of tools, including AI tools, can be utilized for TDM research under the exemptions, as long as the purpose is to conduct “scholarly text and data mining research and teaching.” The Office was careful to limit its analysis to those uses and not to address other applications, such as compiling data (or reusing existing TDM corpora) for training generative AI models, which it treated as an entirely separate issue from facilitating non-commercial TDM research.

Besides clarifying that AI tools are allowed for TDM research and that viewing and annotation of copyrighted materials are permitted, the new exemptions offer a meaningful improvement to TDM researchers’ access to corpora. The previous 2021 exemptions allowed access for purposes of “collaboration,” but many researchers interpreted that narrowly, and the Office confirmed that “collaboration” was not meant to encompass outside research projects entirely unrelated to the original research for which the corpus was created. Under the 2021 exemptions, a TDM corpus could only be accessed by outside researchers if they were working on the same research project as the original compiler of the corpus. The 2024 exemptions’ expansion of access to existing corpora has two main components and advantages.

The expansion allows new research projects to be conducted on existing corpora, permitting institutions that have created a corpus to provide access “to researchers affiliated with other nonprofit institutions of higher education, with all access provided only through secure connections and on the condition of authenticated credentials, solely for purposes of text and data mining research or teaching.” At the same time, it opens up new possibilities for researchers at institutions that otherwise would not have access, as the new exemption does not require, as a precondition, that the outside researchers’ institutions own copies of the works in the corpora. The new exemptions do impose some important limitations: only researchers at institutions of higher education are allowed this access, and nothing more than “access” is allowed—the exemption does not, for example, allow the transfer of a corpus for local use.

The Office emphasized the need for adequate security protections, pointing back to cases such as Authors Guild v. Google and Authors Guild v. HathiTrust, which emphasized how careful both organizations were, respectively, to prevent their digitized corpora from being misused. To take advantage of this newly expanded TDM exemption, it will be crucial for universities to provide adequate IT support to ensure that technical barriers do not impede TDM researchers. That said, the record for the exemption shows that existing users are exceedingly conscientious when it comes to security. There have been zero reported instances of security breaches or lapses related to TDM corpora being compiled and used under the exemptions. 

As we previously explained, the security requirements have changed in a few ways. The new rule clarifies that trade associations can send inquiries on behalf of rightsholders, but those inquiries must be supported by a “reasonable belief” that the sender’s works are in a corpus being used for TDM research. It remains to be seen how the new obligation to disclose security measures to trade associations will affect TDM researchers and their institutions. The Register indirectly called out as unreasonable the demands that trade associations sent to digital humanities researchers in the middle of the exemption process with a two-week response deadline, quoting NTIA (which provides input on the exemptions) in agreement that “[t]he timing, targeting, and tenor of these requests [for institutions to disclose their security protocols] are disturbing.” We are hopeful that this discouragement from the Copyright Office will prevent future large-scale harassment of TDM researchers and their institutions, but we will remain vigilant in case trade associations abuse this new power.

Alongside the concerns over disclosure requirements, we have some questions about the Copyright Office’s treatment of fair use as a rationale for circumventing TPMs for TDM research. The Register restated her 2021 conclusion that “under Authors Guild, Inc. v. HathiTrust, lost licensing revenue should only be considered ‘when the use serves as a substitute for the original.’” The Office, in its recommendations, placed considerable weight on the lack of a viable licensing market for TDM, which raises a concern that, in the Office’s view, a use that was once fair and legal might lose that status when the rightsholder starts to offer an adequate licensing option. While this may never become a real issue for the existing TDM exemptions (no sufficient licensing options exist for TDM researchers, and given the breadth and depth of content needed, none seems likely to develop), it nonetheless contributes to the growing confusion surrounding the stability of a fair use defense in the face of new licensing markets.

These concerns highlight the need for ongoing advocacy in the realm of TDM research. Overall, the Register of Copyrights recognizes TDM as “a relatively new field that is quickly evolving.” This means that we could ask the Library of Congress to relax the limitations placed on TDM if we can point to legitimate research-related purposes. But, given the nature of this process, it also means TDM researchers do not have a permanent and stable right to circumvent TPMs. As the exemptions remain subject to review every three years, many large trade associations continue to advocate for the TDM exemptions to be greatly limited or even eliminated, wishing to stifle independent TDM research. We will continue to advocate for TDM researchers, as we did during the 8th and 9th triennial rulemakings.

Looking beyond the TDM exemption, we noted a few other developments: 

Warhol has not fundamentally changed fair use

First, the Opponents of the renewal of the existing exemptions repeatedly pointed to Warhol Foundation v. Goldsmith—the Supreme Court’s most recent fair use opinion—to argue that it has changed the fair use analysis such that the existing exemptions should not be renewed. For example, the Opponents argued that the fair use analysis for repairing medical devices changed under Warhol because, according to them, commercial nontransformative uses were less likely to be fair. The Copyright Office did not agree. The Register said that the same fair use analysis as in 2021 applied and that the Opponents failed “to show that the Warhol decision constitutes intervening legal precedent rendering the Office’s prior fair use analysis invalid.” In another instance where the Opponents tried to argue that commerciality must be given more weight under Warhol, the Register pointed out that under Warhol commerciality is not dispositive and must be weighed against the purpose of the new use.  The arguments for revisiting the 2021 fair use analyses were uniformly rejected, which we think is good news for those of us who believe Warhol should be read as making a modest adjustment to fair use and not a wholesale reworking of the fair use doctrine. 

Do ownership and control of copies matter for access?

One of the requests before the Office was an expansion of an exemption that allows for access to preservation copies of computer programs and video games. The Office rejected the main thrust of the request but, in doing so, also provided an interesting clarification that may reveal some of the Office’s thinking about the relationship between fair use and access to copies owned by the user: 

“The Register concludes that proponents did not show that removing the single user limitation for preserved computer programs or permitting off-premises access to video games are likely to be noninfringing. She also notes the greater risk of market harm with removing the video game exemption’s premises limitation, given the market for legacy video games. She recommends clarifying the single copy restriction language to reflect that preservation institutions can allow a copy of a computer program to be accessed by as many individuals as there are circumvented copies legally owned.”

That sounds a lot like an endorsement of the idea that the owned-to-loaned ratio, a key concept in the controlled digital lending analysis, should matter in the fair use analysis (which is something the Hachette v. Internet Archive controlled digital lending court gave zero weight to). For future 1201 exemptions, we will have to wait and see whether the Office will use this framework in other contexts. 

Addressing other non-copyright and AI questions in the 1201 process

The Librarian of Congress’s final rule included a number of notes on issues not addressed by the rulemaking: 

“The Librarian is aware that the Register and her legal staff have invested a great deal of time over the past two years in analyzing the many issues underlying the 1201 process and proposed exemptions. 

Through this work, the Register has come to believe that the issue of research on artificial intelligence security and trustworthiness warrants more general Congressional and regulatory attention. The Librarian agrees with the Register in this assessment. As a regulatory process focused on technological protection measures for copyrighted content, section 1201 is ill-suited to address fundamental policy issues with new technologies.” 

Proponents tried to argue that the software platforms’ restrictions and barriers to conducting AI research, such as their account requirements, rate limits, and algorithmic safeguards, are circumventable TPMs under 1201, but the Register disagreed. The Register maintained that the challenges Proponents described arose not out of circumventable TPMs but out of third-party controlled Software as a Service platforms. This decision can be illuminating for TDM researchers seeking to conduct TDM research on online streaming media or social media posts.

The Librarian’s note went on to say: “The Librarian is further aware of the policy and legal issues involving a generalized ‘right to repair’ equipment with embedded software. These issues have now occupied the White House, Congress, state legislatures, federal agencies, the Copyright Office, and the general public through multiple rounds of 1201 rulemaking.

Copyright is but one piece in a national framework for ensuring the security, trustworthiness, and reliability of embedded software, as well as other copyright-protected technology that affects our daily lives. Issues such as these extend beyond the reach of 1201 and may require a broader solution, as noted by the NTIA.”

These notes give an interesting, though somewhat confusing, insight into how the Librarian of Congress and the Copyright Office think about the role of 1201 rulemaking when it touches issues beyond copyright’s core concerns. While we can agree that 1201 is ill-suited to address fundamental policy issues with new technologies, it is also somewhat concerning that the Office and the Librarian view copyright more generally as part of a broader “national framework for ensuring the security, trustworthiness, and reliability of embedded software.” While copyright is, of course, sometimes used to further ends outside its intended purpose, these issues are far from the core constitutional purpose of copyright law, and we think they are best addressed through other means.

Text Data Mining Research DMCA Exemption Renewed and Expanded

Posted October 25, 2024
U.S. Copyright Office 1201 Rulemaking Process, taken from https://www.copyright.gov/1201/

Earlier today, the Library of Congress, following recommendations from the U.S. Copyright Office, released its final rule adopting exemptions to the Digital Millennium Copyright Act’s prohibition on circumvention of technological protection measures (e.g., DRM).

As many of you know, we’ve been working closely with members of the text and data mining community as well as our co-petitioners, the Library Copyright Alliance (LCA) and the American Association of University Professors (AAUP), to petition for renewal of the existing TDM research exemption and to expand it to allow researchers to share their research corpora with researchers outside of their university (something not previously allowed). The process began over a year ago and involved an in-depth review by the U.S. Copyright Office, and we’re incredibly grateful for the expert legal representation before the Office over this past year by UC Berkeley Law’s Samuelson Law, Technology & Public Policy Clinic, in particular clinic faculty Erik Stallman and Jennifer Urban and Berkeley Law students Christian Howard-Sukhil, Zhudi Huang, and Matthew Cha.

We are very pleased to see that the Librarian of Congress both approved the renewal of the existing exemption and approved an expansion that allows research universities to provide access to TDM corpora for use by researchers at other universities.

The expanded rule is poised to make an immediate impact by helping TDM researchers collaborate and build upon each other’s work. As Allison Cooper, director of Kinolab and Associate Professor of Romance Languages and Literatures and Cinema Studies at Bowdoin College, explains:

“This decision will have an immediate impact on the ongoing close-up project that Joel Burges, Emily Sherwood, and I are working on by allowing us to collaborate with researchers like David Bamman, whose expertise in machine learning will be valuable in answering many of the ‘big picture’ questions about the close-up that have come up in our work so far.”

These are the main takeaways from the new rule: 

  • The exemption has been expanded to allow “access” to corpora by researchers at other institutions “solely for purposes of text and data mining research or teaching.” There is no more requirement that access be granted as part of a “collaboration,” so new researchers can ask new and different questions of a corpus. Access must be credentialed and authenticated.
  • The issue of whether a researcher can engage in “close viewing” of a copyrighted work has been resolved—as the explanation for the revised rule puts it, researchers can “view the contents of copyrighted works as part of their research, provided that any viewing that takes place is in furtherance of research objectives (e.g., processing or annotating works to prepare them for analysis) and not for the works’ expressive value.” This is a very helpful clarification!
  • The new rule also modified the existing security requirements, which provide that researchers must put in place adequate security protocols to protect TDM corpora from unauthorized reuse and must share information about those security protocols with rightsholders upon request. That rule has been limited in some ways and expanded in others. The new rule clarifies that trade associations can send inquiries on behalf of rightsholders. However, inquiries must be supported by a “reasonable belief” that the sender’s works are in a corpus being used for TDM research.

Later on, we will post a more in-depth analysis of the new rules, both the TDM exemptions and others that apply to authors. The Librarian of Congress also authorized the renewal of a number of other rules that support research, teaching, and library preservation. Among them is a renewal of another exemption that Authors Alliance and AAUP petitioned for, allowing for the circumvention of digital locks when using motion picture excerpts in multimedia ebooks.

Thank you to all of the many, many TDM researchers and librarians we’ve worked with over the last several years to help support this petition. 

You can learn more about TDM and our work on this issue through our TDM resources page, here.

Authors Alliance Submits Long-Form Comment to Copyright Office in Support of Petition to Expand Existing Text and Data Mining Exemption 

Posted January 29, 2024

Last month, Authors Alliance submitted detailed comments in response to the Copyright Office’s Notice of Proposed Rulemaking in support of our petition to expand the existing Digital Millennium Copyright Act (DMCA) exemptions that enable text and data mining (TDM), as part of this year’s §1201 rulemaking cycle.

To recap: our expansion petitions ask the Copyright Office to modify the existing TDM exemption so that researchers who assemble corpora of ebooks or films on which to conduct text and data mining are able to share those corpora with other academic researchers, where this second group of researchers also qualifies under the exemption. Under the current exemption, academic researchers are only able to share their corpora with other qualified researchers for purposes of “collaboration and verification.” This simple change would eliminate the need for duplicative efforts to remove digital locks from ebooks and films, a time- and resource-intensive process, and would broaden the group of academic researchers who are able to use the exemption.

Our comment argues that the existing TDM exemption has begun to enable valuable digital humanities research and teaching, but that the proposed expansion would go much further towards enabling this research and helping TDM researchers reach their goals. The comment is accompanied by 13 letters of support from researchers, educators, and funding organizations, highlighting the research that has been done in reliance on the exemption, and explaining why this expansion is necessary. Our thanks go out to our stellar clinical team at UC Berkeley’s Samuelson Law, Technology & Public Policy Clinic—law students Mathew Cha and Zhudi Huang, and clinical supervisor Jennifer Urban—for writing and submitting this comment on our behalf. We are also grateful to our co-petitioners, the Library Copyright Alliance and American Association of University Professors, for their support on this comment. 

Ambiguity in “Collaboration”

One reason the expansion is necessary is the uncertainty over what constitutes “collaboration” under the existing exemption. Researchers have open questions about what level of individual contribution to a project would make researchers “collaborators” under the exemption. As our comment explains, collaboration can come in a number of different forms, from “formal collaborations under the auspice of a grant, [to] ad hoc collaborations that result from two teams discovering that they are working on similar material to the same ends, or even discussions at conferences between members of a loose network of scholars working on the same broad set of interests.” But it is not clear which of these activities is “collaboration” for the purposes of the exemption. And this uncertainty has had a chilling effect on the socially valuable research made possible by the exemption. 

Costly Corpora Creation 

Our comment also highlights the vast costs that go into creating a usable corpus for TDM research. Institutions whose researchers conduct TDM research pursuant to the exemption must lawfully own the works in question, or license them through a license that is not time-limited. But these acquisition costs pale in comparison to the required computing resources—a cost compounded by the exemption’s strict security requirements—and the human labor involved in bypassing technical protection measures and assembling a corpus. Moreover, there is simply not a tremendous amount of grant funding or even institutional support available to TDM researchers.

Because corpora are so costly to assemble and create, we believe it to be reasonable to permit researchers to share their corpora with researchers at other institutions who want to conduct independent TDM research on these corpora. As the exemption currently stands, researchers interested in pre-existing corpora must duplicate the efforts of the previous researchers, incurring massive costs along the way. We’ve already seen indications that these costs can lead researchers to avoid certain research questions and areas of study altogether. As our comment explains, this “duplicative circumvention” can be avoided by changing the language of the exemption to permit corpora sharing between qualified researchers at separate institutions. 

Equity Issues

Worse still, not all institutions are able to bear these expenses. Our comment explains how the current exemption’s prohibition on sharing beyond collaboration and verification—and the consequent duplication of prior labor—“create[s] barriers that can prevent smaller and less-well-resourced institutions from conducting TDM research at all.” This creates inequity in which types of institutions can support TDM projects and which types of researchers can conduct them. The unfortunate result has been that large institutions with “the resources to compensate and maintain technical staff and infrastructure” are able to support TDM research under the exemption, while smaller institutions are not.

Values of Corpora Sharing

Our comment explains how allowing limited sharing of corpora under the exemption would go a long way towards lowering barriers to entry for TDM research and ameliorating the equity issues described above. Since digital humanities is already an under-resourced field, the effects of enabling researchers to share their corpora with other academic researchers could be quite profound. 

Researchers who wrote letters in support of the petition described a multitude of exciting projects and have built “a rich set of corpora to study, such as a collection of fiction written by African American writers, a collection of books banned in the United States, and a curated corpus of movies and television with an ‘emphasis on racial, ethnic, sexual, and gender diversity.’” Many of them also recounted requests they have received from other researchers to use their corpora, and their frustration that the exemption’s prohibition on non-collaborative sharing, together with their limited capacity for collaboration, prevented them from sharing these corpora.

Allowing new researchers with new research questions to study these corpora could reveal new insights about these bodies of work. As we explain, “in the same way a single literary work or motion picture can evince multiple meanings based on the lens of analysis used, when different researchers study one corpus, they are able to pose different research questions and apply different methodologies, ultimately revealing new and original findings . . . . Enabling broader sharing and thus, increasing the number of researchers that can study a corpus, will allow a body of works to be better understood beyond the initial ‘limited set of research questions.’”

Fair Use

The 1201 rulemaking process for exemptions to DMCA § 1201’s prohibition on breaking digital locks requires that the proposed activity be a fair use. In the 2021 proceedings, the Office recognized TDM for research and teaching purposes as a fair use. Because the expansion we’re seeking is relatively minor, our comment explains that the types of uses we are asking the Office to permit researchers to make are also fair uses. Our comment explains that each of the four fair use factors favors fair use in the context of the proposed expansion. We further explain why the enhanced sharing the expansion would provide does not harm the market for the original works under factor four: because institutions must lawfully own (or license under a non-time-limited license) the works on which their researchers wish to conduct TDM, it makes no difference from a market standpoint whether researchers bypass technical protection measures themselves or share another institution’s corpus. Copyright holders are not harmed when researchers at one institution share a corpus created by researchers at another, since both institutions must purchase the works in order to be eligible under the exemption.

What’s Next?

If there are parties that oppose our proposed expansion, they have until February 20th to submit opposition comments to the Copyright Office. Then, on March 19th, our reply comments to any opposition comments will be due. We will keep our readers and members apprised as the process continues to move forward.

Copyright Office Recommends Renewal of the Existing Text Data Mining Exemptions for Literary Works and Films

Posted October 19, 2023

Authors Alliance is delighted to announce that the Copyright Office has recommended that the Librarian of Congress renew both of the exemptions to DMCA liability for text and data mining in its Notice of Proposed Rulemaking for this year’s DMCA exemptions, released today. While the Librarian of Congress could technically disagree with the recommendation to renew, this rarely if ever happens in practice. 

Renewal Petitions and Recommendations

Authors Alliance petitioned the Office to renew the exemptions in July, along with our co-petitioners the American Association of University Professors and the Library Copyright Alliance. Then, the Office entertained comments from stakeholders and the public at large who wished to make statements in support of or in opposition to renewal of the existing exemptions, before drawing conclusions about renewal in today’s notice. 

The Office did not receive any comments arguing against renewal of the TDM exemption for literary works distributed electronically; our petition was unopposed. The Office agreed with Authors Alliance and our co-petitioners, LCA and AAUP, observing that “researchers are actively relying on the current exemption” and citing an example of such research that we highlighted in our petition. Apparently agreeing with our statement that there have not been “material changes in facts, law, technology, or other circumstances” since the 1201 rulemaking cycle in which the exemption was originally obtained, the Office stated that it intended to recommend that the exemption be renewed.

Our renewal petition for the text and data mining exemption for motion pictures, which is identical to the literary works exemption in all respects but the type of works involved, did receive one opposition comment, but the Copyright Office found that it did not meet the standard for meaningful opposition and recommended renewal. DVD CCA (the DVD Copy Control Association) and AACS LA (the Advanced Access Content System Licensing Administrator) submitted a joint comment arguing that a statement in our petition indicated that there had been a change in the facts surrounding the exemption. More specifically, they argued that our statement that “[c]ommercially licensed text and data mining products continue to be made available to research institutions” constituted an admission that new licensed databases for motion pictures had emerged since the previous rulemaking. DVD CCA and AACS LA did not actually offer any evidence of the emergence of new licensed databases for motion pictures. We believed this opposition comment was without merit: while licensed databases for text and data mining of audiovisual works are not as prevalent as those for text-based works, some were available during the 2021 rulemaking and continue to be available today. We are pleased that the Office agreed, citing the previous rulemaking record as supporting evidence.

Expansions and Next Steps

In addition to requesting that the Office renew the current exemptions, we (along with AAUP and LCA) also requested that the Office consider expanding these exemptions to enhance a researcher’s ability to share their corpus with other researchers that are not their direct collaborators. The two processes run in parallel, and today’s announcement means that even if we do not ultimately obtain expanded exemptions, the existing exemptions are very likely to be renewed. 

In its NPRM, the Office also announced deadlines for the various submissions that petitions for expansions and new exemptions will require. The first round of comments in support of  our proposed expansion—including documentary evidence from researchers who are being adversely affected by the limited sharing permitted under the existing exemptions—will be due December 22nd. Opposition comments are due February 20, 2024. Reply comments to these opposition comments are then due March 24, 2024. Then, later in the spring, there will be a hearing with the Copyright Office regarding our proposed expansion. We will—as always—keep our readers apprised as the process moves forward. 

Authors Alliance and Allies Petition to Renew and Expand Text Data Mining Exemption

Posted September 6, 2023

Authors Alliance is pleased to announce that in recent weeks we have submitted petitions to the Copyright Office requesting that it recommend renewing and expanding the existing text and data mining exemptions to DMCA liability, making the current legal carve-out that enables text and data mining more flexible so that researchers can share their corpora of works with other researchers who want to conduct their own text and data mining research. On each of these petitions, we were joined by two co-petitioners, the American Association of University Professors and the Library Copyright Alliance. These were short filings—requesting changes and providing brief explanations—and will be the first of many in our efforts to obtain a renewal and expansion of the existing TDM exemptions.

Background

The Digital Millennium Copyright Act (DMCA) includes a provision that forbids people from bypassing technical protection measures on copyrighted works. But it also implements a triennial rulemaking process whereby organizations and individuals can petition for temporary exemptions to this rule. The Office recommends an exemption when its proponents show that they, or those they represent, are “adversely affected in their ability to make noninfringing [fair] uses due to the prohibition on circumventing access controls.” Every three years, petitioners must ask the Office to renew existing exemptions in order for them to continue to apply. Petitioners can also ask the Office to recommend expanding an existing exemption, which requires the same filings and procedure as petitioning for a new exemption. 

Back in 2020, during the eighth of these triennial rulemakings, Authors Alliance—along with the Library Copyright Alliance and the American Association of University Professors—petitioned the Copyright Office to create an exemption to DMCA liability that would enable researchers to conduct text and data mining. Text and data mining is a fair use, and the DMCA prohibitions on bypassing DRM and similar technical protection measures made it difficult or even impossible for researchers to conduct text and data mining on in-copyright e-books and films. After a long process which included filing a comment in support of the exemption and an ex parte meeting with the Copyright Office, the Office ultimately recommended that the Librarian of Congress grant our proposed exemption (which she did). The Office also recommended that the exemption be split into two parts, with one exemption addressing literary works distributed electronically, and the other addressing films. 

While the ninth triennial rulemaking does not technically happen until 2024, petitions for renewals, expansions, and new exemptions have already been filed. 

Our Petitions

Back in early July, we made our first filings with the Copyright Office in the form of renewal petitions for both exemptions. For this step, proponents of current exemptions simply ask the Copyright Office to renew them for another three-year cycle, accompanied by a short explanation of whether and how the exemption is being used and a statement that neither law nor technology has changed such that the exemption is no longer warranted. Other parties are then given an opportunity to respond to or oppose renewal petitions. The Office recommends that exemption proponents who want to expand a current exemption also petition for its renewal—which is just what we did. In our renewal petitions, we explained how researchers are using the exemptions and how neither recent case law nor the continued availability of licensed TDM databases represents a change in the law or technology, making renewal of the TDM exemptions proper and justified. Renewal petitions follow a streamlined process and are generally granted unless the Office finds “meaningful opposition” articulating a change in the law or facts. You can find our renewal petition for the literary works TDM exemption here, and our renewal petition for the film TDM exemption here.

But we also sought to expand the current exemptions, in two petitions submitted a few weeks back. In our expansion petitions, we proposed a simple change that we would like to see made to the current DMCA exemptions for text data mining. In the exemption’s current form, academic researchers can bypass technical protection measures to assemble a corpus on which to conduct TDM research, but they can only share it with other researchers for purposes of “collaboration and verification.” We asked the Office to permit these researchers to share their corpora with other researchers who want to use the corpus to conduct TDM research, but are not direct collaborators. However, this second group of researchers would still have to comply with the various requirements of the exemption, such as complying with security measures. Essentially, we seek to expand the sharing provision of the current exemption while leaving the other provisions intact. This is largely based on feedback we have received from those using the exemption and our understanding of how the regulation can be improved so that their desired noninfringing uses are no longer adversely affected by this limitation. You can find our expansion petition for the literary works TDM exemption here, and our expansion petition for the film TDM exemption here.

What’s Next?

The next step in the triennial rulemaking process is the Copyright Office issuing a notice of proposed rulemaking, where it will lay out its plan of action. While we do not have a set timeline for the notice of proposed rulemaking, during the last rulemaking cycle, it happened in mid-October—meaning it is reasonable to expect the Office to issue this notice in the next two months or so. Then, there will be several rounds of comments in support of or in opposition to the proposals. Finally, the Office will issue a final recommendation, and the Librarian of Congress will issue a final rule. While the Librarian of Congress is not legally obligated to adopt the Copyright Office’s recommendations, they traditionally do. Based on last year’s cycle, we can expect a final rule to be issued around October 2024. So we are in for a long wait and a lot of work! We will keep our readers updated as the rulemaking moves forward.

Prosecraft, text and data mining, and the law

Posted August 14, 2023

Last week you may have read about a website called prosecraft.io, a site with an index of some 25,000 books. For each book it provided a variety of statistics about the text (how long, how many adverbs, how much passive voice), along with a chart showing sentiment analysis of the work, and it displayed short snippets from the text itself: the two paragraphs it judged most and least vivid. Overall, it was a somewhat interesting tool, promoted to authors as a way to better understand how their work compares to other published works.
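For a sense of how mechanical these statistics are, here is an illustrative sketch, assuming Python with NLTK and its tokenizer and part-of-speech tagger data installed. It is not prosecraft’s actual code, which was never published; it simply shows the standard way such counts are computed.

```python
# Illustrative only -- not prosecraft's code. Counts words, adverbs, and a
# crude passive-voice signal from part-of-speech tags.
import nltk  # assumes 'punkt' and 'averaged_perceptron_tagger' are downloaded

text = "The door was opened slowly. She quickly ran outside."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

n_words = sum(1 for _, tag in tagged if tag[0].isalpha())
n_adverbs = sum(1 for _, tag in tagged if tag.startswith("RB"))

# Rough passive heuristic: a form of "be" followed by a past participle.
BE = {"am", "is", "are", "was", "were", "be", "been", "being"}
n_passive = sum(
    1
    for (word, _), (_, next_tag) in zip(tagged, tagged[1:])
    if word.lower() in BE and next_tag == "VBN"
)

print(f"words={n_words} adverbs={n_adverbs} passive-like={n_passive}")
```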

The news cycle about prosecraft.io centered on the campaign to get its creator, Benji Smith, to take the site down (he now has) based on allegations of copyright infringement. A Gizmodo story about it generated lots of attention, and it’s been written up extensively, for example here, here, here, and here.

It’s written about enough that I won’t repeat the whole saga here. However, I think a few observations are worth sharing:  

1) Don’t get your legal advice from Twitter (or whatever it’s called)

“Fair Use does not, by any stretch of the imagination, allow you to use an author’s entire copyrighted work without permission as a part of a data training program that feeds into your own ‘AI algorithm.’” – Linda Codega, Gizmodo (a sentiment that was retweeted extensively)

Fair use actually allows quite a few situations in which you can copy an entire work, including situations in which you can use it as part of a data training program (and calling an algorithm “AI” doesn’t magically transform it into something unlawful). For example, way back in 2002 in Kelly v. Arriba Soft, the 9th Circuit concluded that it was fair use to make full copies of images found on the internet for the purpose of enabling web image search. Similarly, in A.V. ex rel. Vanderhye v. iParadigms, the 4th Circuit in 2009 concluded that it was fair use to make full-text copies of academic papers for use in a plagiarism detection tool.

Most relevant to prosecraft, in Authors Guild v. HathiTrust (2014) and Authors Guild v. Google (2015), the Second Circuit held that copying millions of books for purposes of creating massive full-text search engines of their contents was fair use. Google produced full-text searchable databases of the works and displayed short snippets containing whatever term the user had searched for (quite similar to prosecraft’s outputs). That functionality also enabled a wide range of computer-aided textual analysis, as the court explained:

The search engine also makes possible new forms of research, known as “text mining” and “data mining.” Google’s “ngrams” research tool draws on the Google Library Project corpus to furnish statistical information to Internet users about the frequency of word and phrase usage over centuries.  This tool permits users to discern fluctuations of interest in a particular subject over time and space by showing increases and decreases in the frequency of reference and usage in different periods and different linguistic regions. It also allows researchers to comb over the tens of millions of books Google has scanned in order to examine “word frequencies, syntactic patterns, and thematic markers” and to derive information on how nomenclature, linguistic usage, and literary style have changed over time. Authors Guild, Inc., 954 F.Supp.2d at 287. The district court gave as an example “track[ing] the frequency of references to the United States as a single entity (‘the United States is’) versus references to the United States in the plural (‘the United States are’) and how that usage has changed over time.”
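As a rough illustration of the n-gram analysis the court describes, here is a minimal sketch in Python; the two-entry corpus is a toy stand-in for the millions of scanned books behind Google’s ngrams tool.

```python
# Count how often a phrase occurs, per word of text, in texts grouped by
# year -- the core of an ngrams-style frequency query.
from collections import Counter

corpus_by_year = {  # toy stand-in for a scanned-book corpus
    1850: "the united states are a young nation and the states are many",
    1950: "the united states is a world power and the nation is strong",
}

def phrase_rate(phrase: str, text: str) -> float:
    """Occurrences of the phrase per word of text."""
    words = text.lower().split()
    n = len(phrase.split())
    grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams[phrase] / max(len(words), 1)

for year, text in corpus_by_year.items():
    print(year,
          round(phrase_rate("the united states are", text), 3),
          round(phrase_rate("the united states is", text), 3))
```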

While there are a number of generative AI cases pending (a nice summary of them is here) that I agree raise some additional legal questions beyond those directly answered in Google Books, the kind of textual analysis that prosecraft.io offered seems remarkably similar to the kinds of things that the courts have already said are permissible fair uses. 

2) Text and data mining analysis has broad benefits

Not only is text mining fair use, it also yields some amazing insights that truly “promote the progress of Science,” which is what copyright law is all about. Prosecraft offered some pretty basic insights into published books – how long, how many adverbs, and the like. I can understand opinions being split on whether that kind of information is actually helpful for current or aspiring authors. But text mining can reveal so much more.

In the submission Authors Alliance made to the US Copyright Office three years ago in support of a Section 1201 Exemption permitting text data mining, we explained:

TDM makes it possible to sift through substantial amounts of information to draw groundbreaking conclusions. This is true across disciplines. In medical science, TDM has been used to perform an overview of a mass of coronavirus literature. Researchers have also begun to explore the technique’s promise for extracting clinically actionable information from biomedical publications and clinical notes. Others have assessed its promise for drawing insights from the masses of medical images and associated reports that hospitals accumulate.

In social science, studies have used TDM to analyze job advertisements to identify direct discrimination during the hiring process. It has also been used to study police officer body-worn camera footage, uncovering that police officers speak less respectfully to Black than to white community members even under similar circumstances.

TDM also shows great promise for drawing insights from literary works and motion pictures. Regarding literature, some 221,597 fiction books were printed in English in 2015 alone, more than a single scholar could read in a lifetime. TDM allows researchers to “‘scale up’ more familiar humanistic approaches and investigate questions of how literary genres evolve, how literary style circulates within and across linguistic contexts, and how patterns of racial discourse in society at large filter down into literary expression.” TDM has been used to “observe trends such as the marked decline in fiction written from a first-person point of view that took place from the mid-late 1700s to the early-mid 1800s, the weakening of gender stereotypes, and the staying power of literary standards over time.” Those who apply TDM to motion pictures view the technique as every bit as promising for their field. Researchers believe the technique will provide insight into the politics of representation in the Network era of American television, into what elements make a movie a Hollywood blockbuster, and into whether it is possible to identify the components that make up a director’s unique visual style [citing numerous letters in support of the TDM exemption from researchers].

3) Text and data mining is not new and it’s not a threat to authors

Text mining of the sort prosecraft appears to have employed isn't some new phenomenon. Marti Hearst, a professor at UC Berkeley's iSchool, explained the basics in this classic 2003 piece. Each year, scores of computer science students build course projects that do almost exactly what prosecraft was doing. Textbooks like Matt Jockers's Text Analysis with R for Students of Literature have been widely adopted across the U.S. to teach these techniques. Our submissions during our 2020 petition for the DMCA exemption for text and data mining included 14 separate letters of support from authors and researchers engaged in text data mining research, and even more researchers are currently working on TDM projects. While fears over generative AI may be justified for some creators (and we are certainly not oblivious to the threat of various forms of economic displacement), it's important to remember that text data mining on textual works is not the same as generative AI. On the contrary, it is a fair use that enriches and deepens our understanding of literature rather than harming the authors who create it.

An Update on our Text and Data Mining: Demonstrating Fair Use Project

Posted April 28, 2023

Back in December we announced a new Authors Alliance project, Text and Data Mining: Demonstrating Fair Use, which aims to lower and overcome legal barriers for researchers who seek to exercise their fair use rights, specifically within the context of text data mining ("TDM") research under current regulatory exemptions. We've heard from lots of you about the need for support in navigating the law in this area. This post gives a few updates. 

Text and Data Mining Workshops and Consultations

We’ve had a tremendous amount of interest and engagement with our offers to hold hands-on workshops and trainings on the scope of legal rights for TDM research. Already this spring, we’ve been able to hold two workshops in the Research Triangle hosted at Duke University, and a third workshop at Stanford followed by a lively lunch-time discussion. We have several more coming. Our next stop is in a few weeks at the University of Michigan, and we have plans in the works for workshops in the Boston area, New York, a few locations on the West Coast, and potentially others as well. If you are interested in attending or hosting a workshop with TDM researchers, librarians, or other research support staff, please let us know! We’d love to hear from you. The feedback so far has been really encouraging, and we have heard both from current TDM researchers and those for whom the workshops have opened their eyes to new possibilities. 

ACH Webinar: Overcoming Legal Barriers to Text and Data Mining
Join us! In addition to the hands-on in-person workshops on university campuses, we’re also offering online webinars on overcoming legal barriers to text and data mining. Our first is hosted by the Association for Computers and the Humanities on May 15 at 10am PT / 1pm ET. All are welcome to attend, and we’d love to see you online!
Read more and register here. 

Research 

A second aspect of our project is to research how the current law can both help and hinder TDM researchers, with specific attention to fair use and the DMCA exemption that Authors Alliance obtained for TDM researchers to break digital locks when building a corpus of digital content such as ebooks or DVDs.

Christian Howard-Sukhil, Authors Alliance Text and Data Mining Legal Fellow

To that end, we’re excited to announce that Christian Howard-Sukhil will be joining Authors Alliance as our Text and Data Mining Legal Fellow. Christian holds a PhD in English Language and Literature from the University of Virginia and is currently pursuing a JD from the UC Berkeley School of Law. Christian has extensive digital humanities and text data mining experience, including in previous roles at UVA and Bucknell University. Her work with Authors Alliance will focus on researching and writing about the ways that current law helps or hinders text and data mining researchers in the real world. 

The research portion of this project is focused on the practical implications of the law and will be based heavily on feedback we hear from TDM researchers. We’ve already had the opportunity to gather some feedback from researchers including through the workshops mentioned above, and plan to do more systematic outreach over the coming months. Again, if you’re working in this field (or want to but can’t because of concerns about legal issues), we’d love to hear from you. 

At this stage we want to share some preliminary observations, based on recent research into these issues (supported by the work of several teams of student clinicians) as well as our recent and ongoing work with TDM researchers:

1) License restrictions are a problem. We've heard clearly that licenses and terms of use impose a significant barrier to TDM research. While researchers are able to identify uses that would qualify as fair use, and many that likely qualify under the DMCA exemption, the terms of use accompanying ebook licenses can override both. These terms vary, from very specific prohibitions (e.g., Amazon's, which says that users "may not attempt to bypass, modify, defeat, or otherwise circumvent any digital rights management system") to more general prohibitions on uses that go beyond the specific permissions of the license (e.g., Apple's terms, which state that "No portion of the Content or Services may be transferred or reproduced in any form or by any means, except as expressly permitted"). Even academic licenses, often negotiated by university libraries to secure more favorable terms, can still impose significant restrictions on reuse for TDM purposes. Although we haven't heard of aggressive enforcement of those terms to restrict academic uses, the mere existence of those terms can chill real-world research using TDM techniques.

The problem of licenses overriding researchers' rights under fair use and other parts of copyright law is, of course, not limited to inhibiting text and data mining research. We wrote about the issue, and how easily contracts can evade fair use, a few months ago, discussing the many ways that restrictive license terms can inhibit normal, everyday uses of works such as criticism, commentary, and quotation. We are currently working on a separate paper documenting the scope and extent of "contractual override," and will take part in a symposium on the subject in May, hosted by the Association of Research Libraries and the American University, Washington College of Law Program on Information Justice and Intellectual Property.

2) The TDM exemption is flexible, but local interpretation and support can vary. We've heard that the current TDM exemption, which allows researchers to break technological protection measures such as DRM on ebooks and CSS on DVDs, is an important tool to facilitate research on modern digital works. And we believe the terms of that exemption are sufficiently flexible to meet the needs of a variety of research applications (how wide a variety remains to be seen through more research). But local understanding and support for researchers using the exemption can vary. 

For example, the exemption requires that the university with which the TDM research is associated implement "effective security measures" to ensure that the corpus of copyrighted works isn't used for another purpose. The regulation further explains that, in the absence of a standard negotiated with content holders, "effective security measures" means "measures that the institution uses to keep its own highly confidential information secure." University IT data security standards don't always use the same language or define their standards to cover "highly confidential information," so university IT offices must interpret this language and implement the standard in their own local context. This can create confusion about what precisely universities need to do to secure TDM corpora. 

Some of these definitional issues are likely growing pains: the exemption is still new, and universities need time to understand and implement standards that satisfy its terms in a reasonable way. Still, it will be important to explore further where confusion arises over similar terms and how it might best be resolved. 

3) Collaboration and sharing are important. Text and data mining projects are often conceived of as part of a much larger research agenda, with multiple potential research outputs from both the initial inquiry and follow-up studies involving a number of researchers, sometimes from a number of institutions. Fair use clearly allows for collaborative TDM work; in Authors Guild v. HathiTrust, a foundational fair use case for TDM research in the US, HathiTrust itself was a collective of research institutions with shared digital assets. Likewise, the TDM exemption permits a university to provide access to "researchers affiliated with other institutions of higher education solely for purposes of collaboration or replication of the research." The collaborative aspect of this work raises some challenging questions, both operationally and conceptually. For example, the exemption for breaking digital locks doesn't define precisely who qualifies as an "affiliated" researcher, leaving open questions for universities implementing the regulation. More conceptually, research collaboration raises questions about how precisely the TDM purpose must be defined when building a corpus under the existing exemption, for example when researchers collaborate but investigate different research questions over time. Finally, actually sharing copies of the corpus with researchers at other institutions matters because, at least in some cases, local computing power is needed to engage effectively with the data. 

Again, this is just preliminary research, but it raises some interesting and important questions! If you are working in this area in any capacity, we'd love to talk. The easiest way to reach us is at info@authorsalliance.org.

Want to Learn More?
This current Authors Alliance project is generously supported by the Mellon Foundation, which has also supported a number of other important text and data mining projects. We've been fortunate to be part of a broader network of individuals and organizations devoted to lowering legal barriers for TDM researchers. This includes efforts spearheaded by a team at UC Berkeley to produce "Legal Literacies for Text Data Mining," along with that team's current project addressing cross-border TDM research, as well as efforts from the Global Network on Copyright and User Rights, which has (among other things) led work on copyright exceptions for TDM globally.