Category Archives: Text and Data Mining

Text Data Mining Research DMCA Exemption Renewed and Expanded

Posted October 25, 2024
U.S. Copyright Office 1201 Rulemaking Process, taken from https://www.copyright.gov/1201/

Earlier today, the Library of Congress, following recommendations from the U.S. Copyright Office, released its final rule adopting exemptions to the Digital Millennium Copyright Act’s prohibition on circumvention of technological protection measures (e.g., DRM).

As many of you know, we’ve been working closely with members of the text and data-mining community as well as our co-petitioners, the Library Copyright Alliance (LCA) and the American Association of University Professors (AAUP), to petition for renewal of the existing TDM research exemption and to expand it to allow researchers to share their research corpora with other researchers outside of their university (something not previously allowed). The process began over a year ago and followed an in-depth review process by the U.S. Copyright Office. 

We are very pleased to see that the Librarian of Congress both approved the renewal of the existing exemption and approved an expansion that allows for research universities to provide access to TDM corpora for use by researchers at other universities. 

The expanded rule is poised to make an immediate impact by helping TDM researchers collaborate and build upon each other’s work. As Allison Cooper, director of Kinolab and Associate Professor of Romance Languages and Literatures and Cinema Studies at Bowdoin College, explains:

“This decision will have an immediate impact on the ongoing close-up project that Joel Burges, Emily Sherwood, and I are working on by allowing us to collaborate with researchers like David Bamman, whose expertise in machine learning will be valuable in answering many of the ‘big picture’ questions about the close-up that have come up in our work so far.”

These are the main takeaways from the new rule: 

  • The exemption has been expanded to allow “access” to corpora by researchers at other institutions “solely for purposes of text and data mining research or teaching.” There is no more requirement that access be granted as part of a “collaboration,” so new researchers can ask new and different questions of a corpus. Access must be credentialed and authenticated.
  • The issue of whether a researcher can engage in “close viewing” of a copyrighted work has been resolved—as the explanation for the revised rule puts it, researchers can “view the contents of copyrighted works as part of their research, provided that any viewing that takes place is in furtherance of research objectives (e.g., processing or annotating works to prepare them for analysis) and not for the works’ expressive value.” This is a very helpful clarification!
  • The new rule also modified the existing security requirements, which provide that researchers must put in place adequate security protocols to protect TDM corpora from unauthorized reuse and must share information about those security protocols with rightsholders upon request. That rule has been limited in some ways and expanded in others. The new rule clarifies that trade associations can send inquiries on behalf of rightsholders. However, inquiries must be supported by a “reasonable belief” that the sender’s works are in a corpus being used for TDM research.

Later on, we will post a more in-depth analysis of the new rules–both TDM and others that apply to authors. The Librarian of Congress also authorized the renewal of a number of other rules that support research, teaching, and library preservation. Among them is a renewal of another exemption that Authors Alliance and AAUP petitioned for, allowing for the circumvention of digital locks when using motion picture excerpts in multi-media ebooks. 

Thank you to all of the many, many TDM researchers and librarians we’ve worked with over the last several years to help support this petition. 

You can learn more about TDM and our work on this issue through our TDM resources page, here.

Copyright Office Recommends Renewal of the Existing Text Data Mining Exemptions for Literary Works and Films

Posted October 19, 2023

Authors Alliance is delighted to announce that the Copyright Office has recommended that the Librarian of Congress renew both of the exemptions to DMCA liability for text and data mining in its Notice of Proposed Rulemaking for this year’s DMCA exemptions, released today. While the Librarian of Congress could technically disagree with the recommendation to renew, this rarely if ever happens in practice. 

Renewal Petitions and Recommendations

Authors Alliance petitioned the Office to renew the exemptions in July, along with our co-petitioners the American Association of University Professors and the Library Copyright Alliance. Then, the Office entertained comments from stakeholders and the public at large who wished to make statements in support of or in opposition to renewal of the existing exemptions, before drawing conclusions about renewal in today’s notice. 

The Office did not receive any comments arguing against renewal of the TDM exemption for literary works distributed electronically; our petition was unopposed. The Office agreed with Authors Alliance and our co-petitioners, LCA and AAUP, observing that “researchers are actively relying on the current exemption” and citing an example of such research that we highlighted in our petition. Apparently agreeing with our statement that there have not been “material changes in facts, law, technology, or other circumstances” since the 1201 rulemaking cycle when the exemption was originally obtained, the Office stated it intended to recommend that the exemption be renewed.

Our renewal petition for the text and data mining exemption for motion pictures, which is identical to the literary works exemption in all respects but the type of works involved, did receive one opposition comment, but the Copyright Office found that it did not meet the standard for meaningful opposition, and recommended renewal. DVD CCA (the DVD Copyright Control Association) and AACS LA (the Advanced Access Content System Licensing Administrator) submitted a joint comment arguing that a statement in our petition indicated that there had been a change in the facts surrounding the exemption. More specifically, they argued that our statement that “[c]ommercially licensed text and data mining products continue to be made available to research institutions” constituted an admission that new licensed databases for motion pictures had emerged since the previous rulemaking. DVD CCA and AACS LA did not actually offer any evidence of the emergence of new licensed databases for motion pictures. We believed this opposition comment was without merit: while licensed databases for text and data mining of audiovisual works are not as prevalent as licensed databases for text and data mining of text-based works, some were available during the 2021 rulemaking, and continue to be available today. We are pleased that the Office agreed, citing the previous rulemaking record as supporting evidence.

Expansions and Next Steps

In addition to requesting that the Office renew the current exemptions, we (along with AAUP and LCA) also requested that the Office consider expanding these exemptions to enhance a researcher’s ability to share their corpus with other researchers that are not their direct collaborators. The two processes run in parallel, and today’s announcement means that even if we do not ultimately obtain expanded exemptions, the existing exemptions are very likely to be renewed. 

In its NPRM, the Office also announced deadlines for the various submissions that petitions for expansions and new exemptions will require. The first round of comments in support of  our proposed expansion—including documentary evidence from researchers who are being adversely affected by the limited sharing permitted under the existing exemptions—will be due December 22nd. Opposition comments are due February 20, 2024. Reply comments to these opposition comments are then due March 24, 2024. Then, later in the spring, there will be a hearing with the Copyright Office regarding our proposed expansion. We will—as always—keep our readers apprised as the process moves forward. 

Authors Alliance and Allies Petition to Renew and Expand Text Data Mining Exemption

Posted September 6, 2023

Authors Alliance is pleased to announce that in recent weeks, we have submitted petitions to the Copyright Office requesting that it recommend renewing and expanding the existing text data mining exemptions to DMCA liability. The expansion would make the current legal carve-out that enables text and data mining more flexible, so that researchers can share their corpora of works with other researchers who want to conduct their own text data mining research. On each of these petitions, we were joined by two co-petitioners, the American Association of University Professors and the Library Copyright Alliance. These were short filings (requesting changes and providing brief explanations) and will be the first of many in our efforts to obtain a renewal and expansion of the existing TDM exemptions.

Background

The Digital Millennium Copyright Act (DMCA) includes a provision that forbids people from bypassing technical protection measures on copyrighted works. But it also implements a triennial rulemaking process whereby organizations and individuals can petition for temporary exemptions to this rule. The Office recommends an exemption when its proponents show that they, or those they represent, are “adversely affected in their ability to make noninfringing [fair] uses due to the prohibition on circumventing access controls.” Every three years, petitioners must ask the Office to renew existing exemptions in order for them to continue to apply. Petitioners can also ask the Office to recommend expanding an existing exemption, which requires the same filings and procedure as petitioning for a new exemption. 

Back in 2020, during the eighth of these triennial rulemakings, Authors Alliance—along with the Library Copyright Alliance and the American Association of University Professors—petitioned the Copyright Office to create an exemption to DMCA liability that would enable researchers to conduct text and data mining. Text and data mining is a fair use, and the DMCA prohibitions on bypassing DRM and similar technical protection measures made it difficult or even impossible for researchers to conduct text and data mining on in-copyright e-books and films. After a long process which included filing a comment in support of the exemption and an ex parte meeting with the Copyright Office, the Office ultimately recommended that the Librarian of Congress grant our proposed exemption (which she did). The Office also recommended that the exemption be split into two parts, with one exemption addressing literary works distributed electronically, and the other addressing films. 

While the ninth triennial rulemaking does not technically happen until 2024, petitions for renewals, expansions, and new exemptions have already been filed. 

Our Petitions

Back in early July, we made our first filings with the Copyright Office in the form of renewal petitions for both exemptions. For this step, proponents of current exemptions simply ask the Copyright Office to renew them for another three-year cycle, accompanied by a short explanation of whether and how the exemption is being used and a statement that neither law nor technology has changed such that the exemption is no longer warranted. Other parties are then given an opportunity to respond to or oppose renewal petitions. The Office recommends that exemption proponents who want to expand a current exemption also petition for its renewal, which is just what we did. In our renewal petitions, we explained how researchers are using the exemptions and how neither recent case law nor the continued availability of licensed TDM databases represents a change in the law or technology, making renewal of the TDM exemptions proper and justified. The renewal petitions follow a streamlined process, where they are generally simply granted unless the Office finds there to be “meaningful opposition” to a renewal petition, articulating a change in the law or facts. You can find our renewal petition for the literary works TDM exemption here, and our renewal petition for the film TDM exemption here.

But we also sought to expand the current exemptions, in two petitions submitted a few weeks back. In our expansion petitions, we proposed a simple change that we would like to see made to the current DMCA exemptions for text data mining. In the exemption’s current form, academic researchers can bypass technical protection measures to assemble a corpus on which to conduct TDM research, but they can only share it with other researchers for purposes of “collaboration and verification.” We asked the Office to permit these researchers to share their corpora with other researchers who want to use the corpus to conduct TDM research, but are not direct collaborators. However, this second group of researchers would still have to comply with the various requirements of the exemption, such as complying with security measures. Essentially, we seek to expand the sharing provision of the current exemption while leaving the other provisions intact. This is largely based on feedback we have received from those using the exemption and our understanding of how the regulation can be improved so that their desired noninfringing uses are no longer adversely affected by this limitation. You can find our expansion petition for the literary works TDM exemption here, and our expansion petition for the film TDM exemption here.

What’s Next?

The next step in the triennial rulemaking process is for the Copyright Office to issue a notice of proposed rulemaking, where it will lay out its plan of action. While we do not have a set timeline for the notice of proposed rulemaking, during the last rulemaking cycle it happened in mid-October, meaning it is reasonable to expect the Office to issue this notice in the next two months or so. Then, there will be several rounds of comments in support of or in opposition to the proposals. Finally, the Office will issue a final recommendation, and the Librarian of Congress will issue a final rule. While the Librarian of Congress is not legally obligated to adopt the Copyright Office’s recommendations, the Librarian traditionally does. Based on the last cycle, we can expect a final rule to be issued around October 2024. So we are in for a long wait and a lot of work! We will keep our readers updated as the rulemaking moves forward.

Prosecraft, text and data mining, and the law

Posted August 14, 2023

Last week you may have read about prosecraft.io, a website with an index of some 25,000 books. The site provided a variety of data about the texts (how long, how many adverbs, how much passive voice), along with a chart showing sentiment analysis of the works in its collection, and it displayed short snippets from the texts themselves: two paragraphs representing the most and least vivid passages from each text. Overall, it was a somewhat interesting tool, promoted to authors as a way to better understand how their work compares to other published works.

The news cycle about prosecraft.io was about the campaign to get its creator Benji Smith to take the site down (he now has) based on allegations of copyright infringement. A Gizmodo story about it generated lots of attention, and it’s been written up extensively, for example here, here, here, and here.  

It’s been written about enough that I won’t repeat the whole saga here. However, I think a few observations are worth sharing:

1) Don’t get your legal advice from Twitter (or whatever it’s called)

“Fair Use does not, by any stretch of the imagination, allow you to use an author’s entire copyrighted work without permission as a part of a data training program that feeds into your own ‘AI algorithm.’”  – Linda Codega, Gizmodo (a sentiment that was retweeted extensively)

Fair use actually allows quite a few situations where you can copy an entire work, including situations when you can use it as part of a data training program (and calling an algorithm “AI” doesn’t magically transform it into something unlawful). For example, way back in 2002 in Kelly v. Arriba Soft, the 9th Circuit concluded that it was fair use to make complete copies of images found on the internet for the purpose of enabling web image search. Similarly, in A.V. ex rel. Vanderhye v. iParadigms, the 4th Circuit in 2009 concluded that it was fair use to make full text copies of academic papers for use in a plagiarism detection tool.

Most relevant to prosecraft, in Authors Guild v. HathiTrust (2014) and Authors Guild v. Google (2015), the Second Circuit held that Google’s copying of millions of books for purposes of creating a massive search engine of their contents was fair use. Google produced full-text searchable databases of the works, and displayed short snippets containing whatever term the user had searched for (quite similar to prosecraft’s outputs). That functionality also enabled a wide range of computer-aided textual analysis, as the court explained:

The search engine also makes possible new forms of research, known as “text mining” and “data mining.” Google’s “ngrams” research tool draws on the Google Library Project corpus to furnish statistical information to Internet users about the frequency of word and phrase usage over centuries.  This tool permits users to discern fluctuations of interest in a particular subject over time and space by showing increases and decreases in the frequency of reference and usage in different periods and different linguistic regions. It also allows researchers to comb over the tens of millions of books Google has scanned in order to examine “word frequencies, syntactic patterns, and thematic markers” and to derive information on how nomenclature, linguistic usage, and literary style have changed over time. Authors Guild, Inc., 954 F.Supp.2d at 287. The district court gave as an example “track[ing] the frequency of references to the United States as a single entity (‘the United States is’) versus references to the United States in the plural (‘the United States are’) and how that usage has changed over time.”
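To make the court's example concrete, here is a minimal sketch in Python of the kind of counting it describes: tallying how often “the United States is” and “the United States are” appear in each time period. The toy corpus, the function names, and the 50-year bucket size are all invented for illustration; this is not how Google's ngrams tool or prosecraft actually worked.

```python
import re
from collections import Counter

# Hypothetical toy corpus of (publication year, text) pairs, invented for
# illustration only. A real study would draw on thousands of full texts.
TOY_CORPUS = [
    (1850, "The United States are a union. The United States are growing."),
    (1855, "Some say the United States is one nation."),
    (1900, "The United States is a single nation. The United States is strong."),
    (1905, "Older writers held that the United States are plural."),
]


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def phrase_counts_by_period(corpus, phrases, period=50):
    """Return {period_start_year: Counter({phrase: count})}."""
    result = {}
    for year, text in corpus:
        start = (year // period) * period  # e.g. 1855 falls in the 1850 bucket
        bucket = result.setdefault(start, Counter())
        for phrase in phrases:
            bucket[phrase] += count_phrase(text, phrase)
    return result


counts = phrase_counts_by_period(
    TOY_CORPUS, ["the United States is", "the United States are"]
)
for start in sorted(counts):
    print(start, dict(counts[start]))
```

Note that the output is only a table of counts per period. Nothing in it reproduces the underlying texts, which is part of why courts have treated this kind of analysis as a different use from reading the works themselves.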

While there are a number of generative AI cases pending (a nice summary of them is here) that I agree raise some additional legal questions beyond those directly answered in Google Books, the kind of textual analysis that prosecraft.io offered seems remarkably similar to the kinds of things that the courts have already said are permissible fair uses. 

2) Text and data mining analysis has broad benefits

Not only is text mining fair use, it also yields some amazing insights that truly “promote the progress of Science,” which is what copyright law is all about. Prosecraft offered some pretty basic insights into published books: how long, how many adverbs, and the like. I can understand opinions being split on whether that kind of information is actually helpful for current or aspiring authors. But text mining can reveal so much more.

In the submission Authors Alliance made to the US Copyright Office three years ago in support of a Section 1201 Exemption permitting text data mining, we explained:

TDM makes it possible to sift through substantial amounts of information to draw groundbreaking conclusions. This is true across disciplines. In medical science, TDM has been used to perform an overview of a mass of coronavirus literature. Researchers have also begun to explore the technique’s promise for extracting clinically actionable information from biomedical publications and clinical notes. Others have assessed its promise for drawing insights from the masses of medical images and associated reports that hospitals accumulate.

In social science, studies have used TDM to analyze job advertisements to identify direct discrimination during the hiring process. It has also been used to study police officer body-worn camera footage, uncovering that police officers speak less respectfully to Black than to white community members even under similar circumstances.

TDM also shows great promise for drawing insights from literary works and motion pictures. Regarding literature, some 221,597 fiction books were printed in English in 2015 alone, more than a single scholar could read in a lifetime. TDM allows researchers to “‘scale up’ more familiar humanistic approaches and investigate questions of how literary genres evolve, how literary style circulates within and across linguistic contexts, and how patterns of racial discourse in society at large filter down into literary expression.” TDM has been used to “observe trends such as the marked decline in fiction written from a first-person point of view that took place from the mid-late 1700s to the early-mid 1800s, the weakening of gender stereotypes, and the staying power of literary standards over time.” Those who apply TDM to motion pictures view the technique as every bit as promising for their field. Researchers believe the technique will provide insight into the politics of representation in the Network era of American television, into what elements make a movie a Hollywood blockbuster, and into whether it is possible to identify the components that make up a director’s unique visual style [citing numerous letters in support of the TDM exemption from researchers].

3) Text and data mining is not new and it’s not a threat to authors

Text mining of the sort prosecraft seems to have employed isn’t some kind of new phenomenon. Marti Hearst, a professor at UC Berkeley’s iSchool, explained the basics in this classic 2003 piece. Each year, scores of computer science students experiment in their courses with projects that do almost exactly what prosecraft was producing. Textbooks like Matt Jockers’s Text Analysis with R for Students of Literature have been widely used and adopted all across the U.S. to teach these techniques. Our submissions during our petition for the DMCA exemption for text and data mining back in 2020 included 14 separate letters of support from authors and researchers engaged in text data mining research, and even more researchers are currently working on TDM projects. While fears over generative AI may be justified for some creators (and we are certainly not oblivious to the threat of various forms of economic displacement), it’s important to remember that text data mining on textual works is not the same as generative AI. On the contrary, it is a fair use that enriches and deepens our understanding of literature rather than harming the authors who create it.

An Update on our Text and Data Mining: Demonstrating Fair Use Project

Posted April 28, 2023

Back in December we announced a new Authors Alliance project, Text and Data Mining: Demonstrating Fair Use, which is about lowering and overcoming legal barriers for researchers who seek to exercise their fair use rights, specifically within the context of text data mining (“TDM”) research under current regulatory exemptions. We’ve heard from lots of you about the need for support in navigating the law in this area. This post gives a few updates.

Text and Data Mining Workshops and Consultations

We’ve had a tremendous amount of interest and engagement with our offers to hold hands-on workshops and trainings on the scope of legal rights for TDM research. Already this spring, we’ve been able to hold two workshops in the Research Triangle hosted at Duke University, and a third workshop at Stanford followed by a lively lunch-time discussion. We have several more coming. Our next stop is in a few weeks at the University of Michigan, and we have plans in the works for workshops in the Boston area, New York, a few locations on the West Coast, and potentially others as well. If you are interested in attending or hosting a workshop with TDM researchers, librarians, or other research support staff, please let us know! We’d love to hear from you. The feedback so far has been really encouraging, and we have heard both from current TDM researchers and from those whose eyes the workshops have opened to new possibilities.

ACH Webinar: Overcoming Legal Barriers to Text and Data Mining
Join us! In addition to the hands-on in-person workshops on university campuses, we’re also offering online webinars on overcoming legal barriers to text and data mining. Our first is hosted by the Association for Computers and the Humanities on May 15 at 10am PT / 1pm ET. All are welcome to attend, and we’d love to see you online!
Read more and register here. 

Research 

A second aspect of our project is to research how the current law can both help and hinder TDM researchers, with specific attention to fair use and the DMCA exemption that Authors Alliance obtained for TDM researchers to break digital locks when building a corpus of digital content such as ebooks or DVDs.

Christian Howard-Sukhil, Authors Alliance Text and Data Mining Legal Fellow

To that end, we’re excited to announce that Christian Howard-Sukhil will be joining Authors Alliance as our Text and Data Mining Legal Fellow. Christian holds a PhD in English Language and Literature from the University of Virginia and is currently pursuing a JD from the UC Berkeley School of Law. Christian has extensive digital humanities and text data mining experience, including in previous roles at UVA and Bucknell University. Her work with Authors Alliance will focus on researching and writing about the ways that current law helps or hinders text and data mining researchers in the real world. 

The research portion of this project is focused on the practical implications of the law and will be based heavily on feedback we hear from TDM researchers. We’ve already had the opportunity to gather some feedback from researchers including through the workshops mentioned above, and plan to do more systematic outreach over the coming months. Again, if you’re working in this field (or want to but can’t because of concerns about legal issues), we’d love to hear from you. 

At this stage we want to share some preliminary observations, based on recent research into these issues (supported by the work of several teams of student clinicians) as well as our recent and ongoing work with TDM researchers:

1) License restrictions are a problem. We’ve heard clearly that licenses and terms of use impose a significant barrier to TDM research. While researchers are able to identify uses that would qualify as fair use and also many uses that likely qualify under the DMCA exemption, terms of use accompanying ebook licenses can override both. These terms vary from very specific prohibitions (e.g., Amazon’s, which says that users “may not attempt to bypass, modify, defeat, or otherwise circumvent any digital rights management system”) to more general prohibitions on uses that go beyond the specific permissions of the license (e.g., Apple’s terms, which state that “No portion of the Content or Services may be transferred or reproduced in any form or by any means, except as expressly permitted”). Even academic licenses, often negotiated by university libraries to have more favorable terms, can still impose significant restrictions on reuse for TDM purposes. Although we haven’t heard of aggressive enforcement of those terms to restrict academic uses, even the mere existence of those terms can have chilling and negative real-world impacts on research using TDM techniques.

The problem of licenses overriding researchers’ rights under fair use and other parts of copyright law is of course not limited to inhibiting text and data mining research. We wrote about the issue, and how easy it is to evade fair use, a few months ago, discussing the many ways that restrictive license terms can inhibit normal, everyday uses of works such as criticism, commentary and quotation. We are currently working on a separate paper documenting the scope and extent of “contractual override,” and will be part of a symposium on the subject in May, hosted by the Association of Research Libraries and the American University, Washington College of Law Program on Information Justice and Intellectual Property.

2) The TDM exemption is flexible, but local interpretation and support can vary. We’ve heard that the current TDM exemption–allowing researchers to break technological protection measures such as DRM on ebooks and CSS on DVDs–is an important tool to facilitate research on modern digital works. And we believe the terms of that exemption are sufficiently flexible to meet the needs of a variety of research applications (how wide a variety remains to be seen through more research). But local understanding and support for researchers using the exemption can vary. 

For example, the exemption requires that the university that the TDM research is associated with implement “effective security measures” to ensure that the corpus of copyrighted works isn’t used for another purpose. The regulation further explains that in the absence of a standard negotiated with content holders, “effective security measures” means “measures that the institution uses to keep its own highly confidential information secure.” University IT data security standards don’t always use the same language or define a category of “highly confidential information,” so university IT offices must interpret this language and implement the standard in their own local context. This can create confusion about what precisely universities need to do to secure the TDM corpora.

Some of these definitional issues are likely growing pains: the exemption is still new, and universities need time to understand and implement standards that satisfy its terms in a reasonable way. It will be important to explore further where there is confusion about similar terms and how that might best be resolved.

3) Collaboration and sharing are important. Text and data mining projects are often conceived of as part of a much larger research agenda, with multiple potential research outputs both from the initial inquiry and follow-up studies with a number of researchers, sometimes from a number of institutions. Fair use clearly allows for collaborative TDM work: in Authors Guild v. HathiTrust, a foundational fair use case for TDM research in the US, the entire structure of HathiTrust is a collective of research institutions with shared digital assets. And likewise, the TDM exemption permits a university to provide access to “researchers affiliated with other institutions of higher education solely for purposes of collaboration or replication of the research.” The collaborative aspect of this work raises some challenging questions, both operationally and conceptually. For example, the exemption for breaking digital locks doesn’t define precisely who qualifies as a researcher who is “affiliated,” leaving open questions for universities implementing the regulation. More conceptually, the issue of research collaboration raises questions about how precisely the TDM purpose must be defined when building a corpus under the existing exemption, for example when researchers collaborate but investigate different research questions over time. Finally, the issue of actually sharing copies of the corpus with researchers at other institutions is important because at least in some cases, local computing power is needed to effectively engage with the data.

Again, just preliminary research, but some interesting and important questions! If you are working in this area in any capacity, we’d love to talk. The easiest way to reach us is at info@authorsalliance.org.

Want to Learn More?
This current Authors Alliance project is generously supported by the Mellon Foundation, which has also supported a number of other important text and data mining projects. We’ve been fortunate to be part of a broader network of individuals and organizations devoted to lowering legal barriers for TDM researchers. This includes efforts spearheaded by a team at UC Berkeley to produce the “Legal Literacies for Text Data Mining” and its current project to address cross-border TDM research, as well as efforts from the Global Network on Copyright and User Rights, which has (among other things) led efforts on copyright exceptions for TDM globally.