Scholars are increasingly using text data mining to uncover previously unknown patterns, trends, or relationships in collections of textual documents. In doing so, many of these researchers may be accessing, building, working with, and sharing materials without understanding the legal implications of their actions. In their newly released chapter, Law and Literacy in Non-Consumptive Text Mining: Guiding Researchers Through the Landscape of Computational Text Analysis (in Copyright Conversations: Rights Literacy in a Digital World), Rachael G. Samberg and Cody Hennesy analyze the legal issues that can arise when researchers engage in text data mining and provide guidance on how to approach these issues.
As Samberg and Hennesy write, “currently, many [ ] researchers programmatically access and download copyright-protected works—even when it potentially violates copyright, licenses, privacy, or computer fraud law—because it is technically feasible. Few of these researchers are malicious in intent; rather, they may lack the necessary training or support to safely navigate the obscure regulatory environment of the field.”
Samberg and Hennesy’s survey of copyright and other legal issues affecting text data mining addresses:
- Copyright and Fair Use: Samberg and Hennesy review several cases in which courts considered whether full-text searching of a corpus qualifies as fair use and found non-consumptive text mining to be fair. They caution that although creating and using a database for text data mining may be fair use in some cases, further publishing that database may exceed the bounds of fair use.
- Contract Law: Samberg and Hennesy discuss how contract law may define how researchers can access materials and what use they can make of them, and may even supplant fair use rights. They review the effect of database license agreements, website terms of service, and agreements with archives and special collections on text data mining.
- Ethics: Samberg and Hennesy consider best practices for responding to requests from web hosts about scraping content from their sites.
Samberg and Hennesy use this framework to define literacies for researchers based on three stages of outreach and education further articulated in the chapter: use of precompiled corpora, corpus creation, and corpus publishing. The authors conclude that the key literacy is for researchers “to understand the need for a workflow [ ] and to explore a tailored approach in consultation with their librarians.”
This chapter is a valuable contribution toward helping scholars who use text data mining acquire the skills they need to understand and approach the legal implications of their work. Law and Literacy in Non-Consumptive Text Mining: Guiding Researchers Through the Landscape of Computational Text Analysis is available to read in full under a CC-BY license.
____________________________________________________________
Thanks to a grant from the National Endowment for the Humanities (NEH), Rachael Samberg will lead a national team to help humanities researchers and staff navigate complex legal questions in cutting-edge digital research. The institute, Building Legal Literacies in Text Data Mining, will teach humanities researchers, librarians, and research staff how to confidently navigate the major legal issues that arise in text data mining research. Authors Alliance Executive Director Brianna Schofield will take part in the institute in her capacity as a copyright expert. A call for participants is currently open; applications are due December 20, 2019.