Tag Archives: text and data mining

AI Licensing: An Interview with Ben Denne of Cambridge University Press

Posted March 17, 2025

We’ve heard from lots of authors with questions about AI licensing of their works by their publishers. Cambridge University Press is one that has been in the news because it has undertaken a project to ask authors to opt into a contract addendum that would allow CUP to license AI rights for their books, giving authors a royalty on AI licensing net revenue. Cambridge has shared an FAQ with authors already, along with a further explanation of its approach last September and a report in January highlighting that it had contacted some 17,000 authors, the majority of whom have opted in. 

Below is an interview with Ben Denne, Director of Publishing, Academic Books, at Cambridge University Press, answering some questions about the program. 

Dave: Thank you, Ben, for talking with me. To start off, could you say what your role is at Cambridge University Press?

Ben: I’m the Director of Publishing for the Academic Books part of the Academic Division of Cambridge. In short, I’m the director overseeing the whole of the books program, the Academic Books program for Cambridge, except for the Bibles. That’s a specialist unit that runs separately that I don’t have anything to do with, but that means our textbooks, our research and reference books, and then we have a kind of small program of more traditional academic titles that sell a bit more to a bit of a wider audience.

Dave:  Thanks. My interest in talking with you is about generative AI licensing. And we’ve had quite a few authors actually forward us some emails that they’ve gotten from Cambridge presenting an AI license addendum to sign that goes with their contract and also an FAQ. I’d like to ask just a few questions about how that’s going and how that works.

What are Cambridge University Press’ goals with AI licensing?

Ben:  That’s a really good question. Broadly speaking, this started to come our way a couple of years ago, at the same time as this subject became really noisy. We were looking at it and thinking, what’s the best way through this? How do we appropriately engage in this conversation? And I think it came back to us thinking about encouraging responsible use and thinking about our role as an academic publisher.

And I think our role as an academic publisher is to push the academic debate forward, which means that we want our authors’ books to get read. We want them to get used. We want them to get cited. I think that’s really the spirit we came into this conversation with: these developments are happening, right, they’re happening anyway, and the best thing we can do as a publisher is try and engage with this debate and push it in a direction that we think really helps to underline those principles of how good research is done.

Dave:  One of the things that I’ve seen with CUP’s rollout with this asking authors is, first of all, that you are asking authors.  Could you talk me through that decision? We’ve seen some other publishers in the news just announce that they have licensing deals with technology companies, and there was no outreach to authors as far as we can tell from those publishers. So could you talk through that thought process of this outreach?

Ben: Sure, so for us, when we first looked at this, we have a contract that authors sign, which is, probably in many ways, very similar to contracts that they signed with other publishers, and it includes all sorts of clauses about use and wide ranging licensing rights.  And one of the things it covers is derivative uses for content and the right to make derivatives. Our sense with that is when we looked at this in the context of these AI conversations and licensing, from a legal perspective, we looked at that and thought, well actually that derivative use clause technically does cover us for this kind of work. And I’m sure that’s the conclusion that some other people have reached too.

But we also thought, it just feels a bit like nobody knew that this kind of technology was emerging when they signed those contracts. And so from our perspective, we thought there’s a lot of noise about this subject in the whole ecosystem right now, you know, you can’t read the news without reading about AI, and people are nervous about it, understandably, and all of those kinds of things. So we felt that we should treat this as additional consent and approach it in that spirit. And that really underpins the decision to go out with the addendum for existing contracts.

I don’t want to jump onto any of your other questions, but that kind of principle, that we were going to ask for opt-ins, was important. Authors have to actively opt into this. We’re not saying to them, “If we don’t hear from you, we’ll assume you’ve opted in.” They have to actually come back to us and say that they’re happy for that use to happen.

Dave:  I think one of the things a lot of people don’t think about is how complicated rights clearance is, especially at scale, across a title list that is the size that you have. So this seems to me like a pretty big investment in just doing this process. Could you say how many of these  you have sent out? I gather that you’re doing this in batches, but do you have a sense of the scale of how many author addendum requests you anticipate making over the course of however long this process lasts?

Ben: It’s a really good question and it’s a moving target. At this stage, we have sent out multiple thousands. But I think we have about 45,000 books available in print and digitally at the moment. And we’re working our way through that list systematically. So we’re in the thousands, and you’re right, it is a pretty big undertaking. You know, it’s quite a logistical challenge to do. We had to set up a whole kind of new workflow for doing this. We have a team that are working on the addenda and addressing the questions that authors have and all of those kinds of things.

Dave: This is maybe getting in the weeds, but it seems to me like there’s a pretty big difference between figuring this out for a sole-authored, single-part monograph, for instance, which is mostly what I’ve seen come through, and edited volumes. Have you tried to figure out those more complex books with multiple authors, multiple works within them?

Ben: Yeah, so the way it’s working for us is where we have several contracted authors for a book, we’re contacting them all and all of those authors have to opt in in order for us to agree that we have the licensing rights.

For edited volumes with multiple contributors, we’re not contacting the individual contributors for opt-in and there are a couple of different reasons for that. Typically, they don’t get paid royalties and also it would just be impossible for us to do. I mean, that’s logistically, you know, that’s a huge ask. So what we are doing is we are still contacting the editors for those volumes and the editors will opt in or not. So if the editor opts in, our understanding is that they’re opting in on behalf of the contributors as well.

But for multi-authored works, we get in touch with all of them. And in fact, we have quite a sizable number of books which are stuck because we’ve had some authors opt in and some authors not opt in.

Dave: This is a pretty fast-moving technology and I think a lot of authors are feeling just uncertain right now. And so I wonder about the opt-in window, if an author declines to opt in right now, is that it? Is there an opportunity to come back later after the dust settles and say, oh, no, actually, you know, I’d be happy to have my work used in this way? 

Ben:  Yeah, definitely. We’re in the process of putting something in place so that if authors don’t opt in now they are able to come back and opt in later. And by the way, if they don’t opt in, that’s fine for all the reasons that you just said; some people are queasy about this and that’s okay. We’re not trying to, we’re not putting a hard sell on it.

My sense with this is that for some of the people that we’re speaking to who haven’t opted in, it is because they haven’t yet really seen what the kind of use cases are for this kind of technology. Perhaps as those become more public, people will want to come back and opt in. 

I think some of the things that are out there are going to be quite powerful discovery tools in the future. So we want to make sure the authors do have the opportunity to opt in later if they want to, although we can’t, of course, be sure that if people opt in later the same opportunities will necessarily be available then, since this is quite a fast moving area.

Dave:  For your contracts moving forward for front list books, is a clause like this now a default in those agreements or will authors of new books have the option to opt-in or opt-out for AI licensing?

Ben: Good question. Currently, we have put a clause into our contracts to add AI licensing. But, where authors are asking us to remove that clause, we’re taking it out.

And again, coming back to your point before, those authors could opt in later. But for the contracts as they go out, we have it in as a clause now.

Dave:  Okay. So let’s shift to if you’re gathering all of these rights from authors, presumably at some point, then you would actually engage in the licensing with technology companies or others. Could you say a little bit about that? Do you have any deals in place with tech companies already? Or, the other thing that I’ve seen is, some publishers have been in the position of not doing those deals directly, but having sort of sub-licensing deals with others. I understand ProQuest Clarivate is doing this. And I think Wiley is as well. Do you have any of those deals in place now?

Ben:  We’re still having those conversations at the moment. And we are talking to a range of different people who are looking at this kind of content. 

Dave:  Okay, that’s really helpful to know.

At the beginning, you talked a little bit about Cambridge University Press’s motivations with engaging in this space and doing licensing. Could you talk a little bit about important factors for what might show up in one of those kinds of deals with tech companies?  For instance, one of the things that I think aligns with the sort of values that you outlined at the beginning and that authors care a lot about is credit, right? We know that, especially for academic authors, credit is incredibly valuable and important. And so I wonder if you’ve thought about how ensuring author credit might factor into any sort of downstream deal that CUP might engage in?

Ben: Absolutely. So we’re having exactly those conversations at the moment with anybody that we’re talking to. And we’ve been very clear with our authors when they’ve asked questions about this, and you may have seen this alluded to in some of the information that you’ve had forwarded to you from authors, that those principles of attribution are 100% what we’re focused on.  Really, they’re kind of a red line for us. 

One of the things that has come up in lots of our conversations with people around this technology is the question of at what level content needs to be attributed. Our sense with this is that any kind of meaningful extract from somebody else’s work needs to be cited.

I’m kind of repeating myself, but that’s how research works. People build on other people’s work, and so in a scenario where content is being ‘discovered’, if we can’t identify and cite that content, it can’t be accurately attributed. So that’s a red line for us.

Dave: Right. I think figuring out that attribution, like at what level does that attribution need to kick in, is a really tricky thing. It seems to me that if you’ve got a foundation model that is pulling in some texts, and then someone’s using, say, ChatGPT to write emails, and somewhere in the model it gleans some structural components from sources like academic books, I don’t think that’s the thing most authors care about: being cited for the fact that you helped train this model to understand how to format citations or do other things like that. It’s the intellectual content that matters and that’s the really tricky piece of it.

Ben: Absolutely and I don’t have an easy answer for you there. So we’re having those conversations at the moment, but our sense is that any sort of direct quote, anything that could be, you know, anything that you would consider to be plagiarism or worthy of credit in a non-AI world should be attributed.

Dave:  I realize this question is asking a hypothetical because you don’t have any of these agreements in place yet, but it seems to me there’s a pretty big difference between use of Cambridge books for model training and uses such as for Retrieval Augmented Generation (RAG).

Have you thought about those distinctions in terms of how that might affect differences in Cambridge’s willingness to set a price on those things? I assume RAG would come with a higher licensing price than other uses. But could you talk me through that thought process?

Ben: So it’s kind of interesting because I think there’s a little bit of a gray area, because I think a lot of the RAG tools are combined with some aspect of LLM. So they might be looking to summarize some research or write a brief about X, Y, and Z.

I think it is quite interesting at the moment that most of the questions we get from people who are worried about this are really anxious about LLMs, but I feel like the really exciting place for academia and research is around that kind of retrieval augmented generation because that’s what’s going to help with discoverability for authors.  It is difficult to talk about at the moment because we don’t have any public deals that I can point to. But I’d say a lot of the conversations that we’re having are somewhere between those two things, you know, so it’s a combination of an  LLM that’s generating text and a citation engine or discovery engine sitting over content.

Dave: Leaving aside the legal situation for a moment, one of the things that I hear from authors pretty consistently is the sentiment that with these big technology companies coming in, they feel that these companies are sort of profiting off of content that they are exploiting. And so they ought to return something to the system and to authors.

But there’s a really different sentiment about what happens when you have, say, academic researchers using content for AI or text data mining purposes to make new discoveries or learn new things both about the texts and about the world around them. We work a lot with text data mining researchers who are interested in large aggregations of content, not so they can build the next OpenAI, but so they can understand how language has changed over time, or how culture has changed over time.

I wonder from CUP’s perspective, how do those two different kinds of use cases factor into your thinking about downstream licensing deals for AI/text data mining?

Ben: Yeah, of course the whole thing is not quite that clear cut, because a lot of the time it’s the big tech companies that are facilitating a lot of that discovery, or a lot of the discovery traffic goes through them. So I think from our perspective, I’m going to say we’re not ruling out working with anyone. We would put anybody, any partner that we had, through the same diligence process that we would have with onboarding anybody else, but we wouldn’t rule out those conversations with anybody. I think for us, the most important thing is coming back to, and I’m going to sound like a stuck record here, those principles of attribution. And we have had conversations, some preliminary conversations with people who’ve said, “Well, we don’t think it would be possible to do what you’re asking,” and at that point, we’re saying, “Well, okay, then you know that’s the red line for us.”

I think there’s quite a bit of cloudy territory between those two things. And I think for us, the most important thing is to make sure that authors are being credited where their work’s being used.

Dave:  All right, I have a hypothetical that I wanted to give you. So we see that it’s a 20% royalty calculated on net revenue. Let’s say you received $5 million from an AI licensing deal. Can you walk me through how that might work out for the author? How do you calculate net revenue on that? And then, for the individual author sitting there who sees that CUP has signed a big deal, what can they expect?

Ben: That’s a tricky one because it would depend a little bit on the terms of the deal as well. But broadly speaking, the principle is, if that’s the net revenue that we receive (so in your situation, you had five million in there), the full licensing payment is divided out across the list of titles. Authors then earn the royalty for that sale or license type per title, as they do now with all other forms of licensing.

But, then, where a licensee can provide accurate title-level usage within their royalty statements, this would instead be used. So in an LLM situation that you were just talking about, that would be divided among those books.  With the retrieval augmented generation tool, I think that would work much more around the basis of usage. So, depending on what searches within that tool were bringing back particular content, then we would be attributing revenue that way.
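
[note: to make the hypothetical concrete, here is a minimal sketch of the two allocation approaches Ben describes. It is an illustration under stated assumptions, not Cambridge’s actual accounting. The 20% rate and the $5 million figure come from the question above; the equal per-title split, the size of the title list, the usage counts, and the function names are all hypothetical. How a title’s royalty is shared among co-authors is not covered in the interview and is not modeled.]

```python
from fractions import Fraction

# 20% royalty on net AI licensing revenue, as described in the interview.
ROYALTY_RATE = Fraction(20, 100)


def equal_split_royalties(net_revenue, titles, royalty_rate=ROYALTY_RATE):
    """Assumption: with no usage data, every licensed title gets an equal share."""
    if not titles:
        raise ValueError("at least one title is required")
    share_per_title = Fraction(net_revenue) / len(titles)
    return {title: float(share_per_title * royalty_rate) for title in titles}


def usage_weighted_royalties(net_revenue, usage_by_title, royalty_rate=ROYALTY_RATE):
    """Assumption: revenue follows each title's share of licensee-reported usage."""
    total_usage = sum(usage_by_title.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {
        title: float(Fraction(net_revenue) * Fraction(usage) / total_usage * royalty_rate)
        for title, usage in usage_by_title.items()
    }


if __name__ == "__main__":
    # Hypothetical list of 10,000 licensed titles sharing a $5 million payment.
    titles = [f"title-{i}" for i in range(10_000)]
    print(equal_split_royalties(5_000_000, titles)["title-0"])  # 100.0

    # Hypothetical usage counts for a retrieval-style tool.
    print(usage_weighted_royalties(1_000, {"book-a": 3, "book-b": 1}))
```

Under these assumptions, $5 million spread evenly across 10,000 titles puts $500 against each title, of which the 20% royalty is $100. With usage reporting, a title that accounts for three quarters of the reported usage would earn three quarters of the royalty pool instead.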

Dave: Okay, that makes a lot of sense. I think this was in the FAQ: one of your use cases is in an authoritative database that’s used on a perpetual basis. But there was somewhere that talked about the removal of content once a licensing term has ended. I wonder if you’ve developed thinking internally about what a standard term would be, how long these things might last? 

Ben:  Yeah, I mean, it’s hard, isn’t it? Because where you’re licensing content to train an LLM, it would be sort of insincere to dress that up. Generally, most agreements would be governed by a 2-5 year training term, and at the end of that term the training data set would be destroyed; however, they would retain the output from the specific models that were developed during the training term. If they wanted to create new models, they would need to renew the license or extend the term.

For some of the other uses, that’s all being discussed at the moment. I think there is still work on this, but there would be standard partnership length terms. What I would say is that from our perspective, we think it’s quite likely that in the next few years the focus will move more away from training large language models and into that area of discovery, and that these are going to become quite important revenue streams for academic publishers.

Dave: Thanks, very helpful. As you work on these deals, what level of transparency do you plan on offering authors or the general public about what these licenses might look like? At least with other publishers, it’s been quite mysterious – I think with one, we learned about an AI licensing deal in a quarterly earnings report, for instance. I think authors do really care about what the details of these deals look like. 

Ben:  It’s tricky, isn’t it? It’s hard for me to talk about a deal that hasn’t been done already, and of course, these deals can be subject to the same commercial confidentiality requirements as any other partnership. But I think it’s fair to say that Cambridge University Press would endeavor to be pretty transparent about what we’re doing generally and most importantly, be transparent about why we’re doing it. So I don’t think we’d be concealing that information from anybody. And coming back to my point before, we’ve been quite clear that we only want to enter into these kinds of conversations with people that we think are using content responsibly, and we’d always aim to be open.

Dave: A few final questions. First, CUP has published a number of open-access books. For example, I believe CUP was part of the TOME initiative.  Do you feel like this kind of addendum is necessary for those open-access books, given that they already have some sort of open license attached to them? Or do you think that this is a necessary addition to those OA licenses? 

Ben: That’s a really good question, and it’s something that we’re grappling with at the moment. Without getting into the kind of weeds around open access, some of it depends on the license. Historically for books, our default open access license was a Creative Commons CC BY NC license, which prohibits commercial reuse. I think at the moment, we’re looking at that (and I think a lot of publishers would say the same thing) and working through how that fits with AI licensing with commercial AI companies. The short answer to your question is if you have a CC BY license, then, people do have a broad license to reuse that content. So at the moment, we’re not actively going after those authors for opt-ins, nor are we including those books in licensing deals that we’re doing, but that’s also a relatively small number of books.

I can say, we are now looking at using more CC-BY-NC-ND as the default, which restricts the creation of derivative works. You’ve touched on a conversation that is evolving, but we would be treating AI usage as requiring a derivative license and therefore not covered under a CC-BY-NC-ND license.

Dave:  Thanks, that’s very helpful and I think that’s something a lot of authors are trying to figure out: how does AI downstream use factor into Creative Commons licensed works? And of course, the underlying legal situation matters. I didn’t ask, but I assume that the rights that you’re asking for in this addendum are worldwide, since that affects for example whether usage might be permitted under national law. 

Ben: Yes, the rights are worldwide. 

And thinking again about that, I mean, it’s interesting, isn’t it? Because even under the CC-BY license, it doubles down on that principle of attribution as well. That’s the nature of the license so some uses even then may not be covered by that license.

Dave:  Right. That attribution piece under the CC-BY license will be an important one [note: this issue is being litigated, most prominently in the Doe v. GitHub suit]. And then, there’s also the underlying question of what the law allows independently even if there is no license, open license or otherwise. I know right now there’s a consultation that just closed in the UK about what the law should be, and in the US, we’re fighting these things out in the courts. I think there are 39 lawsuits right now pending about various aspects of this, and a key question in most of them is just how far fair use goes. And of course, you know, if fair use applies then you don’t have to worry too much about what the license says, whether it’s CC BY or CC BY NC ND or anything else. This is like reading tea leaves but I think the prevailing case law indicates that model training and coming up with the weights has a pretty strong fair use case, but for the output side, that’s where I think it starts to stumble a little bit when you’ve got systems that are producing outputs that are substantially similar to the inputs. So I wouldn’t be surprised if in some of these suits, we get a ruling in favor of fair use and then in some of them we get a different outcome. And then, the landscape is just sort of messy.

And I suppose in the UK, I imagine y’all are watching what that legal landscape looks like around the world as it’s changing.

Ben:  Yeah, absolutely. 

Dave: One final question: we’ve talked a lot about licensing books for AI, but CUP has a substantial journal portfolio as well. Can you say anything about CUP’s approach to use of journal content either as AI training data or for other AI uses? 

Ben: We’ve been more focussed on books, as this is where most of the demand has been to date, but we have seen a developing interest in journal content. We are, therefore, currently exploring this form of licensing in a consultative way with our journal partners. 

Dave:  Well, thank you for talking. And this was really, really helpful. And I think that this will be useful for authors who are trying to understand just more about what’s going on.

Ben:  It’s been a pleasure.

Authors Alliance Comment on US AI Action Plan

Posted March 14, 2025

Today, we submitted a response to a Request for Information from the Office of Science and Technology Policy (OSTP). The OSTP is seeking to develop an “AI Action Plan,” to sustain and accelerate the development of AI in the United States.  As an organization dedicated to advancing the interests of authors who wish to share their works broadly for the public good, we felt it imperative to weigh in on critical copyright and policy issues impacting AI innovation and access to knowledge.

In our response, we reaffirmed our belief that the use of copyrighted works specifically for AI training (distinct from other AI uses) is a quintessential fair use. We noted that Section 1202(b) of the Copyright Act has little utility and serves as an unnecessary stumbling block to the development of AI. We also highlighted the importance of high quality training data and pointed towards the work that is already being done to develop AI training corpora.  

A Few Key Points from Our Submission

Our response to the OSTP highlights several key areas where federal policy can support both authors and a thriving AI research environment:

1. The Role of Fair Use in AI Model Training

We emphasize that fair use has long been a cornerstone of innovation in the U.S., enabling everything from web search engines to digitization projects. US copyright law has played a major role both in developing the incredible creative industries based in the US and in driving leading scientific research and commercial innovation. The key to this innovation policy has been a thoughtful balance between providing a degree of control over copyrighted works to copyright holders while allowing for flexibility when it comes to technological innovation and new transformative uses. AI development relies on the ability to analyze large datasets, many of which include copyrighted materials. The uncertainty surrounding the legal status of AI training data due to ongoing litigation threatens to slow innovation. We urge the federal government to explicitly support the application of fair use to AI training and provide much-needed clarity.

2. Addressing the Contractual Override of Fair Use

Many AI developers face contractual barriers that limit their ability to make fair use of content, particularly in text and data mining applications. We recommend legislative measures to prevent contracts from overriding fair use rights, ensuring that AI researchers and developers can continue innovating without undue restrictions.

3. Access to High Quality Datasets

Access to high-quality datasets is a foundational pillar for AI development, enabling models to learn, refine, and iteratively improve. However, the availability of such datasets is often hindered by restrictive licensing agreements, proprietary controls, and inconsistent data standards. To maximize the potential of AI while ensuring ethical and legally sound development, collaborations between academic institutions, libraries, public archives, and technology developers are essential. Government policies should facilitate public-private partnerships that allow for robust and thoughtfully curated datasets, ensuring that AI systems are trained on a rich range of representative materials.

We invite our community of authors, researchers, and policymakers to review our submission. Your engagement is crucial in shaping a responsible and forward-thinking AI policy in the U.S. You can always reach us at info@authorsalliance.org.

Updates on AI Copyright Law and Policy: Section 1202 of the DMCA, Doe v. GitHub, and the UK Copyright and AI Consultation

Posted March 7, 2025
Some district courts have applied DMCA 1202(b) to physical copies, including textiles, which means that if you cut off parts of a fabric that contain copyright information, you could be liable for up to $25,000 in damages.

The US Copyright Act has never been praised for its clarity or its intuitive simplicity. At a whopping 460 pages, it is filled with hotly debated ambiguities and overly complex provisions. The copyright laws of most other jurisdictions aren’t much better.

Because of this complexity of copyright law, the implications of changes to copyright law and policy are not always clear to most authors. As we’ve said in the past, many of these issues seem arcane, and largely escape public attention. Yet entities with a vested interest in maximalist copyright—often at odds with the public interest—are certainly paying attention, and often claim to speak for all authors when they in fact represent only a small subset.  As part of our efforts to advocate for a future where copyright law offers ample clarity, certainty, and real focus on values such as the advancement of knowledge and free expression, we would like to share with you two recent projects we undertook:

The 1202 Issue Brief and Amicus Brief in Doe v. GitHub

Authors Alliance has been closely monitoring the impact of Digital Millennium Copyright Act (DMCA) Section 1202. As we have explained in a previous post, Section 1202(b) creates liability for those who remove or alter copyright management information (CMI) or distribute works with removed CMI. This provision, originally intended to prevent widespread piracy, has been increasingly invoked in AI copyright lawsuits, raising significant concerns for lawful use of copyrighted materials beyond training AI. While on its face, penalties for removing CMI might seem somewhat reasonable, the scope of CMI (including a wide variety of information such as website terms of service, affiliate links, and other information) combined with the challenge of including it with all downstream distribution of incomplete copies (imagine if you had to replicate and distribute something like the Amazon Kindle terms of service every time you quoted text from an ebook) could be very disruptive for many users.

In order to address the confusion regarding the (somewhat inaptly named) “identicality requirement” among the courts in the 9th Circuit, we have released an issue brief, as well as undertaken to file an amicus brief in the Doe v. GitHub case now pending in the 9th Circuit.

Here are the key reasons why we care—and why you should care—about this seemingly obscure issue:

  • The Precedential Nature of Doe v. GitHub: The upcoming 9th Circuit case, Doe v. GitHub, will address whether Section 1202(b) should only apply when copies made or distributed are identical (or nearly identical) to the original. Lower courts have upheld this identicality requirement to prevent overbroad applications of the law, and the appellate ruling may set a crucial precedent for AI and fair use.
  • Potential Impact on Otherwise Legal Uses: It is not entirely certain if fair use is a defense to 1202(b) claims. If the identicality requirement is removed, Section 1202(b) could create liability for transformative fair uses, snippet reuse, text and data mining, and other lawful applications. This would introduce uncertainty for authors, researchers, and educators who rely on copyrighted materials in limited, legal ways. We advocate for maintaining the identicality requirement and clarifying that fair use applies as a defense to Section 1202 claims. 
  • Possibility of Frivolous Litigation: Section 1202(b) claims have surged in recent years, particularly in AI-related lawsuits. The statute’s vague language and broad applicability have raised fears that opportunistic litigants could use it to chill innovation, scholarship, and creative expression.

To find out more about what’s at stake, please take a look at our 1202(b) Issue Brief. You are also invited to share your stories with us, on how you have navigated this strange statute. 

Reply to the UK Open Consultation on Copyright and AI

We have members in the UK, and many of our US-based members publish in the UK. We have been watching developments in UK copyright law closely, and have recently filed a comment to the UK Open Consultation on Copyright and AI. In our comment, we emphasized the importance of ensuring that copyright policy serves the public interest. Our response’s key points include:

  • Competition Concerns: We alerted policy-makers that their top objectives must include preventing monopolies from forming in the AI space. If licensing for AI training becomes the norm, we foresee power consolidating in a handful of tech companies and their unbridled monopoly permeating all aspects of our lives within a few decades, if not sooner.
  • Fair Use as a Guiding Principle: We strongly believe that the use of works in the training and development of AI models constitutes fair use under US law. While this issue is currently being tested in courts, case law suggests that fair use will prevail, ensuring that AI training on copyrighted works remains permissible. The UK does not have an identical fair use statute, but has recognized that some of its functions—such as flexibility to permit new technological uses—are valuable. We argue that the wise approach is for the UK to update its laws to ensure its creative and tech sectors can meaningfully participate in the global arena. Our comment called for a broad AI and TDM exception allowing temporary copies of copyrighted works for AI training. We emphasized that when AI models extract uncopyrightable elements, such as facts and ideas, this should remain lawful and protected. 
  • Noncommercial Research Should Be Protected: We strongly advocated for the protection of noncommercial AI research, arguing that academic institutions and their researchers should not face legal barriers when using copyrighted works to train AI models for research purposes. Imposing additional licensing requirements would place undue burdens on academic institutions, which already pay significant fees to access research materials.

Fair Use, Censorship, and Struggle for Control of Facts

Posted February 27, 2025
Caption: 451 is the HTTP status code returned when a webpage is unavailable for legal reasons; it is also the temperature at which books catch fire and burn. This public domain image is taken inside the Internet Archive.

Imagine this: a high-profile aerospace and media billionaire threatens to sue you for writing an unauthorized and unflattering biography. In the course of writing, you rely on several news articles, including a series of in-depth pieces about the billionaire’s life written over a decade earlier. Given their closeness in time to real events, you quote, sometimes extensively, from those articles in several places. 

On the eve of publication, your manuscript is leaked. Through one of his associated companies, the billionaire buys up the copyrights to the articles from which you quote. The next day the company files an infringement lawsuit against you. 

Copyright Censorship: a Time-Honored Tradition

It’s easy to imagine such a suit brought by a modern billionaire—perhaps Elon Musk or Jeff Bezos. But using copyright as a tool for censorship is a time-honored tradition. In this case, Howard Hughes tried it out in 1966, using his company Rosemont Enterprises to file suit against Random House for a biography it would eventually publish.

As we’ve seen many times before and since, the courts turned to copyright’s “fair use” right to rescue the biography from censorship. Fair use, the court explained, exists so that “courts in passing upon particular claims of infringement must occasionally subordinate the copyright holder’s interest in a maximum financial return to the greater public interest in the development of art, science and industry.” 

Singling out the biographical nature of the work and its importance in surfacing underlying facts, the court explained: 

Biographies, of course, are fundamentally personal histories and it is both reasonable and customary for biographers to refer to and utilize earlier works dealing with the subject of the work and occasionally to quote directly from such works. . . . This practice is permitted because of the public benefit in encouraging the development of historical and biographical works and their public distribution, e.g., so “that the world may not be deprived of improvements, or the progress of the arts be retarded.”

Fair use playing this role is no accident. As the Supreme Court has explained, the relationship between copyright and free expression is complicated. On the one hand, the Court has explained,  “[T]he Framers intended copyright itself to be the engine of free expression. By establishing a marketable right to the use of one’s expression, copyright supplies the economic incentive to create and disseminate ideas.” But, recognizing that such exclusive control over expression could chill the very speech copyright seeks to enable, the law contains what the Court has described as two “traditional First Amendment safeguards” to ensure that facts and ideas remain available for free reuse: 1) protections against control over facts and ideas, and 2) fair use. 

But rescuing a biography that merely quotes, even extensively, from earlier articles seems like an easy call, especially when it seems so clear that the plaintiff has engineered the copyright suit not to protect legitimate economic interests but to suppress an unpopular narrative.

The world is a little more complicated now. Can fair use continue to protect free expression from excessive enforcement of copyright? I think so, but two key areas are at risk: 

Fair Use and the Archives

It may have escaped your notice that large chunks of online content disappear each year. 

For years, archivists have recognized and worked to address the problem. Websites going dark is an annoyance for most of us, but in some cases, it can have real implications for understanding recent history, even as officially documented. For example, back in 2013, a report revealed that well over half of the websites linked to in Supreme Court opinions no longer work, jeopardizing our understanding of just what went into why and how the Court decided an issue.  

While most websites disappear from benign neglect, others are intentionally taken down to remove records from public scrutiny.  Exhibit A may be the 8,000+ government web pages recently removed by the new presidential administration, but there are many other examples (even whole “reputation management” firms devoted to scrubbing the web of information that may cast one in an unfavorable light). 

The most well-known bulwark against disappearing internet content is the Internet Archive, which has, at this point, archived over 900 billion web pages. Over and over again, we’ve seen its Wayback Machine used to shine a light on history that powerful people would rather have hidden. It’s also why the Wayback Machine has been blocked or threatened at various times in China, Russia, India, and other jurisdictions where free expression protections are weak.

It’s not just the open web that is disappearing. A recent report on the problem of “Vanishing Culture” highlights how this challenge pervades modern cultural works. Everything from 90s shareware video games to the entirety of the MTV News Archive is at risk. As Jordan Mechner, a contributor to the report, explains, “historical oblivion is the default, not the exception” to the human record. As the report explains, it’s not just disappearing content that poses a problem: libraries and consumers must also grapple with electronic content that can be remotely changed by publishers or others. As just one example among many, in just the last few years we’ve seen surreptitious modifications to ebooks on readers’ devices, some changing important aspects of the plot, for works by authors such as RL Stine, Roald Dahl, and Agatha Christie.

The case for preservation as a foundational necessity to combat censorship is straightforward. “There is no political power without power over the archive,” Jacques Derrida reminds us. Without access to a stable, high-fidelity copy of the historical record, there can be no meaningful reflection on what went right or wrong, or holding to account those in power who may oppose an accurate representation of their past. 

What sometimes goes unnoticed is that, without fair use, a large portion of these preservation efforts would be illegal. 

In a world where century-long copyright protection applies automatically to any human expression with even a “modicum of creativity,” virtually everything created in the last century is subject to copyright. This is a problem for digital works because practically any preservation effort involves making copies—often lots of them—to ensure the integrity of the content. Making those copies means that archivists must rely on fair use to preserve these works and make them available in meaningful ways to researchers and others. 

The upshot is that every time the Internet Archive archives a website, it’s an act of faith in fair use. Is that faith well-founded? 

I think so. But the answer is complicated. 

For preservation efforts like those of the Internet Archive, fair use is a foundation, but not an unshakable one. Two recent cases highlight the risk, one against its book lending program and the other objecting to its “Great 78” record project. Both take issue with how the Archive provides access to preserved digital copies in its collections. While not directly attacking the preservation of those materials, the suits jeopardize their effective use. As archivists have long lamented, “preservation without access is pointless.”

Beyond direct challenges to fair use, archives are threatened by spurious takedown demands, content removal requests, and legal challenges. Organizations like the Internet Archive have fought back, but many institutions simply cannot afford to, leading to a chilling effect where preservation efforts are scaled back or abandoned altogether.

Compounding this uncertainty is the growing use of technological protection measures (TPMs) and digital rights management (DRM) systems that restrict access to digital works. Under the Digital Millennium Copyright Act (DMCA), circumventing these restrictions is illegal—even for lawful purposes like preservation or research. This creates a paradox where a researcher or archivist may have a clear fair use justification for accessing and copying a work, but breaking an encryption lock to do so could expose them to legal liability.

Additionally, the rise of contractual overrides—such as restrictive licensing agreements on digital platforms—threatens to sideline fair use entirely. Many modern works, including e-books, streaming media, and even scholarly databases, are governed by terms of service that explicitly prohibit copying or analysis, even for noncommercial research. These contracts often supersede fair use rights, leaving archivists and researchers with no legal recourse.

Still, there are reasons for optimism. Courts have generally ruled favorably when fair use is invoked for transformative purposes, such as digitization for research, searchability, and access for disabled users. Landmark decisions, like those in Authors Guild v. Google and Authors Guild v. HathiTrust, upheld fair use in the context of large-scale digital libraries and text-mining projects. These cases suggest that courts recognize the essential role fair use plays in making knowledge accessible, particularly in an era of vast digital information.

Fair Use and the Freedom to Extract 

One of copyright’s other traditional First Amendment protections is that the copyright monopoly does not extend to facts or ideas. Fair use is critical in giving life to this protection by ensuring that facts and ideas remain accessible, providing a “freedom to extract” (a term I borrow from law professor Molly Van Houweling’s recent scholarship) even when they are embedded within copyrighted works. 

Copyright does not and cannot grant exclusive control over facts, but in practice, extracting those facts often requires using the work in ways that implicate the rightsholder’s copyright. Whether journalists referencing past reporting, historians identifying truths in archival materials, or researchers analyzing a vast corpus of written works, fair use provides the necessary legal space to operate without running afoul of copyright protections for rightsholders. 

The need is more urgent than ever given the sheer scale of the modern historical record. In many cases, relying on individual researchers to sift through the record and extract important facts is impractical, if not impossible. Automated tools and processes, including AI and text and data mining tools, are now indispensable for processing, retrieving, and analyzing facts from massive amounts of text, images, and audio. From uncovering patterns in historical archives to verifying political statements against prior records, these tools serve as extensions of human analysis, making the extraction of factual information possible at an unprecedented scale. However, these technologies depend on fair use. If every instance of text or data mining required explicit permission from rights holders, who may have economic or political incentives to deny access, the ability to conduct meaningful research and discovery would be crippled.

For example, consider a researcher studying the roots of the opioid crisis, trying to mine the 4 million documents in the Opioid Industry Documents Archive—many of them legal materials, internal company communications, and regulatory filings. These documents, made public through litigation, provide critical insights into how pharmaceutical companies marketed opioids, downplayed their risks, and shaped public policy. But making sense of such a massive trove of records is impossible without computational tools that can analyze trends, track key players, and surface hidden patterns. 

Without fair use, researchers could face legal roadblocks to applying text and data mining techniques to extract the facts buried within these documents. If copyright law were used to restrict or complicate access to these records, it would not only hamper academic research but also shield corporate and governmental actors from exposure and accountability.

Conclusion

As information continues to proliferate across digital media, fair use remains one of the few safeguards ensuring that historical records and cultural artifacts do not become permanently locked away behind copyright barriers. It allows the past to be examined, challenged, and understood. If we allow excessive copyright restrictions to limit the ability to extract and analyze our shared past and culture, we risk not only stifling innovation but also eroding our collective ability to engage with history and truth.

Fair Use Week

This is my contribution to Fair Use Week. To read the other excellent posts from this week, check out Kyle Courtney’s Harvard Library Fair Use Week blog here.

The Public Interest Corpus: An Update and Opportunities for Co-Development 

Posted February 24, 2025
A Library salute to National Photography Month and the photographer’s skill for staging eye-catching compositions

In December 2024 we announced a new project to develop a public interest AI training corpus focused on books. Over the last few months we’ve been actively engaging a diverse set of stakeholders in the development of The Public Interest Corpus.

The Public Interest Corpus is focused on developing large-scale, high-quality AI training data from the world’s memory organizations that serve the public interest. In the aggregate, memory organizations like libraries and archives are in a prime position to address this need given a multi-century focus on developing high-quality, locally and globally comprehensive collections of books, newspapers, scholarly journals, photographs, manuscript materials, and more. We seek to prioritize uses of The Public Interest Corpus that promote learning, access to knowledge, and broad benefits to the public. 

Project Team and Advisory Board

The project team consists of Dave Hansen, Executive Director of Authors Alliance, and Dan Cohen, Vice Provost for Information Collaboration, Dean of the Library, and Professor of History at Northeastern University. In January, I joined the team as the Public Interest AI Strategist. In this capacity I will leverage extensive experience developing community around responsible computational use of memory organization collections as data and responsible AI. Giulia Taurino recently joined the team as Project Coordinator. Giulia holds a doctoral degree in Media Studies and Visual Arts from the University of Bologna and the University of Montreal and is currently a member of the NULab for Digital Humanities and Computational Social Science and of the AI & Arts interest group at The Alan Turing Institute.

The project team is guided by a strong advisory board composed of senior leaders and experts who think deeply about how authors, libraries, and AI can better serve the public interest. 

  • David Bamman, Associate Professor, UC Berkeley School of Information
  • Sandra Aya Enimil, Director of Scholarly Communications and Collection Strategy, Yale University Library
  • Mike Furlough, Executive Director, HathiTrust
  • David Smith, Associate Professor, Khoury College of Computer Sciences, Northeastern University
  • Claire Stewart, Dean of Libraries and University Librarian, University of Illinois, Urbana-Champaign 
  • Mehtab Khan, Assistant Professor of Law at Cleveland State University College of Law
  • Rachael Samberg, Director,  Scholarly Communications and Information Policy, UC Berkeley Library
  • Robin Sloan, NY Times best-selling science fiction author
  • Günter Waibel, Associate Vice Provost & Executive Director, California Digital Library
  • Martha Whitehead, Vice President for the Harvard Library and University Librarian, Harvard University
  • John Wilkin, CEO, LYRASIS
  • Suzanne Wones, University Librarian, UC Berkeley Library
  • Ted Underwood, Professor of Information Science and English, University of Illinois, Urbana-Champaign

How you can get involved 

Over the next year the project team will engage a diverse set of stakeholders in a co-development process that directly informs The Public Interest Corpus priorities, strategies, and partnerships. To kick things off we are holding a working event at Northeastern University Library in Boston, Massachusetts on March 3 where a group of senior library administrators, publishers, disciplinary researchers, authors, and technical experts will workshop core legal, technical, business model, and governance challenges. 

Moving forward we intend to hold additional focused in-person and virtual working events with a broad range of communities. We strongly believe that engaging with diverse stakeholders in a co-development process for this effort will be key to success. If you are interested in participating in a future event, hosting a Public Interest Corpus event, or have other ideas for how we might collaborate please let us know via the following form.

We look forward to advancing a public interest solution with you all.

Restricting Innovation: How Publisher Contracts Undermine Scholarly AI Research

Posted December 6, 2024
Photo by Josh Appel on Unsplash

This post is by Rachael Samberg, Director, Scholarly Communication & Information Policy, UC Berkeley Library and Dave Hansen, Executive Director, Authors Alliance

This post is about the research and the advancement of science and knowledge made impossible when publishers use contracts to limit researchers’ ability to use AI tools with scholarly works. 

Within the scholarly publishing community, mixed messages pervade about who gets to say when and how AI tools can be used for research reliant on scholarly works like journal articles or books. Some scholars voiced concern (explained more here) when major scholarly publishers like Wiley or Taylor & Francis entered lucrative contracts with big technology companies to allow for AI training without first seeking permission from authors. We suspect that these publishers have the legal right to do so since most publishers demand that authors hand over extensive rights in exchange for publishing their work. And with the backdrop of dozens of pending AI copyright lawsuits, who can blame the AI companies for paying for licenses, if for no other reason than avoiding the pain of litigation? While it stings to see the same large commercial, academic publishers profit yet again off of the work academic authors submit to them for free, we continue to think there are good ways for authors to retain a say in the matter. 

 Big tech companies are one thing, but what about scholarly research? What about the large and growing number of scholars who are themselves using scholarly copyrighted content with AI tools to conduct their research? We currently face a situation in which publishers are attempting to dictate how and when researchers can do that work, even when authors’ fair use rights to use and derive new understandings from scholarship clearly allow for such uses. 

How vendor contracts disadvantage US researchers

We have written elsewhere (in an explainer and public comment to the Copyright Office) why training AI tools, particularly in the scholarly and research context, constitutes a fair use under U.S. Copyright law. Critical for the advancement of knowledge, training AI is based on a statutory right already held by all scholarly authors engaging in computational research and one that lawmakers should preserve. 

The problem U.S. scholarly authors presently face with AI training is that publishers restrict their access to these statutory rights through contracts that override them: In the United States, publishers can use private contracts to take away statutory fair use rights that researchers would otherwise hold under Federal law. In this case, the private contracts at issue are the electronic resource (e-resource) license agreements that academic research libraries sign to secure campus access to electronic journal, e-book, data, and other content that scholars need for their computational research.

Contractual override of fair use is a problem that disparately disadvantages U.S. researchers. As we have described elsewhere, more than forty countries, including those in the European Union, expressly reserve text mining and AI training rights for scientific research by research institutions. Not only do scholars in these countries not have to worry whether their computational research with AI is permitted; they also do not risk having those reserved rights overridden by contract. The European Union’s Copyright Digital Single Market Directive and recent AI Act nullify any attempt to circumscribe the text and data mining and AI training rights reserved for scientific research within research organizations. U.S. scholars are not as fortunate.

In the U.S., most institutional e-resource licenses are negotiated and managed by research libraries, so it is imperative that scholars work closely with their libraries and advocate to preserve their computational research and AI training rights within the e-resource license agreements that universities sign. To that end, we have developed adaptable licensing language to support institutions in doing that nationwide. But while this language is helpful, the onus of advocacy and negotiation for those rights in the contracting process remains. Personally, we have found it helpful to explain to publishers that they must consent to these terms in the European Union, and can do so in the U.S. as well. That, combined with strong faculty and administrative support (such as at the University of California), makes for a strong stance against curtailment of these rights.

But we think there are additional practical ways for libraries to illustrate—both to publishers and scholarly authors—exactly what would happen to the advancement of knowledge if publishers’ licensing efforts to curtail AI training were successful. One way to do that is by “unpacking” or decoding a publisher’s proposed licensing restriction, and then demonstrating the impact that provision would have on research projects that were never objectionable to publishers before, and should not be now. We’ll take that approach below.

Decoding a publisher restriction

A commercial publisher recently proposed the following clause in an e-resource agreement:

Customer [the university] and its Authorized Users [the scholars] may not:

  1. directly or indirectly develop, train, program, improve, and/or enrich any artificial intelligence tool (“AI Tool”) accessible to anyone other than Customer and its Authorized Users, whether developed internally or provided by a third party; or
  2. reproduce or redistribute the Content to any third-party AI Tool, except to the extent limited portions of the Content are used solely for research and academic purposes (including to train an algorithm) and where the third-party AI Tool (a) is used locally in a self-hosted environment or closed hosted environment solely for use by Customer or Authorized Users; (b) is not trained or fine-tuned using the Content or any part thereof; and (c) does not share the Content or any part thereof with a third party.  

What does this mean?

  • The first paragraph forbids the training or improving of any AI tool if it’s accessible or released to third parties. It further forbids using any computational outputs or analysis derived from the licensed content to train any tool available to third parties.
  • The second paragraph is perhaps even more concerning. It provides that when using third-party AI tools of any kind, a scholar can use only limited portions of the licensed content with the tools, and is prohibited from doing any training at all of third-party tools even if it’s a non-generative AI tool and the scholar is performing the work in a completely closed and highly secure research environment.

What would the impact of such a restrictive licensing provision be on research? 

It would mean that every single one of the trained tools in the following projects could never be disseminated. In addition, for the projects below that used third-party AI tools, the research would have been prohibited full-stop because the third-party tools in those projects required training which the publisher above is attempting to prevent:

Tools that could not be disseminated

  1. In 2017, chemists created and trained a generative AI tool on 12,000 published research papers regarding synthesis conditions for metal oxides, so that the tool could identify anticipated chemical outputs and reactions for any given set of synthesis conditions entered into the tool. The generative tool they created is not capable of reproducing or redistributing any licensed content from the papers; it has merely learned conditions and outcomes and can predict chemical reactions based on those conditions and outcomes. And this beneficial tool would be prohibited from dissemination under the publisher’s terms identified above.
  2. In 2018, researchers trained an AI tool (that they had originally created in 2014) to understand whether a character is “masculine” or “feminine” by looking at the tacit assumptions expressed in words associated with that character. That tool can then look at other texts and identify masculine or feminine characters based on what it knows from having been trained before. The implications are that scholars can therefore use texts from different time periods with the tool to study representations of masculinity and femininity over time. No licensed content, no licensed or copyrighted books from a publisher can ever be released to the world by sharing the trained tool; the trained tool is merely capable of topic modeling—but the publisher’s above language would prohibit its dissemination nevertheless. 

Tools that could neither be trained nor disseminated 

  1. In 2019, authors used text from millions of books published over 100 years to analyze cultural meaning. They did this by training third-party non-generative AI word-embedding models called Word2Vec and GloVe on multiple textual archives. The tools cannot reproduce content: when shown new text, they merely represent words as numbers, or vectors, to evaluate or predict how similar words in a given space are semantically or linguistically. The similarity of words can reveal cultural shifts in understanding of socioeconomic factors like class over time. But the publisher’s above licensing terms would prohibit the training of the tools to begin with, much less the sharing of them to support further or different inquiry.
  2. In 2023, scholars trained a third-party-created open-source natural language processing (NLP) tool called Chemical Data Extractor (CDE). Among other things, CDE can be used to extract chemical information and properties identified in scholarly papers. In this case, the scholars wanted to teach CDE to parse a specific type of chemical information: metal-organic frameworks, or MoFs. Generally speaking, the CDE tool works by breaking sentences into “tokens” like parts of speech and referenced chemicals. By correlating tokens, one can determine that a particular chemical compound has certain synthetic properties, topologies, reactions with solvents, etc. The scholars trained CDE specifically to parse MoF names, synthesis methods, inorganic precursors, and more—and then exported the results into an open source database that identifies the MoF properties for each compound. Anyone can now use both the trained CDE tool and the database of MoF properties to ask different chemical property questions or identify additional MoF production pathways—thereby improving materials science for all. Neither the CDE tool nor the MoF database reproduces or contains the underlying scholarly papers that the tool learned from. Yet, neither the training of this third-party CDE tool nor its dissemination would be permitted under the publisher’s restrictive licensing language cited above.
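To make the word-embedding example above more concrete, here is a minimal Python sketch of what "representing words as vectors" and measuring their similarity looks like. It is an illustration only: the three-dimensional vectors and the function names are invented for this sketch and are not output from Word2Vec, GloVe, or any of the projects described. Real models learn vectors with hundreds of dimensions from a training corpus.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# A trained embedding model would learn these numbers from text.
TOY_VECTORS = {
    "wealthy": [0.9, 0.1, 0.3],
    "affluent": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.0],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Return the other word whose vector points in the closest direction."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


if __name__ == "__main__":
    print(most_similar("wealthy", TOY_VECTORS))
```

Note that the stored model consists only of numbers keyed by word. Nothing in it holds sentences or passages from the texts used for training, which is the point the examples above make about why such tools cannot reproduce licensed content.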

Indeed, there are hundreds of AI tools that scholars have trained and disseminated—tools that do not reproduce licensed content—and that scholars have created or fine-tuned to extract chemical information, recognize faces, decode conversations, infer character types, and so much more. Restrictive licensing language like that shown above suppresses research inquiries and societal benefits that these tools make possible. It may also disproportionately affect the advancement of knowledge in or about developing countries, which may lack the resources to secure licenses or be forced to rely on open-source or poorly-coded public data—hindering journalism, language translation, and language preservation.

Protecting access to facts

Why are some publishers doing this? Perhaps to reserve the opportunity to develop and license their own scholarship-trained AI tools, which they could then license at additional cost back to research institutions. We could speculate about motivations, but the upshot is that publishers have been pushing hard to foreclose scholars from training and disseminating AI tools that now “know” something based on the licensed content. That is, such publishers wish to prevent tools from learning facts about the licensed content.

However, this is precisely the purpose of licensing content. When institutions license content for their scholars to read, they are doing so for the scholars to learn information from the content. When scholars write about it or teach about the content, they are not regenerating the actual expression from the content—the part that is protected by copyright; rather the scholars are conveying the lessons learned from the content—facts not protected by copyright. Prohibiting the training of AI tools and the dissemination of those tools is functionally equivalent to prohibiting scholars from learning anything about the content that institutions are licensing for that very purpose, and that scholars have written to begin with! Publishers should not be able to monopolize the dissemination of information learned from scholarly content, and especially when that information is used non-commercially.

For these reasons, when we negotiate to preserve AI usage and training rights, we generally try to achieve the following outcomes which would promote—rather than prohibit—all of the research projects described above:

The sample language we’ve disseminated empowers others to negotiate for these outcomes. We hope that, when coupled with the advocacy tools we’ve provided above, scholars and libraries can protect their AI usage and training rights, while also being equipped to consider how they want their own works to be used.

The DMCA 1201 Rulemaking: Summary, Key Takeaways, and Other Items of Interest

Posted November 8, 2024

Last month, we blogged about the key takeaways from the 2024 TDM exemptions recently put in place by the Librarian of Congress, including how the 2024 exemptions (1) expand researchers’ access to existing corpora, (2) definitively allow the viewing and annotation of copyrighted materials for TDM research purposes, and (3) create new obligations for researchers to disclose security protocols to trade associations. Beyond these key changes, the TDM exemptions remain largely the same: researchers affiliated with universities are allowed to circumvent TPMs to compile corpora for TDM research, provided that those copies of copyrighted materials are legally obtained and adequate security protocols are put in place.

We have since updated our resources page on Text and Data Mining and have incorporated the new developments into our TDM report: Text and Data Mining Under U.S. Copyright Law: Landscape, Flaws & Recommendations.

In this blog post, we share some further reflections on the newly expanded TDM exemptions—including (1) the use of AI tools in TDM research, (2) outside researchers’ access to existing corpora, (3) the disclosure requirement, and (4) a potential TDM licensing market—as well as other insights that emerged during the 9th triennial rulemaking.

The TDM Exemption

In other jurisdictions, such as the EU, Singapore, and Japan, legal provisions that permit “text data mining” also allow a broad array of uses, such as general machine learning and generative AI model training. In the US, exemptions allowing TDM so far have not explicitly addressed whether AI could be used as a tool for conducting TDM research. In this round of rulemaking, we were able to gain clarity on how AI tools are allowed to aid TDM research. Advocates for the TDM exemptions provided ample examples of how machine learning and AI are key to conducting TDM research and asked that “generative AI” not be deemed categorically impermissible as a tool for TDM research. The Copyright Office agreed that a wide array of tools could be utilized for TDM research under the exemptions, including AI tools, as long as the purpose is to conduct “scholarly text and data mining research and teaching.” The Office was careful to limit its analysis to those uses and not address other applications such as compiling data—or reusing existing TDM corpora—for training generative AI models; those are an entirely separate issue from facilitating non-commercial TDM research.

Besides clarifying that AI tools are allowed for TDM research and that viewing and annotation are permitted for copyrighted materials, the new exemptions offer meaningful improvement to TDM researchers’ access to corpora. The previous 2021 exemptions allowed access for purposes of “collaboration,” but many researchers interpreted that narrowly, and the Office confirmed that “collaboration” was not meant to encompass outside research projects entirely unrelated to the original research for which the corpus was created. Under the 2021 exemptions, a TDM corpus could only be accessed by outside researchers if they were working on the same research project as the original compiler of the corpus. The 2024 exemptions’ expansion of access to existing corpora has two main components and advantages.

The expansion now allows for new research projects to be conducted on existing corpora, permitting institutions that have created a corpus to provide access “to researchers affiliated with other nonprofit institutions of higher education, with all access provided only through secure connections and on the condition of authenticated credentials, solely for purposes of text and data mining research or teaching.” At the same time, it also opens up new possibilities for researchers at institutions who otherwise would not have access, as the new exemption does not require a precondition that the outside researchers’ institutions otherwise own copies of works in the corpora. The new exemptions pose some important limitations: only researchers at institutions of higher education are allowed this access, and nothing more than “access” is allowed—it does not, for example, allow the transfer of a corpus for local use. 

The Office emphasized the need for adequate security protections, pointing back to cases such as Authors Guild v. Google and Authors Guild v. HathiTrust, which emphasized how careful both organizations were, respectively, to prevent their digitized corpora from being misused. To take advantage of this newly expanded TDM exemption, it will be crucial for universities to provide adequate IT support to ensure that technical barriers do not impede TDM researchers. That said, the record for the exemption shows that existing users are exceedingly conscientious when it comes to security. There have been zero reported instances of security breaches or lapses related to TDM corpora being compiled and used under the exemptions. 

As we previously explained, the security requirements are changed in a few ways. The new rule clarifies that trade associations can send inquiries on behalf of rightsholders. However, inquiries must be supported by a “reasonable belief” that the sender’s works are in a corpus being used for TDM research. It remains to be seen how the new obligation to disclose security measures to trade associations would impact TDM researchers and their institutions. The Register circuitously called out demands by trade associations sent to digital humanities researchers in the middle of the exemption process with a two-week response deadline as unreasonable and quoted NTIA (which provides input on the exemptions) in agreement that  “[t]he timing, targeting, and tenor of these requests [for institutions to disclose their security protocols] are disturbing.”  We are hopeful that this discouragement from the Copyright Office will prevent any future large-scale harassment towards TDM researchers and their institutions, but we will also remain vigilant in case trade associations were to abuse this new power. 

Alongside the concerns over disclosure requirements, we have some questions about the Copyright Office’s treatment of fair use as a rationale for circumventing TPMs for TDM research. The Register restated her 2021 conclusion that “under Authors Guild, Inc. v. HathiTrust, lost licensing revenue should only be considered ‘when the use serves as a substitute for the original.’” The Office, in its recommendations, placed considerable weight on the lack of a viable licensing market for TDM, which raises a concern that, in the Office’s view, a use that once was fair and legal might lose that status when the rightsholder starts to offer an adequate licensing option. While this may never become a real issue for the existing TDM exemptions (because no sufficient licensing options exist for TDM researchers and, given the breadth and depth of content needed, such options seem unlikely ever to develop), it nonetheless contributes to the growing confusion surrounding the stability of a fair use defense in the face of new licensing markets.

These concerns highlight the need for ongoing advocacy in the realm of TDM research. Overall, the Register of Copyrights recognizes TDM as “a relatively new field that is quickly evolving.” This means that we could ask the Library of Congress to relax the limitations placed on TDM if we can point to legitimate research-related purposes. But, due to the nature of this process, it also means TDM researchers do not have a permanent and stable right to circumvent TPMs. As the exemptions remain subject to review every three years, many large trade associations advocate for the TDM exemptions to be greatly limited or even canceled, wishing to stifle independent TDM research. We will continue to advocate for TDM researchers, as we did during the 8th and 9th triennial rulemaking.

Looking beyond the TDM exemption, we noted a few other developments: 

Warhol has not fundamentally changed fair use

First, the Opponents of the renewal of the existing exemptions repeatedly pointed to Warhol Foundation v. Goldsmith—the Supreme Court’s most recent fair use opinion—to argue that it has changed the fair use analysis such that the existing exemptions should not be renewed. For example, the Opponents argued that the fair use analysis for repairing medical devices changed under Warhol because, according to them, commercial nontransformative uses were less likely to be fair. The Copyright Office did not agree. The Register said that the same fair use analysis as in 2021 applied and that the Opponents failed “to show that the Warhol decision constitutes intervening legal precedent rendering the Office’s prior fair use analysis invalid.” In another instance where the Opponents tried to argue that commerciality must be given more weight under Warhol, the Register pointed out that under Warhol commerciality is not dispositive and must be weighed against the purpose of the new use.  The arguments for revisiting the 2021 fair use analyses were uniformly rejected, which we think is good news for those of us who believe Warhol should be read as making a modest adjustment to fair use and not a wholesale reworking of the fair use doctrine. 

Does ownership and control of copies matter for access? 

One of the requests before the Office was an expansion of an exemption that allows for access to preservation copies of computer programs and video games. The Office rejected the main thrust of the request but, in doing so, also provided an interesting clarification that may reveal some of the Office’s thinking about the relationship between fair use and access to copies owned by the user: 

“The Register concludes that proponents did not show that removing the single user limitation for preserved computer programs or permitting off-premises access to video games are likely to be noninfringing. She also notes the greater risk of market harm with removing the video game exemption’s premises limitation, given the market for legacy video games. She recommends clarifying the single copy restriction language to reflect that preservation institutions can allow a copy of a computer program to be accessed by as many individuals as there are circumvented copies legally owned.”

That sounds a lot like an endorsement of the idea that the owned-to-loaned ratio, a key concept in the controlled digital lending analysis, should matter in the fair use analysis (which is something the Hachette v. Internet Archive controlled digital lending court gave zero weight to). For future 1201 exemptions, we will have to wait and see whether the Office will use this framework in other contexts. 

Addressing other non-copyright and AI questions in the 1201 process

The Librarian of Congress’s final rule included a number of notes on issues not addressed by the rulemaking: 

“The Librarian is aware that the Register and her legal staff have invested a great deal of time over the past two years in analyzing the many issues underlying the 1201 process and proposed exemptions. 

Through this work, the Register has come to believe that the issue of research on artificial intelligence security and trustworthiness warrants more general Congressional and regulatory attention. The Librarian agrees with the Register in this assessment. As a regulatory process focused on technological protection measures for copyrighted content, section 1201 is ill-suited to address fundamental policy issues with new technologies.” 

Proponents tried to argue that the software platforms’ restrictions and barriers to conducting AI research, such as their account requirements, rate limits, and algorithmic safeguards, are circumventable TPMs under 1201, but the Register disagreed. The Register maintained that the challenges Proponents described arose not out of circumventable TPMs but out of third-party controlled Software as a Service platforms. This decision can be illuminating for TDM researchers seeking to conduct TDM research on online streaming media or social media posts.

The Librarian’s note went on to say: “The Librarian is further aware of the policy and legal issues involving a generalized ‘right to repair’ equipment with embedded software. These issues have now occupied the White House, Congress, state legislatures, federal agencies, the Copyright Office, and the general public through multiple rounds of 1201 rulemaking.

Copyright is but one piece in a national framework for ensuring the security, trustworthiness, and reliability of embedded software, as well as other copyright-protected technology that affects our daily lives. Issues such as these extend beyond the reach of 1201 and may require a broader solution, as noted by the NTIA.”

These notes give an interesting, though a bit confusing, insight into how the Librarian of Congress and the Copyright Office think about the role of 1201 rulemaking when they address issues that go beyond copyright’s core concerns. While we can agree that 1201 is ill-suited to address fundamental policy issues with new technology, it is also somewhat concerning that the Office and the Librarian view copyright more generally as part of a broader “national framework for ensuring the security, trustworthiness, and reliability of embedded software.”  While of course, copyright is sometimes used to further ends outside of its intended purpose, these issues are far from the core constitutional purpose of copyright law and we think they are best addressed through other means. 

Text Data Mining Research DMCA Exemption Renewed and Expanded

Posted October 25, 2024
U.S. Copyright Office 1201 Rulemaking Process, taken from https://www.copyright.gov/1201/

Earlier today, the Library of Congress, following recommendations from the U.S. Copyright Office, released its final rule adopting exemptions to the Digital Millennium Copyright Act’s prohibition on circumvention of technological protection measures (e.g., DRM).

As many of you know, we’ve been working closely with members of the text and data-mining community as well as our co-petitioners, the Library Copyright Alliance (LCA) and the American Association of University Professors (AAUP), to petition for renewal of the existing TDM research exemption and to expand it to allow researchers to share their research corpora with other researchers outside of their university (something not previously allowed). The process began over a year ago and followed an in-depth review process by the U.S. Copyright Office, and we’re incredibly grateful for the expert legal representation before the Office over this past year by UC Berkeley Law’s Samuelson Law, Technology & Public Policy Clinic, and in particular clinic faculty Erik Stallman, Jennifer Urban and Berkeley Law students Christian Howard-Sukhil, Zhudi Huang, and Matthew Cha.

We are very pleased to see that the Librarian of Congress both approved the renewal of the existing exemption and approved an expansion that allows for research universities to provide access to TDM corpora for use by researchers at other universities. 

The expanded rule is poised to make an immediate impact in helping the TDM researchers collaborate and build upon each other’s work. As Allison Cooper,  director of Kinolab and Associate Professor of Romance Languages and Literatures and Cinema Studies at Bowdoin College, explains:

“This decision will have an immediate impact on the ongoing close-up project that Joel Burges, Emily Sherwood, and I are working on by allowing us to collaborate with researchers like David Bamman, whose expertise in machine learning will be valuable in answering many of the ‘big picture’ questions about the close-up that have come up in our work so far.”

These are the main takeaways from the new rule: 

  • The exemption has been expanded to allow “access” to corpora by researchers at other institutions “solely for purposes of text and data mining research or teaching.” There is no more requirement that access be granted as part of a “collaboration,” so new researchers can ask new and different questions of a corpus. Access must be credentialed and authenticated.
  • The issue of whether a researcher can engage in “close viewing” of a copyrighted work has been resolved—as the explanation for the revised rule puts it, researchers can “view the contents of copyrighted works as part of their research, provided that any viewing that takes place is in furtherance of research objectives (e.g., processing or annotating works to prepare them for analysis) and not for the works’ expressive value.” This is a very helpful clarification!
  • The new rule also modified the existing security requirements, which provide that researchers must put in place adequate security protocols to protect TDM corpora from unauthorized reuse and must share information about those security protocols with rightsholders upon request. That rule has been limited in some ways and expanded in others. The new rule clarifies that trade associations can send inquiries on behalf of rightsholders. However, inquiries must be supported by a “reasonable belief” that the sender’s works are in a corpus being used for TDM research.

Later on, we will post a more in-depth analysis of the new rules–both TDM and others that apply to authors. The Librarian of Congress also authorized the renewal of a number of other rules that support research, teaching, and library preservation. Among them is a renewal of another exemption that Authors Alliance and AAUP petitioned for, allowing for the circumvention of digital locks when using motion picture excerpts in multi-media ebooks. 

Thank you to all of the many, many TDM researchers and librarians we’ve worked with over the last several years to help support this petition. 

You can learn more about TDM and our work on this issue through our TDM resources page, here.

Authors Alliance Submits Long-Form Comment to Copyright Office in Support of Petition to Expand Existing Text and Data Mining Exemption 

Posted January 29, 2024

Last month, Authors Alliance submitted detailed comments in response to the Copyright Office’s Notice of Proposed Rulemaking in support of our petition to expand the existing Digital Millennium Copyright Act (DMCA) exemptions that enable text and data mining (TDM) as part of this year’s §1201 rulemaking cycle.

To recap: our expansion petitions ask the Copyright Office to modify the existing TDM exemption so that researchers who assemble corpora of ebooks or films on which to conduct text and data mining are able to share that corpus with other academic researchers, where this second group of researchers qualifies under the exemption. Under the current exemption, academic researchers are only able to share their corpora with other qualified researchers for purposes of “collaboration and verification.” This simple change would eliminate the need for duplicative efforts to remove digital locks from ebooks and films, a time and resource-intensive process, broadening the group of academic researchers who are able to use the exemption. 

Our comment argues that the existing TDM exemption has begun to enable valuable digital humanities research and teaching, but that the proposed expansion would go much further towards enabling this research and helping TDM researchers reach their goals. The comment is accompanied by 13 letters of support from researchers, educators, and funding organizations, highlighting the research that has been done in reliance on the exemption, and explaining why this expansion is necessary. Our thanks go out to our stellar clinical team at UC Berkeley’s Samuelson Law, Technology & Public Policy Clinic—law students Mathew Cha and Zhudi Huang, and clinical supervisor Jennifer Urban—for writing and submitting this comment on our behalf. We are also grateful to our co-petitioners, the Library Copyright Alliance and American Association of University Professors, for their support on this comment. 

Ambiguity in “Collaboration”

One reason the expansion is necessary is the uncertainty over what constitutes “collaboration” under the existing exemption. Researchers have open questions about what level of individual contribution to a project would make researchers “collaborators” under the exemption. As our comment explains, collaboration can come in a number of different forms, from “formal collaborations under the auspice of a grant, [to] ad hoc collaborations that result from two teams discovering that they are working on similar material to the same ends, or even discussions at conferences between members of a loose network of scholars working on the same broad set of interests.” But it is not clear which of these activities is “collaboration” for the purposes of the exemption. And this uncertainty has had a chilling effect on the socially valuable research made possible by the exemption. 

Costly Corpora Creation 

Our comment also highlights the vast costs that go into creating a usable corpus for TDM research. Institutions whose researchers are conducting TDM research pursuant to the exemption must lawfully own the works in question, or license them through a license that is not time-limited. But these costs pale in comparison to the required computing resources—a cost which is compounded by the exemption’s strict security requirements—and human labor involved in bypassing technical protection measures and assembling a corpus. Moreover, it’s important to recognize that there is simply not a tremendous amount of grant funding or even institutional support available to TDM researchers. 

Because corpora are so costly to assemble and create, we believe it to be reasonable to permit researchers to share their corpora with researchers at other institutions who want to conduct independent TDM research on these corpora. As the exemption currently stands, researchers interested in pre-existing corpora must duplicate the efforts of the previous researchers, incurring massive costs along the way. We’ve already seen indications that these costs can lead researchers to avoid certain research questions and areas of study altogether. As our comment explains, this “duplicative circumvention” can be avoided by changing the language of the exemption to permit corpora sharing between qualified researchers at separate institutions. 

Equity Issues

Worse still, not all institutions are able to bear these expenses. Our comment explains how the current exemption’s prohibition on sharing beyond collaboration and verification—and consequent duplication of prior labor—“create[s] barriers that can prevent smaller and less-well-resourced institutions from conducting TDM research at all.” This creates inequity in what type of institutions can support TDM projects, and what types of researchers can conduct them. The unfortunate result has been that large institutions that have “the resources to compensate and maintain technical staff and infrastructure” are able to support TDM research under the exemption, while smaller institutions are not.

Values of Corpora Sharing

Our comment explains how allowing limited sharing of corpora under the exemption would go a long way towards lowering barriers to entry for TDM research and ameliorating the equity issues described above. Since digital humanities is already an under-resourced field, the effects of enabling researchers to share their corpora with other academic researchers could be quite profound. 

Researchers who wrote letters in support of the petition described a multitude of exciting projects, and have built “a rich set of corpora to study, such as a collection of fiction written by African American writers, a collection of books banned in the United States, and a curated corpus of movies and television with an ‘emphasis on racial, ethnic, sexual, and gender diversity.’” Many of those who wrote letters in support of our petition recounted requests they’ve gotten from other researchers to use their corpora, and who were frustrated that the exemption’s prohibition on non-collaborative sharing and their limited capacity for collaboration prevented them from sharing these corpora. 

Allowing new researchers with new research questions to study these corpora could reveal new insights about these bodies of work. As we explain, “in the same way a single literary work or motion picture can evince multiple meanings based on the lens of analysis used, when different researchers study one corpus, they are able to pose different research questions and apply different methodologies, ultimately revealing new and original findings . . . . Enabling broader sharing and thus, increasing the number of researchers that can study a corpus, will allow a body of works to be better understood beyond the initial ‘limited set of research questions.’”

Fair Use

The 1201 rulemaking process for exemptions to DMCA § 1201’s prohibition on breaking digital locks requires that the proposed activity be a fair use. In the 2021 proceedings, the Office recognized TDM for research and teaching purposes as a fair use. Because the expansion we’re seeking is relatively minor, our comment explains that the types of uses we are asking the Office to permit researchers to make are also fair uses. Our comment explains that each of the four fair use factors favors fair use in the context of the proposed expansion. We further explain why the enhanced sharing the expansion would provide does not harm the market for the original works under factor four: because institutions must lawfully own (or license under a non-time-limited license) the works that their researchers wish to conduct TDM on, it makes no difference from a market standpoint whether researchers bypass technical protection measures themselves, or share another institution’s corpus. Copyright holders are not harmed when researchers at one institution share a corpus created by researchers at another institution, since both institutions must purchase the works in order to be eligible under the exemption.

What’s Next?

If there are parties that oppose our proposed expansion, they have until February 20th to submit opposition comments to the Copyright Office. Then, on March 19th, our reply comments to any opposition comments will be due. We will keep our readers and members apprised as the process continues to move forward.

Copyright Office Recommends Renewal of the Existing Text Data Mining Exemptions for Literary Works and Films

Posted October 19, 2023

Authors Alliance is delighted to announce that the Copyright Office has recommended that the Librarian of Congress renew both of the exemptions to DMCA liability for text and data mining in its Notice of Proposed Rulemaking for this year’s DMCA exemptions, released today. While the Librarian of Congress could technically disagree with the recommendation to renew, this rarely if ever happens in practice. 

Renewal Petitions and Recommendations

Authors Alliance petitioned the Office to renew the exemptions in July, along with our co-petitioners the American Association of University Professors and the Library Copyright Alliance. Then, the Office entertained comments from stakeholders and the public at large who wished to make statements in support of or in opposition to renewal of the existing exemptions, before drawing conclusions about renewal in today’s notice. 

The Office did not receive any comments arguing against renewal of the TDM exemption for literary works distributed electronically; our petition was unopposed. The Office agreed with Authors Alliance and our co-petitioners, ARL and AAUP, observing that “researchers are actively relying on the current exemption” and citing to an example of such research that we highlighted in our petition. Apparently agreeing with our statement that there have not been “material changes in facts, law, technology, or other circumstances” since the 1201 rulemaking cycle when the exemption was originally obtained, the Office stated it intended to recommend that the exemption be renewed. 

Our renewal petition for the text and data mining exemption for motion pictures, which is identical to the literary works exemption in all aspects but the type of works involved, did receive one opposition comment, but the Copyright Office found that it did not meet the standard for meaningful opposition, and recommended renewal. DVD CCA (the DVD Copyright Control Association) and AACS LA (the Advanced Access Content System Licensing Administrator) submitted a joint comment arguing that a statement in our petition indicated that there had been a change in the facts surrounding the exemption. More specifically, they argued that our statement that “[c]ommercially licensed text and data mining products continue to be made available to research institutions” constituted an admission that new licensed databases for motion pictures had emerged since the previous rulemaking. DVD CCA and AACS LA did not actually offer any evidence of the emergence of new licensed databases for motion pictures. We believed this opposition comment was without merit—while licensed databases for text and data mining of audiovisual works are not as prevalent as licensed databases for text and data mining of text-based works, some were available during the 2021 rulemaking, and continue to be available today. We are pleased that the Office agreed, citing to the previous rulemaking record as supporting evidence.

Expansions and Next Steps

In addition to requesting that the Office renew the current exemptions, we (along with AAUP and LCA) also requested that the Office consider expanding these exemptions to enhance a researcher’s ability to share their corpus with other researchers that are not their direct collaborators. The two processes run in parallel, and today’s announcement means that even if we do not ultimately obtain expanded exemptions, the existing exemptions are very likely to be renewed. 

In its NPRM, the Office also announced deadlines for the various submissions that petitions for expansions and new exemptions will require. The first round of comments in support of our proposed expansion—including documentary evidence from researchers who are being adversely affected by the limited sharing permitted under the existing exemptions—will be due December 22nd. Opposition comments are due February 20, 2024. Reply comments to these opposition comments are then due March 24, 2024. Then, later in the spring, there will be a hearing with the Copyright Office regarding our proposed expansion. We will—as always—keep our readers apprised as the process moves forward.