On March 3, librarians, authors, publishers, and technologists gathered at Northeastern University Library in Boston to contribute to a startup plan for The Public Interest Corpus. The Public Interest Corpus is focused on supporting the creation of high-quality AI training data from memory organizations (e.g., libraries, archives, museums) and their partners (e.g., publishers) that advance the public interest. For too long, access to high-quality training data has been limited to the world’s most well-resourced organizations, pushing others toward data of lesser quality and comprehensiveness, and of dubious legality. Over the course of our day together, event participants made strong contributions to the development of The Public Interest Corpus startup plan.
Refining Principles and Goals
We began the day with an exercise focused on refining The Public Interest Corpus principles and goals. We felt this was a good place to start given that principles and goals held in common provide the foundation for collective action. Participants contributed a broad range of comments, edits, and suggestions that greatly strengthened project principles and goals. The project team is in the process of versioning this document and will have more to share down the line.
In the interim, we share some takeaways:
The Public Interest Corpus should maximize transparency. Participants called for transparency around corpus composition and emphasized how this could support reproducible research and the development of AI that is pluralistic and grounded in particular social contexts.
“Public Interest” is a compelling framing that needs to be made more concrete. Participants expressed a need for more specificity regarding target user communities served by The Public Interest Corpus and strategies that effectively balance public interests and commercial interests in The Public Interest Corpus.
Workshopping Core Challenges
Following the principles and goals activity, we broke participants into mixed stakeholder (author, publisher, legal expert, librarian, technical expert) groups. Group composition was shuffled once more in the afternoon to encourage continued novelty in ideation. Each group was presented with a set of questions to respond to that aligned with the following challenge areas: (1) Target Audiences, Training Data Needs, Potential Partnerships, (2) Legal and Policy, and (3) Business Model, Sustainability, and Governance. As with the principles and goals exercise, the project team is actively processing the product of group activity and will have more to share in the future.
In the interim, we share some takeaways:
Target Audiences, Training Data Needs, Potential Partnerships
Accessing in-copyright books for AI training purposes is extremely difficult for researchers. Challenges include, but are not limited to, the downstream impact of contractual override, organizational uncertainty in making fair use determinations, and multiple active court cases in the United States testing whether AI training is a fair use.
Focus on simple solutions. Multiple participants suggested that focusing on a simple solution was the best path forward – i.e., identify the most compelling minimum viable product and deliver on it. A solution could become more complex over time through phased development informed by user community studies.
Legal and Policy Challenges
Balancing what is legally permissible vs. meeting normative community expectations is essential. While it may be the case that AI training on in-copyright works is a fair use, this does not mean that a proposed solution should make works available for training without author or publisher engagement. This effort can learn from engagement with author communities to assess their views and preferences regarding the use of their work for AI training purposes. Authors Alliance continually engages with authors on this issue.
Pending court cases do not provide an insurmountable barrier to a solution. Though active copyright AI litigation is likely to continue for many years, participants believe there are a range of strategies that can be pursued that mitigate legal risk and support development of a solution that advances the public interest.
Business Model, Sustainability, and Governance
The Public Interest Corpus should develop multiple prospective business models and test for viability with stakeholders. Participants have indicated that it would be useful for stakeholders to engage with a range of business models with different revenue streams – e.g., membership model, philanthropically supported, commercially supported, hybrid, etc. Participants suggested paths forward that led to creation of a standalone organization or integration within an existing organization.
With an eye toward mission and policy alignment, business models should differentiate between non-commercial and commercial use of The Public Interest Corpus. Potential service costs should be responsive to resource disparities between non-commercial and commercial users. Potential service costs should also be responsive to resource variation within a prospective non-commercial user base.
Next Steps
We plan to continue engaging core challenges with stakeholders at our next workshop, to be held July 2025 in New York City. If you work in the region and are interested in potentially attending, please let us know here.
In addition to the July 2025 workshop, we will present on the project and/or hold additional working events across North America. The next presentation will be at the Coalition for Networked Information meeting in Milwaukee in April. To keep track of future community engagements, please refer to our engagements page.
A Recent Entrance to Paradise, an image generated by Stephen Thaler’s “Creativity Machine.”
Yesterday, the U.S. Court of Appeals for the District of Columbia Circuit issued its ruling in Thaler v. Perlmutter, a case centered on the question of whether a non-human machine, acting without any intervention from a human, could be an author and hold copyright under the U.S. Copyright Act. The court found that a non-human machine could not be an author under the Act.
In virtually every way, this decision should not be surprising. While it is absolutely conceivable that the product of AI and human collaboration may result in copyrightable works, it is well settled law that non-human authorship is not recognized under the U.S. Copyright Act. This opinion is mostly a repetition of the positions taken by the U.S. Copyright Office in its denial of registration.
That acknowledged, there are some points worth highlighting from the opinion:
First, the court centers much of its analysis on the text of the Copyright Act and the myriad ways in which the statutory language is dependent on humans as authors. Taken together, the Act is unarguably one that is built upon the premise of human authorship. The court says: “All of these statutory provisions collectively identify an “author” as a human being. Machines do not have property, traditional human lifespans, family members, domiciles, nationalities, mentes reae, or signatures.”
Part of the court’s analysis is focused on whether the public would benefit from granting copyright to machine-authored works and ultimately concludes that it would not. The court says: “But the Supreme Court has long held that copyright law is intended to benefit the public, not authors. Copyright law “makes reward to the owner a secondary consideration. ‘[T]he primary object in conferring the monopoly lie[s] in the general benefits derived by the public from the labors of authors.’””
It is important to remember that this opinion is only about the narrow question of whether a machine, working in isolation and with no human intervention, can be considered the author of a work. We should be careful not to try to extend this opinion beyond that. “Those line-drawing disagreements over how much artificial intelligence contributed to a particular human author’s work are neither here nor there in this case. That is because Dr. Thaler listed the Creativity Machine as the sole author of the work before us, and it is undeniably a machine, not a human being.”
Finally, the district court found that Dr. Thaler had waived the argument that, as creator of the Creativity Machine, he was the work’s author. The Court of Appeals found that Dr. Thaler had not challenged that waiver and that it therefore could not address the question of whether works generated by Artificial Intelligence might be authored by the creator of the AI. (“Dr. Thaler argues that he is the work’s author because he made and used the Creativity Machine. We cannot reach that argument.”) This leaves some ambiguity as to whether a future creator of an AI might successfully claim copyright in a work themselves. It also leaves open questions where the human user of AI claims to be the author of an AI-generated work or portions of a work. This is the question the court will have to address head-on in Allen v. Perlmutter, a case currently pending in Colorado. We will continue to watch this space, and share with you any new developments.
Ultimately, the Thaler v. Perlmutter decision is limited to the fact that a machine cannot be an author under copyright law. This is a sensible result and consistent with sound public policy.
We’ve heard from lots of authors with questions about AI licensing of their works by their publishers. Cambridge University Press is one that has been in the news because it has undertaken a project to ask authors to opt into a contract addendum that would allow CUP to license AI rights for their books, giving authors a royalty on AI licensing net revenue. Cambridge has shared an FAQ with authors already, along with a further explanation of its approach last September and a report in January highlighting that it had contacted some 17,000 authors, the majority of whom have opted in.
Below is an interview with Ben Denne, Director of Publishing, Academic Books, at Cambridge University Press, answering some questions about the program.
Dave: Thank you, Ben, for talking with me. To start off, could you say what your role is at Cambridge University Press?
Ben: I’m the Director of Publishing for the Academic Books part of the Academic Division of Cambridge. In short, I’m the director overseeing the whole of the books program, the Academic Books program for Cambridge, except for the Bibles. That’s a specialist unit that runs separately that I don’t have anything to do with, but that means our textbooks, our research and reference books, and then we have a kind of small program of more traditional academic titles that sell a bit more to a bit of a wider audience.
Dave: Thanks. My interest in talking with you is about generative AI licensing. And we’ve had quite a few authors actually forward us some emails that they’ve gotten from Cambridge presenting an AI license addendum to sign that goes with their contract and also an FAQ. I’d like to ask just a few questions about how that’s going and how that works.
What are Cambridge University Press’ goals with AI licensing?
Ben: That’s a really good question. Broadly speaking, when this started to come our way, which was a couple of years ago, at the same time as this subject became really noisy, we were looking at it and thinking, what’s the best way through this? How do we appropriately engage in this conversation? And I think it came back to us thinking about encouraging responsible use and thinking about our role as an academic publisher.
And I think our role as an academic publisher is to push the academic debate forward, which means that we want our authors’ books to get read. We want them to get used. We want them to get cited. I think that’s really the kind of spirit we came into this conversation with is thinking, these developments are happening, right, that they’re happening anyway and the best thing we can do as a publisher is try and engage with this debate and push it in a direction that we think really helps to underline those principles of how good research is done.
Dave: One of the things that I’ve seen with CUP’s rollout with this asking authors is, first of all, that you are asking authors. Could you talk me through that decision? We’ve seen some other publishers in the news just announce that they have licensing deals with technology companies, and there was no outreach to authors as far as we can tell from those publishers. So could you talk through that thought process of this outreach?
Ben: Sure, so for us, when we first looked at this, we have a contract that authors sign, which is, probably in many ways, very similar to contracts that they signed with other publishers, and it includes all sorts of clauses about use and wide ranging licensing rights. And one of the things it covers is derivative uses for content and the right to make derivatives. Our sense with that is when we looked at this in the context of these AI conversations and licensing, from a legal perspective, we looked at that and thought, well actually that derivative use clause technically does cover us for this kind of work. And I’m sure that’s the conclusion that some other people have reached too.
But we also thought, it just feels a bit like nobody knew that this kind of technology was emerging when they signed those contracts. And so from our perspective, we thought there’s a lot of noise about this subject in the whole ecosystem right now, you know, you can’t read the news without reading about AI, and people are nervous about it, understandably, and all of those kinds of things. So we felt that we should treat this as additional consent and approach it in that spirit. And that really underpins the decision to go out with the addendum for existing contracts.
I don’t want to jump onto any of your other questions, but that kind of principle, that we were going to ask for opt-ins, was important. Authors have to actively opt into this. We’re not saying to them, “If we don’t hear from you, we’ll assume you’ve opted in.” They have to actually come back to us and say that they’re happy for that use to happen.
Dave: I think one of the things a lot of people don’t think about is how complicated rights clearance is, especially at scale, across a title list that is the size that you have. So this seems to me like a pretty big investment in just doing this process. Could you say how many of these you have sent out? I gather that you’re doing this in batches, but do you have a sense of the scale of how many author addendum requests you anticipate making over the course of however long this process lasts?
Ben: It’s a really good question and it’s a moving target. At this stage, we have sent out multiple thousands. But I think we have about 45,000 books available in print and digitally at the moment. And we’re working our way through that list systematically. So we’re in the thousands, and you’re right: it is a pretty big undertaking. You know, it’s quite a logistical challenge to do. We had to set up a whole kind of new workflow for doing this. We have a team that are working on the addenda and addressing the questions that authors have and all of those kinds of things.
Dave: This is maybe getting in the weeds, but it seems to me like there’s a pretty big difference between figuring this out for a sole-authored, single-part monograph, for instance, which is mostly what I’ve seen come through, and edited volumes. Have you tried to figure out those more complex books with multiple authors, multiple works within them?
Ben: Yeah, so the way it’s working for us is where we have several contracted authors for a book, we’re contacting them all and all of those authors have to opt in in order for us to agree that we have the licensing rights.
For edited volumes with multiple contributors, we’re not contacting the individual contributors for opt-in and there are a couple of different reasons for that. Typically, they don’t get paid royalties and also it would just be impossible for us to do. I mean, that’s logistically, you know, that’s a huge ask. So what we are doing is we are still contacting the editors for those volumes and the editors will opt in or not. So if the editor opts in, our understanding is that they’re opting in on behalf of the contributors as well.
But for multi-authored works, we get in touch with all of them. And in fact, we have quite a sizable number of books which are stuck because we’ve had some authors opt-in and some authors not opt-in.
Dave: This is a pretty fast-moving technology and I think a lot of authors are feeling just uncertain right now. And so I wonder about the opt-in window, if an author declines to opt in right now, is that it? Is there an opportunity to come back later after the dust settles and say, oh, no, actually, you know, I’d be happy to have my work used in this way?
Ben: Yeah, definitely. We’re in the process of putting something in place so that if authors don’t opt in now they are able to come back and opt in later. And by the way, if they don’t opt in, that’s fine for all the reasons that you just said; some people are queasy about this and that’s okay. We’re not putting a hard sell on it.
My sense with this is that for some of the people that we’re speaking to who haven’t opted in, it is because they haven’t yet really seen what the kind of use cases are for this kind of technology. Perhaps as those become more public, people will want to come back and opt in.
I think some of the things that are out there are going to be quite powerful discovery tools in the future. So we want to make sure the authors do have the opportunity to opt in later if they want to, although we can’t, of course, be sure that if people opt in later the same opportunities will necessarily be available then, since this is quite a fast moving area.
Dave: For your contracts moving forward for front list books, is a clause like this now a default in those agreements or will authors of new books have the option to opt-in or opt-out for AI licensing?
Ben: Good question. Currently, we have put a clause into our contracts to add AI licensing. But, where authors are asking us to remove that clause, we’re taking it out.
And again, coming back to your point before, those authors could opt in later. But for the contracts as they go out, we have it in as a clause now.
Dave: Okay. So let’s shift to if you’re gathering all of these rights from authors, presumably at some point, then you would actually engage in the licensing with technology companies or others. Could you say a little bit about that? Do you have any deals in place with tech companies already? Or, the other thing that I’ve seen is, some publishers have been in the position of not doing those deals directly, but having sort of sub-licensing deals with others. I understand ProQuest Clarivate is doing this, and I think Wiley is as well. Do you have any of those deals in place now?
Ben: We’re still having those conversations at the moment. And we are talking to a range of different people who are looking at this kind of content.
Dave: Okay, that’s really helpful to know.
At the beginning, you talked a little bit about Cambridge University Press’s motivations with engaging in this space and doing licensing. Could you talk a little bit about important factors for what might show up in one of those kinds of deals with tech companies? For instance, one of the things that I think aligns with the sort of values that you outlined at the beginning and that authors care a lot about is credit, right? We know that, especially for academic authors, credit is incredibly valuable and important. And so I wonder if you’ve thought about how ensuring author credit might factor into any sort of downstream deal that CUP might engage in?
Ben: Absolutely. So we’re having exactly those conversations at the moment with anybody that we’re talking to. And we’ve been very clear with our authors when they’ve asked questions about this, and you may have seen this alluded to in some of the information that you’ve had forwarded to you from authors, that those principles of attribution are 100% what we’re focused on. Really, they’re kind of a red line for us.
One of the things we’ve been discussing in lots of conversations with people around this technology is the question of at what level content needs to be attributed. Our sense with this is that any kind of meaningful extract from somebody else’s work needs to be cited.
I’m kind of repeating myself, but that’s how research works. People build on other people’s work, and so in a scenario where content is being ‘discovered’, if we can’t identify and cite that content, it can’t be accurately attributed. So that’s a red line for us.
Dave: Right. I think figuring out that attribution, like at what level does that attribution need to kick in, is a really tricky thing. It seems to me, that if you’ve got a foundation model that is pulling in some texts and then someone’s using, say ChatGPT to write emails and somewhere in the model it gleans some structural components from sources like academic books, I don’t think that’s the thing most authors care about – being cited for the fact that you help train this model to understand how to format citations or do other things like that. It’s the intellectual content that matters and that’s the really tricky piece of it.
Ben: Absolutely and I don’t have an easy answer for you there. So we’re having those conversations at the moment, but our sense is that any sort of direct quote, anything that could be, you know, anything that you would consider to be plagiarism or worthy of credit in a non-AI world should be attributed.
Dave: I realize this question is asking a hypothetical because you don’t have any of these agreements in place yet, but it seems to me there’s a pretty big difference between use of Cambridge books for model training and uses such as for Retrieval Augmented Generation (RAG).
Have you thought about those distinctions in terms of how that might affect differences in Cambridge’s willingness to set a price on those things? I assume retrieval-augmented generation (RAG) would come with a higher licensing price than others. But could you talk me through that thought process?
Ben: So it’s kind of interesting because I think there’s a little bit of a gray area, because I think a lot of the RAG tools are combined with some aspect of LLM. So they might be looking to summarize some research or write a brief about X, Y, and Z.
I think it is quite interesting at the moment that most of the questions we get from people who are worried about this are really anxious about LLMs, but I feel like the really exciting place for academia and research is around that kind of retrieval augmented generation because that’s what’s going to help with discoverability for authors. It is difficult to talk about at the moment because we don’t have any public deals that I can point to. But I’d say a lot of the conversations that we’re having are somewhere between those two things, you know, so it’s a combination of an LLM that’s generating text and a citation engine or discovery engine sitting over content.
Dave: Leaving aside the legal situation for a moment, one of the things that I hear from authors pretty consistently is the sentiment that with these big technology companies coming in, they feel that these companies are sort of profiting off of content that they are exploiting. And so they ought to return something to the system and to authors.
But there’s a really different sentiment about what happens when you have, say, academic researchers using content for AI or text data mining purposes to make new discoveries or learn new things both about the texts and about the world around them. We work a lot with text data mining researchers who are interested in large aggregations of content, not so they can build the next OpenAI, but so they can understand how language has changed over time, or how has culture changed over time.
I wonder from CUP’s perspective, how do those two different kinds of use cases factor into your thinking about downstream licensing deals for AI/ text data mining?
Ben: Yeah, I think for us that the primary thing we’re really trying to lean on, because of course the whole thing is not quite that clear cut, because a lot of the time it’s the big tech companies that are facilitating a lot of that discovery or that a lot of the kind of discovery traffic goes through them. So I think from our perspective, I’m going to say we’re not ruling out working with anyone. We would put anybody– any partner that we had– through the same diligence process that we would have with onboarding anybody else, but we wouldn’t rule out those conversations with anybody. I think for us, the most important thing is coming back to, and I’m going to sound like a stuck record here, but those principles of attribution. And we have had conversations, some preliminary conversations with people who’ve said, “Well, we don’t think it would be possible to do what you’re asking,” and at that point, we’re saying, “well, okay, then you know that’s the red line for us.”
I think there’s quite a bit of cloudy territory between those two things. And I think for us, the most important thing is to make sure that authors are being credited where their work’s being used.
Dave: All right, I have a hypothetical that I wanted to give you. So we see that it’s a 20% royalty calculated on net revenue. Let’s say you received $5 million from an AI licensing deal. Can you walk me through how that might work out for the author? How do you calculate net revenue on that? And then, for the individual author sitting there who sees that CUP has signed a big deal, what can they expect?
Ben: That’s a tricky one because it would depend a little bit on the terms of the deal as well. But broadly speaking, the principle is, if that’s the net revenue that we receive (so in your situation, you had five million in there), the full licensing payment is divided out across the list of titles. Authors then earn the royalty for that sale or license type per title, as they do now with all other forms of licensing.
But, then, where a licensee can provide accurate title-level usage within their royalty statements, this would instead be used. So in an LLM situation that you were just talking about, that would be divided among those books. With the retrieval augmented generation tool, I think that would work much more around the basis of usage. So, depending on what searches within that tool were bringing back particular content, then we would be attributing revenue that way.
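[Note: to make the arithmetic concrete, here is a rough sketch in Python of the two allocation approaches Ben describes. Only the 20% royalty rate and the $5 million figure come from the interview. The 10,000-title list, the equal per-title split, the proportional-to-usage split, and all function names are illustrative assumptions, not CUP’s actual method; the interview does not say how a payment is weighted across titles or how a royalty is shared among co-authors.]

```python
from fractions import Fraction

# 20% royalty on net revenue, as stated in the interview question.
ROYALTY_RATE = Fraction(20, 100)


def equal_split(net_revenue, titles):
    """ASSUMPTION: with no usage data, every licensed title gets the same share."""
    share = Fraction(net_revenue) / len(titles)
    return {title: share for title in titles}


def usage_split(net_revenue, usage_by_title):
    """ASSUMPTION: with title-level usage data, shares are proportional to usage."""
    total = sum(usage_by_title.values())
    if total == 0:
        raise ValueError("no usage reported; fall back to another allocation")
    return {
        title: Fraction(net_revenue) * usage / total
        for title, usage in usage_by_title.items()
    }


def author_royalty(title_revenue, rate=ROYALTY_RATE):
    """Royalty owed on one title's share of net revenue."""
    return title_revenue * rate


# Hypothetical list of 10,000 opted-in titles sharing a $5 million payment.
titles = [f"title-{i}" for i in range(10_000)]
per_title = equal_split(5_000_000, titles)
print(float(per_title["title-0"]))                   # 500.0 per title
print(float(author_royalty(per_title["title-0"])))   # 100.0 royalty per title

# Hypothetical usage-based deal: $1,000 split over two titles by retrieval counts.
by_usage = usage_split(1_000, {"a": 3, "b": 1})
print(float(author_royalty(by_usage["a"])))          # 150.0
```

Under these assumptions, a $5 million deal spread evenly over 10,000 titles yields $500 per title and a $100 royalty per title, while a usage-based deal shifts revenue toward the titles that are actually retrieved.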
Dave: Okay, that makes a lot of sense. I think this was in the FAQ: one of your use cases is in an authoritative database that’s used on a perpetual basis. But there was somewhere that talked about the removal of content once a licensing term has ended. I wonder if you’ve developed thinking internally about what a standard term would be, how long these things might last?
Ben: Yeah, I mean, it’s hard, isn’t it? Because where you’re licensing content to train an LLM, it would be sort of insincere to dress that up. Generally most agreements would be governed by a 2-5 year training term, and at the end of that term the training data set would be destroyed; however, the licensee would retain the output from the specific models that were developed during the training term. If they wanted to create new models they would need to renew the license/extend the term.
For some of the other uses that’s all being discussed at the moment. I think there is still work on this, but there would be standard partnership length terms. What I would say is that from our perspective, we think it’s quite likely in the next few years, the focus will move more away from training large language models and into that area of discovery, and that these are going to become quite important revenue streams for academic publishers.
Dave: Thanks, very helpful. As you work on these deals, what level of transparency do you plan on offering authors or the general public about what these licenses might look like? At least with other publishers, it’s been quite mysterious – I think with one, we learned about an AI licensing deal in a quarterly earnings report, for instance. I think authors do really care about what the details of these deals look like.
Ben: It’s tricky, isn’t it? It’s hard for me to talk about a deal that hasn’t been done already, and of course, these deals can be subject to the same commercial confidentiality requirements as any other partnership. But I think it’s fair to say that Cambridge University Press would endeavor to be pretty transparent about what we’re doing generally and most importantly, be transparent about why we’re doing it. So I don’t think we’d be concealing that information from anybody. And coming back to my point before, we’ve been quite clear that we only want to enter into these kinds of conversations with people that we think are using content responsibly, and we’d always aim to be open.
Dave: A few final questions. First, CUP has published a number of open-access books. For example, I believe CUP was part of the TOME initiative. Do you feel like this kind of addendum is necessary for those open-access books, given that they already have some sort of open license attached to them? Or do you think that this is a necessary addition to those OA licenses?
Ben: That’s a really good question, and it’s something that we’re grappling with at the moment. Without getting into the kind of weeds around open access, some of it depends on the license. Historically for books, our default open access license was a Creative Commons CC BY NC license, which prohibits commercial reuse. I think at the moment, we’re looking at that (and I think a lot of publishers would say the same thing) and working through how that fits with AI licensing with commercial AI companies. The short answer to your question is if you have a CC BY license, then, people do have a broad license to reuse that content. So at the moment, we’re not actively going after those authors for opt-ins, nor are we including those books in licensing deals that we’re doing, but that’s also a relatively small number of books.
I can say, we are now looking at using more CC-BY-NC-ND as the default, which restricts the creation of derivative works. You’ve touched on a conversation that is evolving, but we would be treating AI usage as requiring a derivative license and therefore not covered under a CC-BY-NC-ND license.
Dave: Thanks, that’s very helpful and I think that’s something a lot of authors are trying to figure out: how does AI downstream use factor into Creative Commons licensed works? And of course, the underlying legal situation matters. I didn’t ask, but I assume that the rights that you’re asking for in this addendum are worldwide, since that affects for example whether usage might be permitted under national law.
Ben: Yes, the rights are worldwide.
And thinking again about that, I mean, it’s interesting, isn’t it? Because even under the CC-BY license, it doubles down on that principle of attribution as well. That’s the nature of the license so some uses even then may not be covered by that license.
Dave: Right. That attribution piece under the CC-BY license will be an important one [note: this issue is being litigated, most prominently in the Doe v. GitHub suit]. And then, there’s also the underlying question of what the law allows independently even if there is no license–open license or otherwise. I know right now there’s a consultation that just closed in the UK about what the law should be, and in the US, we’re fighting these things out in the courts. I think there are 39 lawsuits right now pending about various aspects of this, and a key question in most of them is just how far fair use goes. And of course, you know, if fair use applies then you don’t have to worry too much about what the license says, whether it’s CC BY or CC BY NC ND or anything else. This is like reading tea leaves but I think the prevailing case law indicates that model training and coming up with the weights has a pretty strong fair use case, but for the output side, that’s where I think it starts to stumble a little bit when you’ve got systems that are producing outputs that are substantially similar to the inputs. So I wouldn’t be surprised if in some of these suits, we get a ruling in favor of fair use and then in some of them we get a different outcome. And then, the landscape is just sort of messy.
And I suppose in the UK, I imagine y’all are watching what that legal landscape looks like around the world as it’s changing.
Ben: Yeah, absolutely.
Dave: One final question: we’ve talked a lot about licensing books for AI, but CUP has a substantial journal portfolio as well. Can you say anything about CUP’s approach to use of journal content either as AI training data or for other AI uses?
Ben: We’ve been more focussed on books, as this is where most of the demand has been to date, but we have seen a developing interest in journal content. We are, therefore, currently exploring this form of licensing in a consultative way with our journal partners.
Dave: Well, thank you for talking. And this was really, really helpful. And I think that this will be useful for authors who are trying to understand just more about what’s going on.
Today, we submitted a response to a Request for Information from the Office of Science and Technology Policy (OSTP). The OSTP is seeking to develop an “AI Action Plan,” to sustain and accelerate the development of AI in the United States. As an organization dedicated to advancing the interests of authors who wish to share their works broadly for the public good, we felt it imperative to weigh in on critical copyright and policy issues impacting AI innovation and access to knowledge.
In our response, we reaffirmed our belief that the use of copyrighted works specifically for AI training (distinct from other AI uses) is a quintessential fair use. We noted that Section 1202(b) of the Copyright Act has little utility and serves as an unnecessary stumbling block to the development of AI. We also highlighted the importance of high-quality training data and pointed towards the work that is already being done to develop AI training corpora.
A Few Key Points from Our Submission
Our response to the OSTP highlights several key areas where federal policy can support both authors and a thriving AI research environment:
1. The Role of Fair Use in AI Model Training
We emphasize that fair use has long been a cornerstone of innovation in the U.S., enabling everything from web search engines to digitization projects. US copyright law has played a major role both in developing the incredible creative industries based in the US and in driving leading scientific research and commercial innovation. The key to this innovation policy has been a thoughtful balance between providing a degree of control over copyrighted works to copyright holders and allowing for flexibility when it comes to technological innovation and new transformative uses. AI development relies on the ability to analyze large datasets, many of which include copyrighted materials. The uncertainty surrounding the legal status of AI training data due to ongoing litigation threatens to slow innovation. We urge the federal government to explicitly support the application of fair use to AI training and provide much-needed clarity.
2. Addressing the Contractual Override of Fair Use
Many AI developers face contractual barriers that limit their ability to make fair use of content, particularly in text and data mining applications. We recommend legislative measures to prevent contracts from overriding fair use rights, ensuring that AI researchers and developers can continue innovating without undue restrictions.
3. Access to High Quality Datasets
Access to high-quality datasets is a foundational pillar for AI development, enabling models to learn, refine, and iteratively improve. However, the availability of such datasets is often hindered by restrictive licensing agreements, proprietary controls, and inconsistent data standards. To maximize the potential of AI while ensuring ethical and legally sound development, collaborations between academic institutions, libraries, public archives, and technology developers are essential. Government policies should facilitate public-private partnerships that allow for robust and thoughtfully curated datasets, ensuring that AI systems are trained on a rich range of representative materials.
We invite our community of authors, researchers, and policymakers to review our submission. Your engagement is crucial in shaping a responsible and forward-thinking AI policy in the U.S. You can always reach us at info@authorsalliance.org.
Some district courts have applied DMCA 1202(b) to physical copies, including textiles, which means that if you cut off parts of a fabric that contain copyright information, you could be liable for up to $25,000 in damages.
The US Copyright Act has never been praised for its clarity or its intuitive simplicity—at a whopping 460 pages long, it is filled with hotly debated ambiguities and overly complex provisions. The copyright laws of most other jurisdictions aren’t much better.
Because of this complexity of copyright law, the implications of changes to copyright law and policy are not always clear to most authors. As we’ve said in the past, many of these issues seem arcane, and largely escape public attention. Yet entities with a vested interest in maximalist copyright—often at odds with the public interest—are certainly paying attention, and often claim to speak for all authors when they in fact represent only a small subset. As part of our efforts to advocate for a future where copyright law offers ample clarity, certainty, and real focus on values such as the advancement of knowledge and free expression, we would like to share with you two recent projects we undertook:
The 1202 Issue Brief and Amicus Brief in Doe v. GitHub
Authors Alliance has been closely monitoring the impact of Digital Millennium Copyright Act (DMCA) Section 1202. As we have explained in a previous post, Section 1202(b) creates liability for those who remove or alter copyright management information (CMI) or distribute works with removed CMI. This provision, originally intended to prevent widespread piracy, has been increasingly invoked in AI copyright lawsuits, raising significant concerns for lawful use of copyrighted materials beyond training AI. While on its face, penalties for removing CMI might seem somewhat reasonable, the scope of CMI (including a wide variety of information such as website terms of service, affiliate links, and other information) combined with the challenge of including it with all downstream distribution of incomplete copies (imagine if you had to replicate and distribute something like the Amazon Kindle terms of service every time you quoted text from an ebook) could be very disruptive for many users.
In order to address the confusion regarding the (somewhat inaptly named) “identicality requirement” applied by the courts in the 9th Circuit, we have released an issue brief, as well as undertaken to file an amicus brief in the Doe v. GitHub case now pending in the 9th Circuit.
Here are the key reasons why we care—and why you should care—about this seemingly obscure issue:
The Precedential Nature of Doe v. GitHub: The upcoming 9th Circuit case, Doe v. GitHub, will address whether Section 1202(b) should only apply when copies made or distributed are identical (or nearly identical) to the original. Lower courts have upheld this identicality requirement to prevent overbroad applications of the law, and the appellate ruling may set a crucial precedent for AI and fair use.
Potential Impact on Otherwise Legal Uses: It is not entirely certain if fair use is a defense to 1202(b) claims. If the identicality requirement is removed, Section 1202(b) could create liability for transformative fair uses, snippet reuse, text and data mining, and other lawful applications. This would introduce uncertainty for authors, researchers, and educators who rely on copyrighted materials in limited, legal ways. We advocate for maintaining the identicality requirement and clarifying that fair use applies as a defense to Section 1202 claims.
Possibility of Frivolous Litigation: Section 1202(b) claims have surged in recent years, particularly in AI-related lawsuits. The statute’s vague language and broad applicability have raised fears that opportunistic litigants could use it to chill innovation, scholarship, and creative expression.
To find out more about what’s at stake, please take a look at our 1202(b) Issue Brief. You are also invited to share your stories with us, on how you have navigated this strange statute.
Reply to the UK Open Consultation on Copyright and AI
We have members in the UK, and many of our US-based members publish in the UK. We have been watching developments in UK copyright law closely, and recently submitted a comment to the UK Open Consultation on Copyright and AI. In our comment, we emphasized the importance of ensuring that copyright policy serves the public interest. Our response’s key points include:
Competition Concerns: We alerted policymakers that their top objectives must include preventing monopolies from forming in the AI space. If licensing for AI training becomes the norm, we foresee power consolidating in a handful of tech companies and their unbridled monopoly permeating all aspects of our lives within a few decades, if not sooner.
Fair Use as a Guiding Principle: We strongly believe that the use of works in the training and development of AI models constitutes fair use under US law. While this issue is currently being tested in courts, case law suggests that fair use will prevail, ensuring that AI training on copyrighted works remains permissible. The UK does not have an identical fair use statute, but has recognized that some of its functions—such as flexibility to permit new technological uses—are valuable. We argue that the wise approach is for the UK to update its laws to ensure its creative and tech sectors can meaningfully participate in the global arena. Our comment called for a broad AI and TDM exception allowing temporary copies of copyrighted works for AI training. We emphasized that when AI models extract uncopyrightable elements, such as facts and ideas, this should remain lawful and protected.
Noncommercial Research Should Be Protected: We strongly advocated for the protection of noncommercial AI research, arguing that academic institutions and their researchers should not face legal barriers when using copyrighted works to train AI models for research purposes. Imposing additional licensing requirements would place undue burdens on academic institutions, which already pay significant fees to access research materials.
How is artificial intelligence reshaping intellectual property law? And what role does copyright play in the global AI race? Join us for a thought-provoking discussion on Copyright, AI, and Great Power Competition, a new paper by Joshua Levine and Tim Hwang that explores how different nations approach AI policy and copyright regulation—and what’s at stake in the battle for technological dominance.
This event will bring together experts to examine key legal, economic, and geopolitical questions, including:
How do copyright laws affect AI innovation?
What are the competing regulatory approaches of the U.S., China, and the EU?
How should policymakers balance creators’ rights with AI development?
Whether you’re a legal scholar, technologist, policymaker, or just curious about the intersection of AI and copyright, this conversation is not to be missed!
DOWNLOAD
Download Copyright, AI, and Great Power Competition.
ABOUT OUR SPEAKERS
JOSHUA LEVINE is a Research Fellow at the Foundation for American Innovation. His work focuses on policies that foster digital competition and interoperability in digital markets, online expression, and emerging technologies. Before joining FAI, Josh was a Technology and Innovation Policy Analyst at the American Action Forum, where he focused on competition in digital markets, data privacy, and artificial intelligence. He holds a BA in Political Economy from Tulane University and lives in Washington, D.C.
TIM HWANG is General Counsel and a Senior Fellow at the Foundation for American Innovation focused on the intersection of artificial intelligence and intellectual property. He is also a Senior Technology Fellow at the Institute for Progress, where he runs Macroscience. Previously, Hwang served as the General Counsel and VP Operations at Substack, as well as the global public policy lead for Google on artificial intelligence and machine learning. He is the author of Subprime Attention Crisis, a book about the structural vulnerabilities in the market for programmatic advertising.
Dubbed “The Busiest Man on the Internet” by Forbes Magazine, his current research focuses on global competition in artificial intelligence and the political economy of metascience. He holds a J.D. from Berkeley Law School and a B.A. from Harvard College.
Caption: 451 is the HTTP status code returned when a webpage is unavailable for legal reasons; it is also the temperature at which books catch fire and burn. This public domain image was taken inside the Internet Archive.
Imagine this: a high-profile aerospace and media billionaire threatens to sue you for writing an unauthorized and unflattering biography. In the course of writing, you rely on several news articles, including a series of in-depth pieces about the billionaire’s life written over a decade earlier. Given their closeness in time to real events, you quote, sometimes extensively, from those articles in several places.
On the eve of publication, your manuscript is leaked. Through one of his associated companies, the billionaire buys up the copyrights to the articles from which you quote. The next day the company files an infringement lawsuit against you.
Copyright Censorship: a Time-Honored Tradition
It’s easy to imagine such a suit brought by a modern billionaire—perhaps Elon Musk or Jeff Bezos. But using copyright as a tool for censorship is a time-honored tradition. In this case, Howard Hughes tried it out in 1966, using his company Rosemont Enterprises to file suit against Random House for a biography it would eventually publish.
As we’ve seen many times before and since, the courts turned to copyright’s “fair use” right to rescue the biography from censorship. Fair use, the court explained, exists so that “courts in passing upon particular claims of infringement must occasionally subordinate the copyright holder’s interest in a maximum financial return to the greater public interest in the development of art, science and industry.”
Singling out the biographical nature of the work and its importance in surfacing underlying facts, the court explained:
Biographies, of course, are fundamentally personal histories and it is both reasonable and customary for biographers to refer to and utilize earlier works dealing with the subject of the work and occasionally to quote directly from such works. . . . This practice is permitted because of the public benefit in encouraging the development of historical and biographical works and their public distribution, e.g., so “that the world may not be deprived of improvements, or the progress of the arts be retarded.”
Fair use playing this role is no accident. As the Supreme Court has explained, the relationship between copyright and free expression is complicated. On the one hand, the Court has explained, “[T]he Framers intended copyright itself to be the engine of free expression. By establishing a marketable right to the use of one’s expression, copyright supplies the economic incentive to create and disseminate ideas.” But, recognizing that such exclusive control over expression could chill the very speech copyright seeks to enable, the law contains what the Court has described as two “traditional First Amendment safeguards” to ensure that facts and ideas remain available for free reuse: 1) protections against control over facts and ideas, and 2) fair use.
But rescuing a biography that merely quotes, even extensively, from earlier articles seems like an easy call, especially when the plaintiff has so clearly engineered the copyright suit not to protect legitimate economic interests but to suppress an unpopular narrative.
The world is a little more complicated now. Can fair use continue to protect free expression from excessive enforcement of copyright? I think so, but two key areas are at risk:
Fair Use and the Archives
It may have escaped your notice that large chunks of online content disappear each year.
For years, archivists have recognized and worked to address the problem. Websites going dark is an annoyance for most of us, but in some cases, it can have real implications for understanding recent history, even as officially documented. For example, back in 2013, a report revealed that well over half of the websites linked to in Supreme Court opinions no longer work, jeopardizing our understanding of how and why the Court decided an issue.
The most well-known bulwark against disappearing internet content is the Internet Archive, which has, at this point, archived over 900 billion web pages. Over and over again, we’ve seen its Wayback Machine used to shine a light on history that powerful people would rather have hidden. It’s also why the Wayback Machine has been blocked or threatened at various times in China, Russia, India, and other jurisdictions where free expression protections are weak.
It’s not just the open web that is disappearing. A recent report on the problem of “Vanishing Culture” highlights how this challenge pervades modern cultural works. Everything from 90s shareware video games to the entirety of the MTV News Archive is at risk. As Jordan Mechner, a contributor to the report, explains, “historical oblivion is the default, not the exception” to the human record. As the report explains, it’s not just disappearing content that poses a problem: libraries and consumers must also grapple with electronic content that can be remotely changed by publishers or others. As just one example among many, in just the last few years we’ve seen surreptitious modifications to ebooks on readers’ devices, some changing important aspects of the plot, for works by authors such as R.L. Stine, Roald Dahl, and Agatha Christie.
The case for preservation as a foundational necessity to combat censorship is straightforward. “There is no political power without power over the archive,” Jacques Derrida reminds us. Without access to a stable, high-fidelity copy of the historical record, there can be no meaningful reflection on what went right or wrong, or holding to account those in power who may oppose an accurate representation of their past.
What sometimes goes unnoticed is that, without fair use, a large portion of these preservation efforts would be illegal.
In a world where century-long copyright protection applies automatically to any human expression with even a “modicum of creativity,” virtually everything created in the last century is subject to copyright. This is a problem for digital works because practically any preservation effort involves making copies—often lots of them—to ensure the integrity of the content. Making those copies means that archivists must rely on fair use to preserve these works and make them available in meaningful ways to researchers and others.
The upshot is that every time the Internet Archive archives a website, it’s an act of faith in fair use. Is that faith well-founded?
I think so. But the answer is complicated.
For preservation efforts like those of the Internet Archive, fair use is a foundation, but not an unshakable one. Two recent cases highlight the risk, one against its book lending program and the other objecting to its “Great 78” record project. Both take issue with how the Archive provides access to preserved digital copies in its collections. While not directly attacking the preservation of those materials, the suits jeopardize their effective use. As archivists have long lamented, “preservation without access is pointless.”
Beyond direct challenges to fair use, archives are threatened by spurious takedown demands, content removal requests, and legal challenges. Organizations like the Internet Archive have fought back, but many institutions simply cannot afford to, leading to a chilling effect where preservation efforts are scaled back or abandoned altogether.
Compounding this uncertainty is the growing use of technological protection measures (TPMs) and digital rights management (DRM) systems that restrict access to digital works. Under the Digital Millennium Copyright Act (DMCA), circumventing these restrictions is illegal—even for lawful purposes like preservation or research. This creates a paradox where a researcher or archivist may have a clear fair use justification for accessing and copying a work, but breaking an encryption lock to do so could expose them to legal liability.
Additionally, the rise of contractual overrides—such as restrictive licensing agreements on digital platforms—threatens to sideline fair use entirely. Many modern works, including e-books, streaming media, and even scholarly databases, are governed by terms of service that explicitly prohibit copying or analysis, even for noncommercial research. These contracts often supersede fair use rights, leaving archivists and researchers with no legal recourse.
Still, there are reasons for optimism. Courts have generally ruled favorably when fair use is invoked for transformative purposes, such as digitization for research, searchability, and access for disabled users. Landmark decisions, like those in Authors Guild v. Google and Authors Guild v. HathiTrust, upheld fair use in the context of large-scale digital libraries and text-mining projects. These cases suggest that courts recognize the essential role fair use plays in making knowledge accessible, particularly in an era of vast digital information.
Fair Use and the Freedom to Extract
One of copyright’s other traditional First Amendment protections is that the copyright monopoly does not extend to facts or ideas. Fair use is critical in giving life to this protection by ensuring that facts and ideas remain accessible, providing a “freedom to extract” (a term I borrow from law professor Molly Van Houweling’s recent scholarship) even when they are embedded within copyrighted works.
Copyright does not and cannot grant exclusive control over facts, but in practice, extracting those facts often requires using the work in ways that implicate the rightsholder’s copyright. Whether journalists referencing past reporting, historians identifying truths in archival materials, or researchers analyzing a vast corpus of written works, fair use provides the necessary legal space to operate without running afoul of copyright protections for rightsholders.
The need is more urgent than ever given the sheer scale of the modern historical record. In many cases, relying on individual researchers to sift through the record and extract important facts is impractical, if not impossible. Automated tools and processes, including AI and text and data mining tools, are now indispensable for processing, retrieving, and analyzing facts from massive amounts of text, images, and audio. From uncovering patterns in historical archives to verifying political statements against prior records, these tools serve as extensions of human analysis, making the extraction of factual information possible at an unprecedented scale. However, these technologies depend on fair use. If every instance of text or data mining required explicit permission from rights holders, who may have economic or political incentives to deny access, the ability to conduct meaningful research and discovery would be crippled.
For example, consider a researcher studying the roots of the opioid crisis, trying to mine the 4 million documents in the Opioid Industry Documents Archive—many of them legal materials, internal company communications, and regulatory filings. These documents, made public through litigation, provide critical insights into how pharmaceutical companies marketed opioids, downplayed their risks, and shaped public policy. But making sense of such a massive trove of records is impossible without computational tools that can analyze trends, track key players, and surface hidden patterns.
Without fair use, researchers could face legal roadblocks to applying text and data mining techniques to extract the facts buried within these documents. If copyright law were used to restrict or complicate access to these records, it would not only hamper academic research but also shield corporate and governmental actors from exposure and accountability.
Conclusion
As information continues to proliferate across digital media, fair use remains one of the few safeguards ensuring that historical records and cultural artifacts do not become permanently locked away behind copyright barriers. It allows the past to be examined, challenged, and understood. If we allow excessive copyright restrictions to limit the ability to extract and analyze our shared past and culture, we risk not only stifling innovation but also eroding our collective ability to engage with history and truth.
Fair Use Week
This is my contribution to Fair Use Week. To read the other excellent posts from this week, check out Kyle Courtney’s Harvard Library Fair Use Week blog here.
Uncopyrightable image generated using Google Gemini, illustrating a group of photographers excited to learn that their nearly identical photos of the public domain Washington Monument are all copyrightable. (“The Office receives ten applications, one from each member of a local photography club. All of the photographs depict the Washington Monument and all of them were taken on the same afternoon. Although some of the photographs are remarkably similar in perspective, the registration specialist will register all of the claims.”) (Compendium of Copyright Office Practices, Section 909.1)
In our comments, we urged the Copyright Office to not pursue revisions to the Copyright Act at this time and instead work towards providing greater clarity for authors of AI-generated and AI-assisted works (“Instead of proposing revisions to the Copyright Act to enshrine the human authorship requirement in law or clarify the human authorship requirement in the context of AI-generated works, the Office should continue to promulgate guidance for would-be registrants.”) We also noted that, as technology evolves in the coming years, our ideas about the copyrightability of AI-generated and AI-assisted works will likely shift as well.
We are happy to see that the USCO heard us and the many others who saw no need for legislative change at this time (“The vast majority of commenters agreed that existing law is adequate in this area…”) (Report, page ii). We likewise continue to be aligned with the USCO’s view that works wholly generated by Artificial Intelligence are not copyrightable. In reading through the entirety of the report, it is clear that the Office appreciates that some elements of AI-assisted works will be copyrightable, but believes that the level of human control over the AI output will be central to the copyrightability inquiry (“Whether human contributions to AI-generated outputs are sufficient to constitute authorship must be analyzed on a case-by-case basis.”) (“Based on the functioning of current generally available technology, prompts do not alone provide sufficient control.”) (Report, page iii)
The Office’s report does provide some useful clarity. At the same time, it takes some positions that fail to adequately address the complexity of AI-generated works. Below, we will unpack a number of elements of the report that are noteworthy.
Modifying or arranging AI-generated content
The report makes it clear that the USCO views selection and arrangement of AI-generated work as a viable path towards copyrightability of works where AI was an element in the creation of the work. In 2023, when reviewing the graphic novel Zarya of the Dawn, “the Office concluded that a graphic novel comprised of human-authored text combined with images generated by the AI service Midjourney constituted a copyrightable work, but that the individual images themselves could not be protected by copyright.” (Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence, page 2) Thus, authors who incorporate AI-generated work into a larger work will often be successful in registering the whole work, but will typically need to disclaim any AI-generated elements.
Alternatively, an author who modifies an AI-generated work outside of the AI environment (e.g., an artist who uses Photoshop to make substantial modifications to an AI-generated image) will usually have a path to copyright registration with the USCO.
The USCO takes the position that most AI-assisted works are not copyrightable
Unlike an AI-generated image later modified manually by a human (which may be copyrightable), when prompt-based modifications to AI-generated works are performed entirely within the AI environment, it is clear that the USCO is reluctant to view the resulting work as copyrightable.
Here, the Office’s position regarding Jason Allen’s attempts to register copyright in the two dimensional artwork Théâtre D’opéra Spatial is illuminating. In developing the image using Midjourney, Allen claimed to have used over 600 text prompts to both generate and alter the image, and further used Photoshop to “beautify and adjust various cosmetic details/flaws/artifacts, etc.,” a process which he viewed as copyrightable authorship. In denying his claim, the Office responded that “when an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the ‘traditional elements of authorship’ are determined and executed by the technology—not the human user.” (88 FR 16190 – Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence, page 16192).
Within the report, there is no direct examination of the Théâtre D’opéra Spatial copyright claim and lessons to be learned from it. This is likely due to ongoing litigation between Allen and the USCO. While the USCO has significant practical influence on what materials are protectable under copyright, ultimately the decision falls to the courts. So, this suit and others like it will be important to watch. Still, the lack of a deeper dive into such a real-world example is unfortunate—such examples offer fertile territory for exploring the boundary lines between copyrightable AI-assisted works and those that will remain uncopyrightable.
The report offers a sense of possibility with regard to copyrightable AI-assisted works
Yet, the Office also acknowledges that there are remaining unanswered questions (“So I know that everyone in their particular area of creativity is looking for, you know, more examples and brighter lines. And I think at this point in time, we’re going to be learning as everyone else is learning…we will be providing more guidance as we learn more.”) (Webinar Transcript, Robert Kasunic, page 10) This recognition that the USCO, like everyone, is still learning is refreshing and welcome, given that it’s fairly easy to see that there are murky waters all around. AI-generated works are already frequently a complex hybrid of AI expression and human expression.
What are some of these questions?
The technology is still developing and it seems likely that the legal complexity will become even more pronounced as sophisticated generative AI evolves to respond to fine-grained feedback from users, while also offering expression and suggestions that many users will ultimately adopt. Navigating this complexity will be challenging and will require answering a fundamental question: what is the threshold level of human control over AI-generated expression that is necessary as a prerequisite for copyright protection?
Similarly, what standards might the Copyright Office or the courts develop to prove sufficient human authorship when it is intermingled with AI-generated content? The copyright registration process currently requires very little information and no documentation related to this question. For now, creators don’t have clear guidance on what types of documentation will be most effective if a future dispute arises.
To the extent that protection does exist in human-guided but AI-produced content, how will or should the courts determine what are uncopyrightable, AI-generated elements in what will appear to users as a single unified work? Separating human expression that is enmeshed and embedded within uncopyrightable AI expression will require some framework for distinguishing the two in cases of infringement. Although the courts have already developed methods that may shape this (selection, filtration, abstraction, for example), it remains far from clear whether such tests will perform adequately for AI-produced content.
We will be watching developments in this space closely and will continue to advocate for reasonable and flexible approaches to copyrightability that align with the practical realities of authorship in an emerging technological landscape.
On February 11, Third Circuit Judge Stephanos Bibas (sitting by designation for the U.S. District Court of Delaware) issued a new summary judgment ruling in Thomson Reuters v. ROSS Intelligence. He overruled his previous decision from 2023 which held that a jury must decide the fair use question. The decision was one of the first to address fair use in the context of AI, though the facts of this case differ significantly from the many other pending AI copyright suits.
This ruling focuses on copyright infringement claims brought by Thomson Reuters (TR), the owner of Westlaw, a major legal research platform, against ROSS Intelligence. TR alleged that ROSS improperly used Westlaw’s headnotes and the Key Number System to train its AI system to better match legal questions with relevant case law.
Westlaw’s headnotes summarize legal principles extracted from judicial opinions. (Note: Judicial opinions are not copyrightable in the US.) The Key Number System is a numerical taxonomy categorizing legal topics and cases. Clicking on a headnote takes users to the corresponding passage in the judicial text. Clicking on the key number associated with a headnote takes users to a list of cases that make the same legal point.
Importantly, ROSS did not directly ingest the headnotes and the Key Number System to train its model. Instead, ROSS hired LegalEase, a company that provides legal research and writing services, to create training data based on the headnotes and the Key Number System. LegalEase created Bulk Memos—a collection of legal questions paired with four to six possible answers. LegalEase instructed lawyers to use Westlaw headnotes as a reference to formulate the questions in Bulk Memos. LegalEase instructed the lawyers not to copy the headnotes directly.
ROSS attempted to license the necessary content directly from TR, but TR refused to grant a license because it thought the AI tool contemplated by ROSS would compete with Westlaw.
The court found that ROSS copied 2,243 headnotes from Westlaw. The court ruled that these headnotes and the Key Number System met the low legal threshold for originality and were copyrightable. The court rejected the merger and scènes à faire defenses raised by ROSS, because, according to the court, the headnotes and the Key Number System were not dictated by necessity. The court also rejected ROSS’s fair use defense on the grounds that the 1st and 4th factors weighed in favor of TR. At this point, the only remaining issue for trial is whether some headnotes’ copyrights had expired or were untimely registered.
The new ruling has drawn mixed reactions—some saying it undermines potential fair use defenses in other AI cases, while others dismiss its significance since its facts are unique. In our view, the opinion is poorly reasoned and disregards well-established case law. Future AI cases must demonstrate why the ROSS Court’s approach is unpersuasive. Here are three key flaws we see in the ruling.
Problems with the Opinion
Near-Verbatim Summaries are “Original”?
“A block of raw marble, like a judicial opinion, is not copyrightable. Yet a sculptor creates a sculpture by choosing what to cut away and what to leave in place. … A headnote is a short, key point of law chiseled out of a lengthy judicial opinion.”
— the ROSS court
(↑example of a headnote and the uncopyrightable judicial text the headnote was based on↑)
The court claims that the Westlaw headnotes are original both individually and as a compilation, and the Key Number System is original and protected as a compilation.
“Original” has a special meaning in US copyright law: It means that a work has a modicum of human creativity that our society would want to protect and encourage. Based on the evidence that survived redaction, it is nearly impossible to find creativity in any individual headnote. The headnotes consist of verbatim copying of uncopyrightable judicial texts, along with some basic paraphrasing of facts.
As we know, facts are not copyrightable, but expressions of facts often are. One important safeguard for protecting our freedom to reference facts is the merger doctrine. US law has long recognized that when there are only limited ways to express a fact or an idea, those expressions are not considered “original.” The expressions “merge” with the underlying unprotectable fact, and become unprotectable themselves.
Judge Bibas gets merger wrong—he claims merger does not apply here because “there are many ways to express points of law from judicial opinions.” This view misunderstands the merger doctrine. It is the nature of human language to be capable of conveying the same thing in many different ways, as long as you are willing to do some verbal acrobatics. But when there are only a limited number of reasonable, natural ways to express a fact or idea—especially when textual precision and terms of art are used to convey complex ideas—merger applies.
There are many good reasons for this to be the law. For one, this is how we avoid giving copyright protection to concise expression of ideas. Fundamentally, we do not need to use copyright to incentivize the simple restatement of facts. As the Constitution intended, copyright law is designed to encourage creativity, not to grant exclusive rights to basic expressions of facts. We want people to state facts accurately and concisely. If we allowed the first person to describe a judicial text in a natural, succinct way to claim exclusive rights over that expression, it would hinder, rather than facilitate, meaningful discussion of said text, and stifle blog posts like this one.
As to the selection and arrangement of the Key Number System, the court claims that originality exists here, too, because “there are many possible, logical ways to organize legal topics by level of granularity,” and TR exercised some judgment in choosing the particular “level” with its Key Number System. However, the cases are tagged under the Key Number System by an automated computer system, and the topics closely mirror what law schools teach their first-year students.
The court does not say much about why the compilation of the headnotes should receive separate copyright protection, other than that it qualifies as original “factual compilations.” This claim is dubious because the compilation is of uncopyrightable materials, as discussed, and the selection is driven by the necessity to represent facts and law, not by creativity. Even if the compilation of headnotes is indeed copyrightable, using portions of it that are uncopyrightable is decidedly not an infringement, because the US does not protect sui generis database rights.
Can’t Claim Fair Use When Nobody Saw a Copy?
“[The intermediate-copying cases] are all about copying computer code. This case is not.”
— the ROSS court conveniently ignoring Bellsouth Advertising & Publishing Corp. v. Donnelley Information Publishing, Inc., 933 F.2d 952 (11th Cir. 1991) and Sundeman v. Seajay Society, Inc., 142 F. 3d 194 (4th Cir. 1998).
In deciding whether ROSS’s use of Westlaw’s headnotes and the Key Number System is transformative under the 1st factor, the court took a moment to consider whether the available intermediate copying case law is in favor of ROSS, and quickly decided against it.
Even though no consumer ever saw the headnotes or the Key Number System in the AI products offered by ROSS, the court claims that copying them constitutes copyright infringement because there existed an intermediate copy that contained copyright-restricted materials authored by Westlaw. And, according to the court, intermediate copying can only weigh in favor of fair use when what is copied is computer code.
Before turning to the actual case law the court is overlooking here, we wonder if Judge Bibas is in fact unpersuaded by his own argument: under the 3rd fair use factor, he admits that only the content made accessible to the public should be taken into consideration when deciding what amount is taken from a copyrighted work compared to the copyrighted work as a whole, which is contrary to what he argues under the 1st factor—that we must examine non-public intermediate copies.
Intermediate copying is the process of producing a preliminary, non-public work as an interim step in the creation of a new public-facing work. It is well established under US jurisprudence that any type of copying, whether private or public, satisfies a prima facie copyright infringement claim, but the fact that a work was never shared publicly, and was never intended to be shared publicly, strongly favors fair use. For example, in Bellsouth Advertising & Publishing Corp. v. Donnelley Information Publishing, Inc., the Eleventh Circuit decided that directly copying a competitor’s yellow pages business directory in order to produce a competing yellow pages was fair use when the resulting publicly accessible yellow pages the defendant created did not directly incorporate the plaintiff’s work. Similarly, in Sundeman v. Seajay Society, Inc., the Fourth Circuit concluded that it was fair use when the Seajay Society made an intermediate copy of the plaintiffs’ entire unpublished manuscript for a scholar to study and write about. The scholar wrote several articles about it, mostly summarizing important facts and ideas (while also using short quotations).
There are many good reasons for allowing intermediate copying. Clearly, we do not want ALL unlicensed copies to be subject to copyright infringement lawsuits, particularly when intermediate copies are made in order to extract unprotectable facts or ideas. More generally, intermediate copying is important to protect because it helps authors and artists create new copyrighted works (e.g., sketching a famous painting to learn a new style, translating a passage to practice your language skills, copying the photo of a politician to create a parody print t-shirt).
Suddenly, We Have an AI Training Market?
“[I]t does not matter whether Thomson Reuters has used [the headnotes and the Key Number System] to train its own legal search tools; the effect on a potential market for AI training data is enough.”
— the ROSS court
The 4th fair use factor is very much susceptible to circular reasoning: if a user is making a derivative use of my work, surely that proves a market already exists or will likely develop for that derivative use, and, if a market exists for such a derivative use, then, as the copyright holder, I should have absolute control over such a market.
The ROSS court runs full tilt into this circular trap. In the eyes of the court, ROSS, by virtue of using Westlaw’s data in the context of AI training, has created a legitimate AI training data market that should be rightfully controlled by TR.
But our case law suggests that the 4th factor “market substitution” analysis considers only markets that are traditional, reasonable, or likely to be developed. As we have already pointed out in a previous blog post, copyright holders must offer concrete evidence to prove the existence, or the likely development, of a licensing market before they can argue that a secondary use serves as a “market substitute.” If we allowed a copyright holder’s protected market to include everything for which they are willing to receive licensing fees, it would all but wipe out fair use in the service of stifling competition.
Conclusion
The impact of this case is currently limited, both because it is a district court ruling and because it concerns non-generative AI. However, it is important to remain vigilant, as the reasoning put forth by the ROSS court could influence other judges, policymakers, and even the broader public, if left unchallenged.
This ruling combines several problematic arguments that, if accepted more widely, could have significant consequences. First, it blurs the line between fact and expression, suggesting that factual information can become copyrightable simply by being written down by someone in a minimally creative way. Second, it expands copyright enforcement to intermediate copies, meaning that even temporary, non-public use of copyrighted material could be subject to infringement claims. Third, it conjures up a new market for AI training data, regardless of whether such a licensing market is legitimate or even likely to exist.
If these arguments gain traction, they could further entrench the dominance of a few large AI companies. Only major players like Microsoft and Meta will be able to afford AI training licenses, consolidating control over the industry. The AI training licensing terms will be determined solely between big AI companies and big content aggregators, without representation of individual authors or the public interest. The large content aggregators will get to dictate the terms under which creators must surrender rights to their works for AI training, and the AI companies will dictate how their AI models can be used by the general public.
Without meaningful pushback and policy intervention, smaller organizations and individual creators cannot participate fairly. Let’s not rewrite our copyright laws to entrench this power imbalance even further.
Today, we’re pleased to announce a new project generously supported by the John S. and James L. Knight Foundation. The project, “Artificial Intelligence, Authorship, and the Public Interest,” aims to identify, clarify, and offer answers to some of the most challenging copyright questions posed by artificial intelligence (AI) and explain how this new technology can best advance knowledge and serve the public interest.
Artificial intelligence has dominated public conversation about the future of authorship and creativity for several years. Questions abound about how this technology will affect creators’ incentives, influence readership, and what it might mean for future research and learning.
At the heart of these questions is copyright law. Over two dozen class-action copyright lawsuits have been filed between November 2022 and today against companies such as Microsoft, Google, OpenAI, Meta, and others. Additionally, congressional leadership, state legislatures, and regulatory agencies have held dozens of hearings to reconcile existing intellectual property law with artificial intelligence. As one of the primary legal mechanisms for promoting the “progress of science and the useful arts,” copyright law plays a critical role in creating, producing, and disseminating information.
We are convinced that how policymakers shape copyright law in response to AI will have a lasting impact on whether and how the law supports democratic values and serves the common good. That is why Authors Alliance has already devoted considerable effort to these issues, and this project will allow us to expand those efforts at this critical moment.
AI Legal Fellow
As part of the project, we’re pleased to add an AI Legal Fellow to our team. The position requires a law degree and demonstrated interest and experience with artificial intelligence, intellectual property, and legal technology issues. We’re particularly interested in someone with a demonstrated interest in how copyright law can serve the public interest. This role will require significant research and writing. Pay is $90,000/yr, and it is a two-year term position. Read more about the position here. We’ll begin reviewing applications immediately and conduct interviews on a rolling basis until the position is filled.
As we get going, we’ll have much more to say about this project. We will have some funds available to support research subgrants, organize several workshops and symposia, and offer numerous opportunities for public engagement.
About the John S. and James L. Knight Foundation
We are social investors who support democracy by funding free expression and journalism, arts and culture in community, research in areas of media and democracy, and in the success of American cities and towns where the Knight brothers once had newspapers. Learn more at kf.org and follow @knightfdn on social media.