Over the last year, we’ve seen a number of major deals inked between companies like OpenAI and news publishers. In July 2023, OpenAI entered into a two-year deal with The Associated Press for ChatGPT to ingest the publisher’s news stories. In December 2023, OpenAI announced its first non-US partnership to train ChatGPT on German publisher Axel Springer’s content, including Business Insider. This was then followed by a similar deal in March 2024 with Le Monde and Prisa Media, news publishers from France and Spain. These partnerships are likely sought in an effort to avoid litigation like the case OpenAI and Microsoft are currently defending from the New York Times.
As it turns out, such deals are not limited to OpenAI or newsrooms. Book publishers have also gotten into the mix. Numerous reports recently pointed out that based on Taylor and Francis’s parent company’s market update, the British academic publishing giant has agreed to a $10 million USD AI training deal with Microsoft. Earlier this year, another major academic publisher, John Wiley and Sons, recorded $23 million in one-time revenue from a similar deal with a non-disclosed tech company. Meta even considered buying Simon & Schuster or paying $10 per book to acquire its rights portfolio for AI training.
With few exceptions (a notable one being Cambridge University Press), publishers have not bothered to ask their authors whether they approve of these agreements.
Does AI training require licensing to begin with?
First, it’s worth appreciating that these deals are made against a backdrop of some legal uncertainty. There are more than two dozen AI copyright lawsuits just in the United States, most of them turning on one key question: whether AI developers should have to obtain permission to scrape content to train AI models or whether fair use already allows this kind of training use even without permission.
The arguments for and against fair use for AI training data are well explained elsewhere. We think there are strong arguments, based on cases like Authors Guild v. Google, Authors Guild v. HathiTrust, and AV ex rel Vanderhye v. iParadigms, that the courts will conclude that copying to train AI models is fair use. We also think there are really good policy reasons to think this could be a good outcome if we want to encourage future AI development that isn’t dominated by only the biggest tech giants and that results in systems that produce less biased outputs. But we won’t know for sure whether fair use covers any and all AI training until some of these cases are resolved.
Even if you are firmly convinced that fair use protects this kind of use (and AI model developers have strong incentives to hold this belief), there are lots of other reasons why AI developers might seek licenses in order to navigate around the issue. These include very practical reasons, like securing access to content in formats that make training easier, or content accompanied by structured, enhanced metadata. Given the pending lawsuits, licenses are also a good way to avoid costly litigation (copyright lawsuits are expensive, even if you win).
Although one can hardly blame these tech companies for making a smart business decision to avoid potential litigation, this could have a larger systemic impact on other players in the field, including the academic researchers who would like to rely on fair use to train AI. As IP scholar James Gibson explains, when risk-averse users create new licensing markets in gray areas of copyright law, copyright holders’ exclusive rights expand, and the public interest diminishes. The less we rely on fair use, the weaker it becomes.
Finally, it’s worth noting that fair use is only available in the US and a few other jurisdictions. In other jurisdictions, such as within the EU, using copyrighted materials for AI training (especially for commercial purposes) may require a license.
To sum up: even though it may not be legally necessary to acquire copyright licenses for AI training, it seems that licensing deals between publishers and AI companies are highly likely to continue.
So, can publishers just do this without asking authors?
In a lot of cases, yes, publishers can license AI training rights without asking authors first. Many publishing contracts include a full and broad grant of rights–sometimes even a full transfer of copyright to the publisher for them to exploit those rights and to license the rights to third parties. For example, this typical academic publishing agreement provides that “The undersigned authors transfer all copyright ownership in and relating to the Work, in all forms and media, to the Proprietor in the event that the Work is published.” In such cases, when the publisher becomes the de facto copyright holder of a work, it’s difficult for authors to stake a copyright claim when their works are being sold to train AI.
Not all publishing contracts are so broad, however. For example, in the Model Publishing Contract for Digital Scholarship (which we have endorsed), the publisher’s sublicensing rights are limited and specifically defined, and profits resulting from any exploitation of a work must be shared with authors.
There are lots of variations, and specific terms matter. Some publisher agreements are far more limited–transferring only limited publishing and subsidiary rights. These limitations in the past have prompted litigation over whether the publisher or the author gets to control rights for new technological uses. Results have been highly dependent on the specific contract language used.
There are also instances where publishers aren’t even sure of what they own. For example, in the drawn-out copyright lawsuit brought by Cambridge University Press, Oxford University Press and Sage against Georgia State University, the court dropped 16 of the alleged 74 claimed instances of infringement because the publishers couldn’t produce documentation that they actually owned rights in the works they were suing over. This same lack of clarity contributed to the litigation and proposed settlement in the Google Books case, which is probably our closest analogy in terms of mass digitization and reuse of books (for a good discussion of these issues, see page 479 of this law review article by Pamela Samuelson about the Google Books settlement).
This is further complicated by the fact that authors are sometimes entitled to reclaim their rights, such as through a rights reversion clause or copyright termination. Just because a publisher can produce the documentation of a copyright assignment, it does not necessarily mean that the publisher is still the current copyright holder of a work.
We think it is certainly reasonable to be skeptical about the validity of blanket licensing schemes between large corporate rights holders and AI companies, at least when they are done at very large scale. Even though in some instances publishers do hold rights to license AI training, it is dubious whether they actually hold, and sufficiently document, all of the purported rights of all works being licensed for AI training.
Can authors at least insist on a cut of the profit?
It can feel pretty bad to discover that massively profitable publishers are raking in yet more money by selling licensing rights to your work, while you’re cut out of the picture. If they’re making money, why not the author?
It’s worth pointing out that, at least for academic authors, this isn’t exactly a novel situation–most academic authors make very little in royalties on their books, and nothing at all on their articles, while commercial publishers like Elsevier, Wiley, and SpringerNature sell subscription access at a healthy profit. Unless you have retained sublicensing rights, or your publishing contract has a profit-sharing clause, you are unfortunately not likely to profit from the budding licensing market for AI training.
So what are authors to do?
We could probably start most posts like this with a big red banner that says “READ YOUR PUBLISHING CONTRACT!! (and negotiate it too).” Be on the lookout for what you are authorizing your publisher to do with your rights, and for any language in the contract about reuse or the sublicensing of subsidiary rights.
You might also want to look for terms in your contract that speak to royalties and shares of licensing revenue. Some contracts have language that will allow you to demand an accounting of royalties; this may be an effective means of learning more about licensing deals associated with your work.
You can also take a closer look at clauses that allow you to revert rights–many contracts will include a clause under which authors can regain rights when their book falls below a certain sales threshold or otherwise becomes “out of print.” Even without such clauses, it is reasonable for authors to negotiate a reversion of rights when their books are no longer generating revenue. Our resources on reversion will give you a more in-depth look at this issue.
Finally, you can voice your support for fair use in the context of licensing copyrighted works for AI training. We think fair use is especially important to preserve for non-commercial uses. For example, academic uses could be substantially stifled if paid-for licensing for permission to engage in AI research or related uses becomes the norm. And in those cases, the windfall publishers hope to pocket isn’t coming from some tech giant, but ultimately is at the expense of researchers, their libraries and universities, and the public funding that goes to support them.