
On March 3, librarians, authors, publishers, and technologists gathered at Northeastern University Library in Boston to contribute to a startup plan for The Public Interest Corpus. The Public Interest Corpus is focused on supporting the creation of high-quality AI training data from memory organizations (e.g., libraries, archives, museums) and their partners (e.g., publishers) that advance the public interest. For too long, access to high-quality training data has been limited to the world’s most well-resourced organizations, pushing others toward data that is lower in quality, less comprehensive, and of dubious legality. Over the course of our day together, event participants made strong contributions to the development of The Public Interest Corpus startup plan.
Refining Principles and Goals
We began the day with an exercise focused on refining The Public Interest Corpus principles and goals. We felt this was a good place to start given that principles and goals held in common provide the foundation for collective action. Participants contributed a broad range of comments, edits, and suggestions that greatly strengthened project principles and goals. The project team is in the process of versioning this document and will have more to share down the line.
In the interim, we share some takeaways:
- The Public Interest Corpus should maximize transparency. Participants called for transparency around corpus composition and emphasized how this could support reproducible research and the development of AI that is pluralistic and grounded in particular social contexts.
- “Public Interest” is a compelling framing that needs to be made more concrete. Participants expressed a need for more specificity regarding target user communities served by The Public Interest Corpus and strategies that effectively balance public interests and commercial interests in The Public Interest Corpus.
Workshopping Core Challenges
Following the principles and goals activity, we broke participants into mixed stakeholder (author, publisher, legal expert, librarian, technical expert) groups. Group composition was shuffled once more in the afternoon to encourage continued novelty in ideation. Each group was presented with a set of questions to respond to that aligned with the following challenge areas: (1) Target Audiences, Training Data Needs, Potential Partnerships, (2) Legal and Policy, and (3) Business Model, Sustainability, and Governance. As with the principles and goals exercise, the project team is actively processing the product of group activity and will have more to share in the future.
In the interim, we share some takeaways:
Target Audiences, Training Data Needs, Potential Partnerships
- Accessing in-copyright books for AI training purposes is extremely difficult for researchers. Challenges include but are not limited to the downstream impact of contractual override, organizational uncertainty in making fair use determinations, and multiple active court cases in the United States testing the argument that AI training is a fair use.
- Focus on simple solutions. Multiple participants suggested that focusing on a simple solution was the best path forward – i.e., identify the most compelling minimum viable product and deliver on it. A solution could become more complex over time through phased development informed by user community studies.
Legal and Policy Challenges
- Balancing what is legally permissible vs. meeting normative community expectations is essential. While it may be the case that AI training on in-copyright works is a fair use, this does not mean that a proposed solution should make works available for training without author or publisher engagement. This effort can learn from engagement with author communities to assess their views and preferences regarding the use of their work for AI training purposes. Authors Alliance continually engages with authors on this issue.
- Pending court cases do not pose an insurmountable barrier to a solution. Though active AI copyright litigation is likely to continue for many years, participants believe a range of strategies can be pursued to mitigate legal risk and support development of a solution that advances the public interest.
Business Model, Sustainability, and Governance
- The Public Interest Corpus should develop multiple prospective business models and test for viability with stakeholders. Participants indicated that it would be useful for stakeholders to engage with a range of business models with different revenue streams – e.g., membership model, philanthropically supported, commercially supported, hybrid, etc. Participants suggested paths forward that lead either to the creation of a standalone organization or to integration within an existing organization.
- With an eye toward mission and policy alignment, business models should differentiate between noncommercial and commercial use of The Public Interest Corpus. Potential service costs should be responsive to resource disparities between noncommercial and commercial users. Potential service costs should also be responsive to resource variation within a prospective noncommercial user base.
Next Steps
We plan to continue engaging core challenges with stakeholders at our next workshop, to be held in July 2025 in New York City. If you work in the region and are interested in potentially attending, please let us know here.
In addition to the July 2025 workshop, we will present on the project and/or hold additional working events across North America. The next presentation will be at the Coalition for Networked Information meeting in Milwaukee in April. To keep track of future community engagements, please refer to our engagements page.