Author Archives: Thomas Padilla

The Public Interest Corpus Update – Boston Edition 

Posted March 26, 2025

On March 3, librarians, authors, publishers, and technologists gathered at Northeastern University Library in Boston to contribute to a startup plan for The Public Interest Corpus. The Public Interest Corpus is focused on supporting the creation of high-quality AI training data from memory organizations (e.g., libraries, archives, museums) and their partners (e.g., publishers) that advance the public interest. For too long, access to high-quality training data has been limited to the world’s most well-resourced organizations, pushing others toward data of lesser quality, limited comprehensiveness, and dubious legality. Over the course of our day together, event participants made strong contributions to the development of The Public Interest Corpus startup plan.

Refining Principles and Goals

We began the day with an exercise focused on refining The Public Interest Corpus principles and goals. We felt this was a good place to start given that principles and goals held in common provide the foundation for collective action. Participants contributed a broad range of comments, edits, and suggestions that greatly strengthened project principles and goals. The project team is in the process of versioning this document and will have more to share down the line. 

In the interim, we share some takeaways:

  • The Public Interest Corpus should maximize transparency. Participants called for transparency around corpus composition and emphasized how this could support reproducible research and the development of AI that is pluralistic and grounded in particular social contexts. 
  • “Public Interest” is a compelling framing that needs to be made more concrete. Participants expressed a need for more specificity regarding target user communities served by The Public Interest Corpus and strategies that effectively balance public interests and commercial interests in The Public Interest Corpus. 

Workshopping Core Challenges

Following the principles and goals activity, we broke participants into mixed stakeholder (author, publisher, legal expert, librarian, technical expert) groups. Group composition was shuffled once more in the afternoon to encourage continued novelty in ideation. Each group responded to a set of questions aligned with the following challenge areas: (1) Target Audiences, Training Data Needs, Potential Partnerships, (2) Legal and Policy, and (3) Business Model, Sustainability, and Governance. As with the principles and goals exercise, the project team is actively processing the product of group activity and will have more to share in the future.

In the interim, we share some takeaways: 

Target Audiences, Training Data Needs, Potential Partnerships

  • Accessing in-copyright books for AI training purposes is extremely difficult for researchers. Challenges include but are not limited to the downstream impact of contractual override, organizational uncertainty in making fair use determinations, and multiple active court cases in the United States testing whether AI training is a fair use.
  • Focus on simple solutions. Multiple participants suggested that focusing on a simple solution was the best path forward – i.e., identify the most compelling minimum viable product and deliver on it. A solution could become more complex over time through phased development informed by user community studies. 

Legal and Policy Challenges

  • Balancing what is legally permissible vs. meeting normative community expectations is essential. While it may be the case that AI training on in-copyright works is a fair use, this does not mean that a proposed solution should make works available for training without author or publisher engagement. This effort can learn from engagement with author communities to assess their views and preferences regarding the use of their work for AI training purposes. Authors Alliance continually engages with authors on this issue. 
  • Pending court cases do not pose an insurmountable barrier to a solution. Though active copyright AI litigation is likely to continue for many years, participants believe there is a range of strategies that can be pursued to mitigate legal risk and support development of a solution that advances the public interest.

Business Model, Sustainability, and Governance 

  • The Public Interest Corpus should develop multiple prospective business models and test for viability with stakeholders. Participants indicated that it would be useful for stakeholders to engage with a range of business models with different revenue streams – e.g., membership model, philanthropically supported, commercially supported, hybrid, etc. Participants suggested paths forward that lead to the creation of a standalone organization or integration within an existing organization.
  • With an eye toward mission and policy alignment, business models should differentiate between non-commercial and commercial use of The Public Interest Corpus. Potential service costs should be responsive to resource disparities between non-commercial and commercial users. Potential service costs should also be responsive to resource variation within a prospective non-commercial user base.

Next Steps 

We plan to continue engaging core challenges with stakeholders at our next workshop, to be held July 2025 in New York City. If you work in the region and are interested in potentially attending, please let us know here.

In addition to the July 2025 workshop, we will present on the project and/or hold additional working events across North America. The next presentation will be at the Coalition for Networked Information meeting in Milwaukee in April. To keep track of future community engagements, please refer to our engagements page.

The Public Interest Corpus: An Update and Opportunities for Co-Development 

Posted February 24, 2025

In December 2024 we announced a new project to develop a public interest AI training corpus focused on books. Over the last few months we’ve been actively engaging a diverse set of stakeholders in the development of The Public Interest Corpus.

The Public Interest Corpus is focused on developing large-scale, high-quality AI training data from the world’s memory organizations that serve the public interest. In the aggregate, memory organizations like libraries and archives are in a prime position to address this need given a multi-century focus on developing high-quality, locally and globally comprehensive collections of books, newspapers, scholarly journals, photographs, manuscript materials, and more. We seek to prioritize uses of The Public Interest Corpus that promote learning, access to knowledge, and broad benefits to the public. 

Project Team and Advisory Board

The project team consists of Dave Hansen, Executive Director of Authors Alliance, and Dan Cohen, Vice Provost for Information Collaboration, Dean of the Library, and Professor of History at Northeastern University. In January, I joined the team as the Public Interest AI Strategist. In this capacity I will leverage extensive experience developing community around responsible computational use of memory organization collections as data and responsible AI. Giulia Taurino recently joined the team as Project Coordinator. Giulia holds a doctoral degree in Media Studies and Visual Arts from the University of Bologna and the University of Montreal and is currently a member of the NULab for Digital Humanities and Computational Social Science and of the AI & Arts interest group at The Alan Turing Institute.

The project team is guided by a strong advisory board composed of senior leaders and experts who think deeply about how authors, libraries, and AI can better serve the public interest. 

  • David Bamman, Associate Professor, UC Berkeley School of Information
  • Sandra Aya Enimil, Director of Scholarly Communications and Collection Strategy, Yale University Library
  • Mike Furlough, Executive Director, HathiTrust
  • David Smith, Associate Professor, Khoury College of Computer Sciences, Northeastern University
  • Claire Stewart, Dean of Libraries and University Librarian, University of Illinois, Urbana-Champaign 
  • Mehtab Khan, Assistant Professor of Law at Cleveland State University College of Law
  • Rachael Samberg, Director, Scholarly Communications and Information Policy, UC Berkeley Library
  • Robin Sloan, NY Times best-selling science fiction author
  • Günter Waibel, Associate Vice Provost & Executive Director, California Digital Library
  • Martha Whitehead, Vice President for the Harvard Library and University Librarian, Harvard University
  • John Wilkin, CEO, LYRASIS
  • Suzanne Wones, University Librarian, UC Berkeley Library
  • Ted Underwood, Professor of Information Science and English, University of Illinois, Urbana-Champaign

How you can get involved 

Over the next year the project team will engage a diverse set of stakeholders in a co-development process that directly informs The Public Interest Corpus priorities, strategies, and partnerships. To kick things off we are holding a working event at Northeastern University Library in Boston, Massachusetts on March 3 where a group of senior library administrators, publishers, disciplinary researchers, authors, and technical experts will workshop core legal, technical, business model, and governance challenges. 

Moving forward, we intend to hold additional focused in-person and virtual working events with a broad range of communities. We strongly believe that engaging with diverse stakeholders in a co-development process for this effort will be key to success. If you are interested in participating in a future event, hosting a Public Interest Corpus event, or have other ideas for how we might collaborate, please let us know via the following form.

We look forward to advancing a public interest solution with you all.