Yesterday there was a pretty interesting class action lawsuit filed against Github and Microsoft. The suit is about Github’s Copilot service, which it advertises as “Your AI pair programmer.” As described by Github, Copilot is “trained on billions of lines of code” and “turns natural language prompts into coding suggestions across dozens of languages.” The suit focuses on Github’s reuse of code deposited with it by programmers, mostly under open source licenses, which Github has used to train the Copilot AI. Those licenses generally allow reuse but commonly come with strings attached, such as requiring attribution and relicensing the new work under the same or similar terms. The class action asserts, among other things, that Github hasn’t followed those terms because it hasn’t attributed the source adequately and has removed copyright-relevant information.
Sounds interesting, but you might be wondering why we care about this lawsuit. For a few reasons. First, it raises some important questions about the extent to which researchers can use AI to train and produce outputs based on datasets of copyrighted materials, even materials thought generally “safe” because they’re available under open licenses. As the suit highlights, materials that are openly licensed aren’t free of all restrictions (most include attribution requirements), but when those materials are aggregated and used to craft new outputs, it can be seriously complicated to find the right way to attribute all the underlying creators. If this suit raises the barrier to using such materials, it could pose real problems for many existing research projects. It could also further narrow the datasets AI researchers are likely to use, resulting in an even smaller pool of materials that includes what law professor Amanda Levendowski refers to as “biased, low-friction data” (BLFD), which can lead to some pretty bad and biased results. How and when open license attribution requirements apply is important for anyone doing research with such materials in aggregate.
Second, the suit at least indirectly implicates some of the same legal principles that authors working on text-data mining projects rely on. We’ve argued (successfully, before the U.S. Copyright Office) that such uses are generally not infringing, particularly for research and educational purposes, because fair use allows for them. Several others, such as Professors Michael Carroll and Matthew Sag, have made similar arguments. Of course, Github Copilot has some meaningful differences from text-data mining for academic research; for example, it produces textual outputs based on the underlying code for a commercial application. But the fair use issue in this case could have a direct impact on other applications.
Interestingly, the Github Copilot suit doesn’t actually allege copyright infringement, which is how fair use would most naturally be raised as a defense. Instead, the plaintiffs, as class representatives, make two claims that could implicate a fair use defense: 1) a contractual claim that Github has violated the open source licenses covering the underlying code, which generally require attribution among other things; and 2) a claim that Github has violated Section 1202 of the Digital Millennium Copyright Act by removing copyright management information (“CMI”) (e.g., copyright notice, titles of the underlying works).
The complaint attempts to avoid the fair use issue, asserting that “the Fair Use affirmative defense is only applicable to Section 501 copyright infringement. It is not a defense to violations of the DMCA, Breach of Contract, nor any other claim alleged herein.” The plaintiffs may well be trying to follow the playbook of another recent open source licensing case, Software Freedom Conservancy v. Vizio, which successfully convinced a federal court that its breach of contract claims, based on an alleged breach of the GPLv2 license, should be considered separate and apart from a copyright fair use defense.
This suit is a little different, though. For one, at least five of the eleven licenses at issue explicitly recognize the applicability of fair use; for example, the GNU General Public License version 3 provides that “This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.” It would seem more of a challenge to convince a court that a fair use defense doesn’t matter when almost half of the licenses explicitly say it does. Likewise, while the text of Section 1202 doesn’t explicitly allow for a fair use defense, its restrictions apply only to the removal of CMI when it is done “without the authority of the copyright owner or the law.” The plaintiffs claim that fair use isn’t a defense to allegations of a Section 1202 violation, but that’s far from clear, and it may be that removal of information pursuant to a valid fair use claim should qualify as removal with the “authority . . . of the law.”
The lawsuit is a class action, so it faces some special hurdles that a typical suit would not. For example, the plaintiffs must demonstrate that they can adequately represent the interests of the class, which they have defined as:
All persons or entities domiciled in the United States that, (1) owned an interest in at least one US copyright in any work; (2) offered that work under one of GitHub’s Suggested Licenses; and (3) stored Licensed Materials in any public GitHub repositories at any time between January 1, 2015 and the present (the “Class Period”).
That could pose a challenge given that it seems likely that at least a portion, if not a sizable portion, of those who contributed code to Github under those open licenses may be more sympathetic to Github’s reuse than to the claims of the plaintiffs. In Authors Guild v. Google, another class action suit involving mass copying to facilitate computer-aided search and outputs like snippet view in Google Books, similar intra-class conflicts posed a challenge to class certification (including objections we raised on behalf of academic authors). The Github Copilot suit also includes a number of other claims that mean it could be resolved without addressing the copyright and licensing issues noted above. For now, we’ll monitor the case and update you on outcomes relevant to authors.