Authors Alliance thanks Matthew Sag, professor at Loyola University Chicago School of Law, for this guest post (originally published on InfoJustice.org).
The World Intellectual Property Organization in Geneva has requested comments on a series of questions, including: “Should the use of the data subsisting in copyright works without authorization for machine learning constitute an infringement of copyright?” I have joined other copyright experts in a submission to WIPO commenting on its questions. This note explains in more detail some of my reservations about use of the phrase “use of the data subsisting in copyright works without authorization” in WIPO’s questions and in our general thinking about the relation between copyright and text and data mining.
The phrase “use of the data subsisting in copyright works without authorization” is unhelpful, to say the least.
To begin with the most obvious problem, the “use” of the data or facts subsisting in copyright works generally requires no authorization. For example, this morning I “used the data” on the weather page of my local newspaper to decide whether I should shovel snow or wait for more snow to fall. No doubt, the newspaper is protected by copyright, but the facts contained therein are not.
Moreover, the second problem with the question WIPO proposed is that my “use” of the weather data required no authorization because it did not involve any action on my part implicating the exclusive rights of the copyright owner. I did not make a copy of the newspaper, I did not publicly perform it, I did not turn it into a digital audio transmission, etc.
Both of these points are Copyright 101, but it is easy to lose sight of the fundamentals when contemplating new and unexpected uses of copyrighted works in a rapidly evolving technological environment. It does not make sense to ask “Should the use of the data subsisting in copyright works without authorization for machine learning constitute an infringement of copyright?” in the abstract. Instead, we need to focus with more precision on the potential copyright issues that are actually raised by AI in particular contexts, and to do that we need to understand the relationship between text data mining, on the one hand, and machine learning and AI, on the other.
Text data mining refers to any computational process for applying structure to unstructured electronic texts, and it generally involves employing statistical methods to discover new information and reveal patterns in the processed data.[1]
Machine learning refers to a cluster of statistical and programming techniques that give computers the ability to “learn” from exposure to data, without being explicitly programmed.[2]
The term AI or artificial intelligence is mostly used to refer to more sophisticated forms of machine learning, or else to describe speculative accounts of what might be possible with future technology. If we put science fiction and hyperbole to one side, we can proceed to talk about machine learning and AI interchangeably in terms of the relevant copyright issues.
If we move beyond the premise that AI is a magical process that defies human understanding, we can see the third fundamental problem with the phrase “the data subsisting in copyright works.” The notion that AI is using data that “subsists” in copyright works reflects a fundamental misunderstanding of the technology at issue. Unless the copyrighted work is something like a book of used car values, the data does not subsist in the work waiting to be extracted. The data is not a subset of the work. In almost every real-world use case of AI and machine learning, the data is derived by making an external observation about the work.
This is an important point: the non-expressive metadata produced by text data mining does not originate from the underlying copyrighted works. It does not subsist in those works. Instead, it is derived from them by acts of external observation.
As I have explained in a recent paper:
“Imagine plagiarism detection software that reports that student term paper B is substantially similar to an earlier paper A. Paper A originated with student author A, but the observation as to its similarity with student B’s term paper does not originate with either A or B. It originates with the software algorithm programmed to detect plagiarism.
Likewise, a word frequency table derived from Moby Dick did not originate with Herman Melville. Melville obviously realized that he would be writing the word “whale” over and over, but presumably he never set out to make an exact count. In both examples, to the extent the metadata about the work owes its origin to anyone, that person would be the person who derived the data, not the author of the underlying work.”[3]
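For readers who want to see what “deriving” such metadata looks like in practice, the two examples in the quoted passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample sentences, and the use of a simple word-overlap score are all hypothetical choices made for this sketch, and real plagiarism detection software works in far more sophisticated ways.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each word appears in a text.

    The counts are observations about the text made by this code;
    they are not a passage copied out of the text.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def word_overlap(text_a: str, text_b: str) -> float:
    """Share of distinct words that two texts have in common (0.0 to 1.0).

    A deliberately crude stand-in for a similarity check (assumption:
    real plagiarism detectors use more elaborate methods).
    """
    a = set(word_frequencies(text_a))
    b = set(word_frequencies(text_b))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Illustrative sentences written for this sketch, not quotations from any work.
sample = "The whale surfaced. The crew saw the whale and chased the whale."
table = word_frequencies(sample)
print(table.most_common(2))  # [('the', 4), ('whale', 3)]

paper_a = "copyright protects expression not facts"
paper_b = "copyright protects expression not ideas"
print(round(word_overlap(paper_a, paper_b), 2))  # 0.67
```

Neither the frequency table nor the overlap score appears anywhere in the texts themselves. Both come into existence only when the code is run, which is the sense in which the metadata is derived by external observation.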
The false premise that the non-expressive metadata produced by text data mining already “subsists” in the copyrighted works from which it is derived leads to the false conclusion that when the data is used, something is taken from the original author. On the contrary, producing non-expressive metadata takes nothing from the original author because under any version of the idea-expression distinction, latent facts are not the property of the author. But even if they were, these are not their facts.