The legality of using copyrighted training data for generative AI models
There are a multitude of concerns around content and artificial intelligence (AI), including deepfakes and their impact on public misinformation, whether generative AI works can be copyrighted, privacy concerns for personal data, and what can be freely used as training data in LLMs and other generative models. In August 2023, we explored the implications of copyright law and generative AI content on the gaming industry, concluding that “we expect a majority of the copyright infringement cases against generative AI companies to fail in the courts and their activities will be protected primarily under the ‘transformative’ character of the text and images produced.” Transformative uses are more likely to be considered fair under US copyright law and are defined as “uses [that] add something new, with a further purpose or different character, and do not substitute for the original use of the work” (US Copyright Office).
Since then, a number of cases have progressed through the courts and new lawsuits have been filed, which is starting to clarify the crux of the debate and offer some insight into how courts are ruling. As of June 2024, there are some 24 active copyright lawsuits against AI companies in the US (and a number of others in the UK, the European Union, and elsewhere around the world). The most prominent include New York Times v. Microsoft, Concord Music Group, Inc. v. Anthropic PBC, Alter v. OpenAI, and Getty Images v. Stability AI.
This week, as a refresher and updated perspective from Konvoy, we are focusing specifically on the question of whether generative AI models can train on copyrighted data, and the implications for the gaming industry.
To properly assess and interpret these cases, it is important to understand the underlying legal framework and concepts of copyright law. In US law, “copyright is a type of intellectual property that protects original works of authorship as soon as an author fixes the work in a tangible form of expression” (US Copyright Office). Copyright is automatic for original works, and it gives the owner the exclusive right to reproduce, modify, adapt, distribute, and display the work, or to grant others those rights.
When a work is not copyright protected, it is in the “public domain,” owned by the public rather than by an individual or entity. Anyone can use public domain works without obtaining permission, and works that enter the public domain cannot be reappropriated. Most copyrighted works are protected for the life of the author plus 70 years, after which they enter the public domain (US Copyright Office).
“Fair Use” is an exception: Even though original work is protected under copyright law, there is the concept of “Fair Use,” under which copyrighted works can be leveraged without the consent of the owner within strict limitations. Fair Use is what the majority of generative AI companies are basing their claims on to leverage copyrighted works in their training data. Under US law (17 U.S.C. § 107), courts weigh four factors when evaluating a Fair Use claim:
- The purpose and character of the use, including whether it is commercial or for nonprofit educational purposes (and whether it is transformative)
- The nature of the copyrighted work
- The amount and substantiality of the portion used in relation to the copyrighted work as a whole
- The effect of the use upon the potential market for, or value of, the copyrighted work
While the conditions for Fair Use are strict, they are open to interpretation: for example, what constitutes a substantial portion of a copyrighted work, or whether the use will significantly harm the market for the original. For generative AI models, the question becomes whether the content produced is substantially similar to a work that was ingested by the AI and whether it is a realistic substitute for the original work. An important point, though, is that ingestion of copyrighted works as training data alone is not illegal. What matters is whether the output of a model is “substantially similar” to a protected work (Association of Research Libraries).
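To make the distinction concrete, one crude, purely illustrative way to screen model outputs for verbatim reproduction is to measure word-level n-gram overlap against a reference corpus. The sketch below is hypothetical: the `verbatim_overlap` function, the 8-gram window, and the 20% threshold are our own illustrative assumptions, not anything used by the parties in these cases, and real substantial-similarity analysis is a legal judgment that goes well beyond string matching.

```python
# Hypothetical sketch: flag model outputs that reproduce long verbatim
# passages from a corpus of protected reference texts.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(output: str, reference: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that appear verbatim in the reference."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngrams(reference, n)) / len(out_grams)

# Illustrative usage: flag outputs whose 8-gram overlap with any
# reference text exceeds an arbitrary 20% threshold.
references = ["..."]   # placeholder corpus of protected works
model_output = "..."   # placeholder generated text
if any(verbatim_overlap(model_output, ref) > 0.2 for ref in references):
    print("Potential verbatim reproduction; needs human/legal review.")
```

Even a naive screen like this hints at why per-output assessment is costly: every generated output would need to be compared against an enormous corpus of protected works, and near-verbatim paraphrases would slip through entirely.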
In the case of New York Times v. Microsoft, the New York Times (NYT) claims that Microsoft’s Copilot and OpenAI’s ChatGPT were trained on millions of NYT articles, and that the models generate outputs that recite NYT content verbatim, closely summarize it, and mimic its style, ultimately offering a competing product in the market. This claim is more substantial than the majority of previous claims against AI companies because it does not focus solely on training, but rather alleges that the outputs are substantially similar to original NYT articles and that users could simply go to ChatGPT for their news and analysis rather than the NYT (a significant commercial harm to the NYT).
OpenAI disputes these claims as meritless, arguing that ChatGPT “regurgitation” is a rare bug they are working to fix and that the vast majority of outputs are substantially transformative in nature. They also note that the NYT and OpenAI were negotiating a partnership for displaying attribution, in which the NYT “would gain a new way to connect with their existing and new readers, and [OpenAI] users would gain access to their reporting.” This last point attempts to argue that NYT content and reporting surfaced through ChatGPT would not negatively impact the NYT.
While these court cases proceed, interest groups are forming to galvanize public opinion and shape common practices. Back in November 2023, the head of Stability AI’s audio team, Ed Newton-Rex, resigned with a public letter citing his disagreement with “the company’s opinion that training generative AI models on copyrighted works is ‘fair use’.” He went on to found a non-profit called Fairly Trained, which certifies generative AI models that do not use copyrighted work without a license; it has certified 14 models to date. Because these models do not rely on Fair Use claims over copyrighted works, they are severely limited in the training data they can use, which curbs their applicability and competitiveness in the market. The bet is that, over time, the law will side with the notion that Fair Use does not apply to AI training data.
Fairly Trained is supported by organizations that represent creators or own copyrighted content and have an interest in controlling how AI companies leverage that content, including Universal Music Group, Concord, SAG-AFTRA, Artists Rights Society, The Authors Guild, and The Association of American Publishers. There are large financial incentives on both sides of this debate, with AI and tech companies pushing the boundaries of generative models on one side and the creators and owners of content on the other.
While these generative AI copyright lawsuits play out over the coming months and years, public perception and interest groups will also play a significant role in shaping what becomes standard practice among copyright holders and AI companies (regardless of what the courts ultimately decide).
Given the uncertainty about where the law will land, many platforms that want to use generative AI in their products are pursuing strategies that leverage proprietary, public domain, or licensed data, while leaving the door open to adding unlicensed training data in the future under Fair Use. In the gaming industry, for example, Roblox rolled out generative AI tools (a code completion tool and a materials generator) trained on assets expressly released for re-use by the Roblox community. Valve, which runs the dominant PC game store Steam, initially decided to block games with AI-generated content from its store in June 2023, but updated its policy in January 2024 to simply require disclosure of generative AI usage in game submissions. Many smaller game studios, which are less likely to be prime legal targets, are much more aggressive in integrating this technology into their workflows; Unity reported in March 2024 that 62% of the game studios it surveyed leveraged generative AI (Unity).
Takeaways: What is difficult with generative AI content is the sheer volume of unique outputs generated for individual users. Unlike past technologies (and any legal precedent), each output from a generative AI model is different. The nuance of whether a work is substantially transformative, and whether it directly harms one or more copyright-protected works, theoretically must be assessed for each generated output, which may be prohibitively expensive to do. The courts are still working through these cases, and while the outcome remains unclear, plaintiffs have brought more robust cases against generative AI companies, relying primarily on substantial similarity to original works and the adverse commercial impact on copyright owners.
Our view is that leveraging copyrighted data without permission to train models does not break copyright law in and of itself, but if a generated output is substantially similar to the original and provides a reasonable alternative for users, then it is not protected under Fair Use. The open question, which is up for interpretation by the courts and the public, is what degree of similarity counts as substantial.
Within the gaming industry, many companies are exploring generative AI out of competitive necessity, but they are doing so cautiously given the uncertain legal environment and, arguably more importantly, the risk of public and user backlash. This is a smart approach for large gaming companies, as it protects them from both the legal and reputational ramifications of leveraging copyrighted works. Additionally, because of the complexity of creating games, we are unlikely to see full reproductions of games generated by AI without significant human involvement today, which supports a more robust Fair Use claim under the argument of substantial transformation.