The legality of using copyrighted training data for generative AI models
There are a multitude of concerns around content and artificial intelligence (AI), including deepfakes and their impact on public misinformation, whether generative AI works can be copyrighted, privacy concerns for personal data, and what can be freely used as training data in LLMs and other generative models. In August 2023, we explored the implications of copyright law and generative AI content on the gaming industry, concluding that “we expect a majority of the copyright infringement cases against generative AI companies to fail in the courts and their activities will be protected primarily under the ‘transformative’ character of the text and images produced.” Transformative uses are more likely to be considered fair under US copyright law and are defined as “uses [that] add something new, with a further purpose or different character, and do not substitute for the original use of the work” (US Copyright Office).
Since then, a number of cases have progressed through the courts and new lawsuits have been filed, which is starting to clarify the crux of the debate and offer some insight into how courts are ruling. As of June 2024, there are some 24 active copyright lawsuits against AI companies in the US (and a number of others in the UK, the European Union, and elsewhere around the world). The most prominent include New York Times v. Microsoft, Concord Music Group, Inc. v. Anthropic PBC, Alter v. OpenAI, and Getty Images v. Stability AI.
This week, as a refresher and updated perspective from Konvoy, we are focusing specifically on the question of whether generative AI models can train on copyrighted data, and the implications for the gaming industry.
To properly assess and interpret these cases, it is important to understand the underlying legal framework and concepts of copyright law. In US law, “copyright is a type of intellectual property that protects original works of authorship as soon as an author fixes the work in a tangible form of expression” (US Copyright Office). Copyright is automatic for original works, and it gives the owner the exclusive right to reproduce, modify, adapt, distribute, and display the work, or to grant others those rights.
When a work is not copyright protected, it is in the “public domain,” owned by the public rather than by an individual or entity. Anyone can use public domain works without obtaining permission, and works that enter the public domain cannot be reappropriated. Most copyrighted works are protected for the life of the author plus 70 years, after which they enter the public domain (US Copyright Office).
“Fair Use” is an exception: Even though original work is protected under copyright law, there is the concept of “Fair Use,” under which copyrighted works can be leveraged without the consent of the owner within strict limitations. Fair Use is what the majority of generative AI companies are basing their claims on to leverage copyrighted works in their training data. Under US law (17 U.S.C. § 107), courts weigh four factors when evaluating a Fair Use claim:
- The purpose and character of the use, including whether it is commercial or for nonprofit educational purposes (and whether it is transformative)
- The nature of the copyrighted work
- The amount and substantiality of the portion used in relation to the copyrighted work as a whole
- The effect of the use upon the potential market for, or value of, the copyrighted work
While the conditions for Fair Use are strict, they are open to interpretation: for example, what constitutes a substantial portion of a copyrighted work, or whether the use will significantly harm the market for the original. For generative AI models, the question becomes whether the content produced is substantially similar to a work that was ingested by the AI and whether it is a realistic substitute for the original work. An important point, though, is that ingestion of copyrighted works as training data alone is not illegal. What matters is whether the output of a model is “substantially similar” to a protected work (Association of Research Libraries).
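To make the distinction concrete, one crude, purely illustrative way to screen model outputs for verbatim reproduction is to measure word-level n-gram overlap against a reference corpus. The sketch below is hypothetical: the `verbatim_overlap` function, the 8-gram window, and the 20% threshold are our own illustrative assumptions, not anything used by the parties in these cases, and real substantial-similarity analysis is a legal judgment that goes well beyond string matching.

```python
# Hypothetical sketch: flag model outputs that reproduce long verbatim
# passages from a corpus of protected reference texts.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(output: str, reference: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that appear verbatim in the reference."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngrams(reference, n)) / len(out_grams)

# Illustrative usage: flag outputs whose 8-gram overlap with any
# reference text exceeds an arbitrary 20% threshold.
references = ["..."]   # placeholder corpus of protected works
model_output = "..."   # placeholder generated text
if any(verbatim_overlap(model_output, ref) > 0.2 for ref in references):
    print("Potential verbatim reproduction; needs human/legal review.")
```

Even a naive screen like this hints at why per-output assessment is costly: every generated output would need to be compared against an enormous corpus of protected works, and near-verbatim paraphrases would slip through entirely.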
In the case of New York Times v. Microsoft, the New York Times (NYT) claims that Microsoft’s Copilot and OpenAI’s ChatGPT were trained on millions of NYT articles, and that the models generate outputs that recite NYT content verbatim, closely summarize it, and mimic its style, ultimately offering a competing product in the market. This claim is more substantial than the majority of previous claims against AI companies because it does not focus solely on training, but rather alleges that the outputs are substantially similar to original NYT articles and that users could simply go to ChatGPT for their news and analysis rather than the NYT (a significant commercial harm to the NYT).
OpenAI disputes these claims as meritless, arguing that ChatGPT “regurgitation” is a rare bug they are working to fix and that the vast majority of outputs are substantially transformative in nature. They also note that the NYT and OpenAI were negotiating a partnership for displaying attribution, in which the NYT “would gain a new way to connect with their existing and new readers, and [OpenAI] users would gain access to their reporting.” This last point attempts to argue that NYT content and reporting surfaced through ChatGPT would not negatively impact the NYT.
While these court cases proceed, interest groups are forming to galvanize public opinion and shape common practices. Back in November 2023, the head of Stability AI’s audio team, Ed Newton-Rex, resigned with a public letter citing his disagreement with “the company’s opinion that training generative AI models on copyrighted works is ‘fair use’.” He went on to found a non-profit called Fairly Trained, which certifies generative AI models that do not use copyrighted work without a license; it has certified 14 models to date. Because these models do not rely on Fair Use claims over copyrighted works, they are severely limited in the training data they can use, which curbs their applicability and competitiveness in the market. The bet is that, over time, the law will side with the notion that Fair Use does not apply to AI training data.
Fairly Trained is supported by organizations that represent creators or own copyrighted content and have an interest in controlling how AI companies leverage that content, including Universal Music Group, Concord, SAG-AFTRA, Artists Rights Society, The Authors Guild, and The Association of American Publishers. There are large financial incentives on both sides of this debate, with AI and tech companies pushing the boundaries of generative models on one side and the creators and owners of content on the other.
While these generative AI copyright lawsuits play out over the coming months and years, public perception and interest groups will also play a significant role in shaping what becomes standard practice among copyright holders and AI companies (regardless of what the courts ultimately decide).
Given the uncertainty about where the law will land, many platforms that want to use generative AI in their products are pursuing strategies that leverage proprietary, public domain, or licensed data, while leaving the door open to adding unlicensed training data in the future under Fair Use. In the gaming industry, for example, Roblox rolled out generative AI tools (a code completion tool and a materials generator) trained on assets expressly released for re-use by the Roblox community. Valve, which runs the dominant PC game store Steam, initially decided to block games with AI-generated content from its store in June 2023, but updated its policy in January 2024 to simply require disclosure of generative AI usage in game submissions. Many smaller game studios, which are less likely to be prime legal targets, are much more aggressive in integrating this technology into their workflows; Unity reported in March 2024 that 62% of the game studios it surveyed leveraged generative AI (Unity).
Takeaways: What is difficult with generative AI content is the sheer volume of unique outputs generated for individual users. Unlike past technologies (and any legal precedent), each output from a generative AI model is different. The nuance of whether a work is substantially transformative, and whether it directly harms one or more copyright-protected works, theoretically must be assessed for each generated output, which may be prohibitively expensive to do. The courts are still working through these cases, and while the outcome remains unclear, plaintiffs have brought more robust cases against generative AI companies, relying primarily on substantial similarity to original works and the adverse commercial impact on copyright owners.
Our view is that leveraging copyrighted data without permission to train models does not break copyright law in and of itself, but if a generated output is substantially similar to the original and provides a reasonable alternative for users, then it is not protected under Fair Use. The open question, which is up for interpretation by the courts and the public, is what degree of similarity counts as substantial.
Within the gaming industry, many companies are exploring generative AI out of competitive necessity, but they are doing so cautiously given the uncertain legal environment and, arguably more importantly, the risk of public and user backlash. This is a smart approach for large gaming companies, as it protects them from both the legal and reputational ramifications of leveraging copyrighted works. Additionally, because of the complexity of creating games, we are unlikely to see full reproductions of games generated by AI without significant human involvement today, which supports a more robust Fair Use claim under the argument of substantial transformation.