

Newsletter | Jan 10, 2025

Game Engines & Synthetic Data

Game engines will help with the AI data shortage

We have written in the past about how game engines are not a single piece of software but rather an amalgamation of different applications that come together in a fluid package to help game developers move from ideation to distribution. These “engines” are rarely a simple pre-packaged set of tools; they are often a collection of first- and third-party custom-made tools or integrated open-source code. Developers choose an engine and its components based on the output they are trying to achieve. While game engines like Unity and Unreal Engine are best known for the games they power, they have recently been used to create movies, digital twins, and real-life simulations. This week, we want to cover another major use case for game engines: synthetic data creation.

Benefits of Synthetic Data

Synthetic data is a class of data created by a computer rather than a real-world event. For example, a generative AI model could create dozens of images of a stop sign in order to train self-driving cars. Gartner believes this type of data will likely overshadow “real data” for AI model training by 2030 (Gartner).

Many of today's most powerful AI models are text-based, hence the name: large language models (LLMs). These models train on trillions of words to derive patterns and generate responses to prompts. The text is broken down into digestible units called tokens, each of which usually equates to ~0.8 words. It is estimated that over 3,100 trillion tokens are available for training on the whole web. It is also estimated that public text data will be exhausted by the end of the decade (arXiv). Given the impending data scarcity, the industry has considered synthetic data a potential solution.
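
The ~0.8 words-per-token ratio above makes these figures easy to sanity-check. A minimal sketch (the ratio is the rough one cited here, not an exact property of any real tokenizer):

```python
# Back-of-the-envelope arithmetic using the rough ratio cited in the text:
# 1 token ~= 0.8 words. Real tokenizers vary by language and vocabulary.
WORDS_PER_TOKEN = 0.8

def tokens_to_words(tokens: float) -> float:
    """Approximate word count for a given token count."""
    return tokens * WORDS_PER_TOKEN

def words_to_tokens(words: float) -> float:
    """Approximate token count for a given word count."""
    return words / WORDS_PER_TOKEN

# ~3,100 trillion tokens estimated across the whole web:
web_tokens = 3_100e12
print(f"~{tokens_to_words(web_tokens):.2e} words")  # roughly 2.48e15 words
```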

Depending on the use case, “real data”, which naturally occurs in the world, can be slow, expensive, or dangerous to capture. Additionally, this data can be noisy (polluted with less helpful or irrelevant data), have privacy or access concerns, or sometimes be impossible to capture. For example, there are no real datasets for landing humans on Mars.

On the other hand, synthetic data can have multiple benefits:

  • Can be created on-demand in limitless quantities
  • Is customizable
  • Is cheaper to acquire
  • Is produced pre-labeled (i.e., has been designated specific categories or tags that provide context for training)
  • Is not “real” (which mitigates security or ethical concerns)
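
The pre-labeling point is worth making concrete: because the generator decides what each sample contains, the ground-truth label comes for free. A toy sketch (all names and value ranges are hypothetical, standing in for an engine rendering scenes it already knows the contents of):

```python
import random

# Hypothetical sketch: synthetic samples arrive pre-labeled because the
# generator knows the ground truth it used to create each example.
def make_sample(label: str) -> dict:
    # Toy "renderer": a feature value drawn from a per-label range stands in
    # for an engine rendering a scene whose contents it already knows.
    ranges = {"stop_sign": (0.8, 1.0), "yield_sign": (0.0, 0.4)}
    lo, hi = ranges[label]
    return {"feature": random.uniform(lo, hi), "label": label}  # label attached at creation

dataset = [make_sample(random.choice(["stop_sign", "yield_sign"])) for _ in range(1000)]
# Every record carries its label with no human annotation pass.
assert all("label" in row for row in dataset)
```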

One of the key benefits of synthetic data is augmentation: supplementing existing real data to make it more accurate (by reducing biases) or to fill in gaps (by adding synthetic data points) that may not have occurred in nature. In fraud detection for financial markets, for example, data for a specific new market or demographic may be scarce. Using synthetic data to augment scarce data has improved model accuracy: Experian, a multinational data analytics and consumer credit reporting company, improved its model accuracy for credit risk decisioning by over 10% (Experian).

Some industries where data security is crucial, such as finance and healthcare, are taking a different approach to synthetic data: using it to replicate real-life datasets with fake information while preserving the patterns within the data. This allows the synthetic data to be shared, interpreted, and analyzed more broadly while keeping real individuals' data safe, and it provides model training access to datasets that have historically been sensitive and private. Approaches like these could help fill the gap left by publicly available data, which has largely been used up in training today's models (MIT).
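
A heavily simplified sketch of the idea: fit a statistical summary of the real records, then sample fresh records from it. Production systems use far richer generative models; this only illustrates "preserve the pattern, discard the individuals":

```python
import random
import statistics

# Hedged sketch: replace real records with synthetic ones that preserve an
# aggregate pattern (here, just mean and standard deviation) while containing
# no actual individual's value. The "real" data below is itself simulated.
random.seed(0)
real_balances = [random.gauss(5_000, 1_200) for _ in range(10_000)]  # stand-in real dataset

mu = statistics.mean(real_balances)
sigma = statistics.stdev(real_balances)

# Sample a synthetic dataset from the fitted summary statistics.
synthetic_balances = [random.gauss(mu, sigma) for _ in range(10_000)]

# The synthetic set mirrors the pattern without reusing any real record.
print(round(statistics.mean(synthetic_balances)), round(statistics.stdev(synthetic_balances)))
```

Note that this naive approach does not guarantee privacy on its own, which is exactly the caveat raised below: generation must be done carefully.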

Despite these potential benefits, synthetic data is not a perfect representation of the natural world and can overlook criteria or situations that affect a model's outcome. This makes data quality and validation some of the largest concerns around synthetic data. Additionally, privacy is not guaranteed: synthetic datasets must be generated carefully to ensure that privacy standards are upheld.

From Textbooks to Game Engines

Game engines provide a unique environment in which synthetic data can be created. Today's engines are capable of immense calculation, real-time physics, and rendering, and they allow rapid iteration. Because they have evolved with the intention of building ever more realistic games, they are also well suited to creating life-like environments that can simulate scenarios for AI model training. In this context, the goal for the engine is no longer to make the most engaging game, but the most realistic simulation of a real-world environment.

Digital twins: Game engines have been increasingly used to simulate the real world. For example, Unreal Engine has been used to create a range of digital twins (digital representations of physical objects or environments):

  • The Changi Airport in Singapore (one of the busiest airports in the world for international travel) was replicated using the Unreal Engine. This simulated digital twin incorporates real-time sensors, allowing operators to use live data from the airport, including plane locations, to inform their model. Over time, they plan to monitor arrival and departure times, humidity, and temperature, creating one centralized information hub (Unreal Engine).
  • Another example is the dynamic and real-time model of Wellington, which incorporates sensors, geospatial data, building infrastructure, and online data. The intention of this model is to help decision-makers make better, more informed decisions around things like climate change and economic development (Unreal Engine).

Data augmentation: Because of their high-fidelity rendering and breadth of simulation capabilities, game engines are increasingly used to generate synthetic data for training AI models, either to augment datasets or, in some cases, to act as foundational data.

Some examples include:

  • Computer Vision (CV): Some companies are using game engines to create images and videos that help their computer vision models adapt to edge cases. For example, if you were trying to create a computer vision model that can identify a box of cereal, you could use all of the public images of cereal boxes, but those may not include an image of a cereal box in an office setting or on a spaceship, creating a blind spot in your model. In a game engine, you could create this scenario from dozens of perspectives (Duality.ai, IndiaGDC).
  • Simulation: It is possible to create variations of different scenarios to capture different data based on different variables. For example, Unreal Engine’s Lumen is a tool to create dynamic lighting. This could be used to run hundreds of different scenarios attempting to understand the impact of weather on solar panels and what that means for energy production. Dozens of different tools such as MetaHuman (high-fidelity digital characters) and Chaos Destruction (physics system customized for destruction of buildings) can be used to create high-quality simulations and derive different data from different variations of these simulations.
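
The variation idea in both examples above is often called domain randomization: sweep the scene parameters and render every combination. A minimal sketch (the parameter names are hypothetical, not a real engine API):

```python
import itertools

# Illustrative "domain randomization" sketch: enumerate combinations of scene
# parameters to generate many labeled variations of one scenario. All names
# here are hypothetical placeholders, not calls into an actual game engine.
lighting = ["dawn", "noon", "overcast", "night"]
settings = ["kitchen", "office", "spaceship"]
camera_angles_deg = [0, 45, 90, 135]

scenes = [
    {"object": "cereal_box", "lighting": l, "setting": s, "angle_deg": a}
    for l, s, a in itertools.product(lighting, settings, camera_angles_deg)
]
print(len(scenes))  # 4 * 3 * 4 = 48 scene configurations from a single object
```

In practice each configuration would be handed to the engine's renderer, producing an image whose label (the cereal box and its pose) is known by construction.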

Not only is this exciting because of its impact on the efficiency of training AI models, but this use case is likely to create a feedback loop that makes Unreal Engine (and other game engines) more accurate across all of their use cases (network effects). As we mentioned earlier, engines are amalgamations of different software, and as new use cases emerge, the community builds new software to better cater to them. Imagine that new software emerges that better predicts the movement of our solar system for scientific purposes; that same software could also be used to make more realistic space-based adventure games.

One problem with engine-generated synthetic data is the cumbersome and intensive setup process. Not only do users need to learn how to use a game engine, but they also need to refine inputs, adjust environmental values, and test the output's accuracy against real-world environments. Some simulations may require bespoke software to improve performance or allow the engine to properly replicate a real-world scenario. This can be a large challenge, especially for users outside of games who are not familiar with game engines. We are excited to see emerging technology, such as AI-enabled or low-code game engines, that could help automate and democratize these simulations, allowing more companies to access synthetic datasets to improve their products and services.

Takeaway: Game engines like Unreal and Unity, originally designed to power immersive gaming experiences, have expanded their utility into filmmaking, digital twins, and synthetic data creation. Synthetic data, generated by computers rather than real-world events, addresses the growing scarcity and limitations of real-world data for AI model training. It offers cost-effective, customizable, and readily labeled datasets while helping to avoid privacy and ethical concerns. Game engines excel in creating realistic, high-fidelity simulations, which are helpful for training computer vision models and making dynamic simulations. This evolution not only improves AI models but also fuels a flywheel for the engines themselves, creating new tools and capabilities that enhance their use across both the real world and gaming.
