We have written in the past about how a game engine is not a single piece of software but rather an amalgamation of different software applications that come together in a fluid package to help game developers move from ideation to distribution. These “engines” are not simply pre-packaged sets of tools; they are often collections of first- and third-party custom-made tools and integrated open-source code. Developers choose an engine and its components based on the output they are trying to achieve. While game engines like Unity and Unreal Engine are best known for the games they power, they have recently been used to create movies, digital twins, and real-life simulations. This week, we want to cover another major use case for game engines: synthetic data creation.
Synthetic data is a class of data generated by a computer rather than captured from a real-world event. For example, a generative AI model could create dozens of images of a stop sign in order to train self-driving cars. Gartner believes this type of data will likely overshadow “real data” for AI model training by 2030 (Gartner).
Many of today's most powerful AI models are text-based, hence the name: large language models (LLMs). These models train on trillions of words to derive patterns and infer outputs to prompts. The text is broken down into digestible components called tokens, each of which equates to roughly 0.8 words on average. It is estimated that the whole web holds over 3,100 trillion tokens available for training. It is also estimated that public text data will be exhausted by the end of the decade (arXiv). Given this impending data scarcity, the industry has considered synthetic data a potential solution.
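To put those figures in perspective, here is a quick back-of-the-envelope conversion. It is only a sketch: it uses the approximate ratio of 0.8 words per token and the 3,100 trillion token estimate quoted above, and the function names are our own, chosen for illustration.

```python
# Rough token <-> word conversion using the approximate figures quoted above.
WORDS_PER_TOKEN = 0.8            # approximate average; varies by tokenizer and language
WEB_TOKENS = 3_100 * 10**12      # "over 3,100 trillion tokens" on the whole web


def tokens_to_words(tokens: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Estimate how many words a given number of tokens represents."""
    return tokens * words_per_token


def words_to_tokens(words: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Estimate how many tokens a given number of words breaks into."""
    return words / words_per_token


web_words = tokens_to_words(WEB_TOKENS)
print(f"{web_words / 10**12:,.0f} trillion words")  # prints "2,480 trillion words"
```

In other words, the estimated token supply corresponds to roughly 2,480 trillion words, and a 1,000-word article would break into about 1,250 tokens under the same assumption.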
Depending on the use case, “real data”, which naturally occurs in the world, can be slow, expensive, or dangerous to capture. Additionally, this data can be noisy (polluted with less helpful or irrelevant data), have privacy or access concerns, or sometimes be impossible to capture. For example, there are no real datasets for landing humans on Mars.
On the other hand, synthetic data can have multiple benefits:
One of the key benefits of synthetic data is augmentation: the ability to supplement existing real data to make it more accurate (eliminate biases) or to fill in gaps (add synthetic data points) that may not have occurred in nature. In the case of detecting fraud in financial markets, for instance, data may be scarce for a specific new market or demographic. Using synthetic data to augment scarce data has improved model accuracy: Experian, a multinational data analytics and consumer credit reporting company, was able to improve its model accuracy for credit risk decisioning by over 10% (Experian).
Some industries where data security is crucial, such as finance and healthcare, are taking a different approach to synthetic data: they use it to replicate real-life datasets with fake information while preserving the patterns within the data. This allows the synthetic data to be shared more freely and to be interpreted and analyzed broadly, while keeping the real data of specific individuals safe. It also opens up model training access to what have historically been sensitive private datasets. Approaches like these could help fill the gap left by publicly available data, which has largely been used up in training today’s models (MIT).
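As a minimal sketch of the idea, the example below fits a few summary statistics to a tiny, made-up table of "real" records and then samples brand-new records that follow the same overall pattern. Everything here is an assumption for illustration: the records, the column choice (age and income), and the simple linear model. Production tools use far richer models, and sampling this way does not by itself guarantee privacy.

```python
import random
import statistics

# Hypothetical "real" records as (age, annual_income); invented for illustration.
real = [(23, 31_000), (35, 52_000), (41, 61_000),
        (52, 75_000), (29, 40_000), (60, 83_000)]


def fit(records):
    """Summarize the data: the age distribution plus a linear age -> income trend."""
    ages = [a for a, _ in records]
    incomes = [i for _, i in records]
    mean_age, mean_inc = statistics.mean(ages), statistics.mean(incomes)
    # Least-squares slope and intercept of income on age.
    cov = sum((a - mean_age) * (i - mean_inc) for a, i in records)
    var = sum((a - mean_age) ** 2 for a in ages)
    slope = cov / var
    intercept = mean_inc - slope * mean_age
    # Spread of real incomes around the fitted trend line.
    residuals = [i - (intercept + slope * a) for a, i in records]
    return {
        "mean_age": mean_age,
        "sd_age": statistics.stdev(ages),
        "slope": slope,
        "intercept": intercept,
        "sd_res": statistics.stdev(residuals),
    }


def sample(model, n, seed=0):
    """Draw n synthetic records from the fitted summary, not from the real rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.gauss(model["mean_age"], model["sd_age"])
        noise = rng.gauss(0, model["sd_res"])
        income = model["intercept"] + model["slope"] * age + noise
        rows.append((round(age), round(income)))
    return rows


model = fit(real)
synthetic = sample(model, 1000)
print(synthetic[:3])
```

The synthetic rows keep the broad relationship (older people tend to earn more in this toy dataset) without copying any individual row directly, which is the property that makes such data easier to share.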
Despite the potential benefits, synthetic data is not a perfect representation of the natural world and can overlook certain criteria or situations that could impact the model's outcome. This makes data quality and validation some of the largest concerns around synthetic data. Additionally, privacy is not guaranteed, and synthetic datasets must be generated carefully to ensure that privacy standards are upheld.
Game engines provide a unique environment in which synthetic data can be created. Today's engines are capable of immense calculations, real-time physics, high-quality rendering, and rapid iteration. Because they have evolved with the intention of building more and more realistic games, they are also well suited to creating realistic, life-like environments that can be used to simulate scenarios for AI model training. In this scenario, the goal for the engine is no longer to make the most engaging game, but the most realistic simulation of a real-world environment.
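One common way to use a simulated environment for training data is to randomize the scene's parameters and record a label alongside each variation. The sketch below is engine-agnostic and purely illustrative: it only generates the scene settings and labels, and the parameter names and ranges are our own assumptions rather than any engine's API. In practice, each row would be handed to an engine such as Unreal or Unity to render the corresponding image.

```python
import random
from dataclasses import dataclass, asdict


@dataclass
class SceneSample:
    """One randomized scene description plus its ground-truth label (hypothetical fields)."""
    sun_elevation_deg: float   # time-of-day lighting
    fog_density: float         # weather conditions
    sign_distance_m: float     # how far the sign is from the camera
    sign_yaw_deg: float        # how much the sign is rotated away from the camera
    label: str                 # known for free, because we placed the object ourselves


def random_scene(rng: random.Random) -> SceneSample:
    """Draw one scene; the ranges are illustrative assumptions, not measured values."""
    return SceneSample(
        sun_elevation_deg=rng.uniform(5, 85),
        fog_density=rng.uniform(0.0, 0.3),
        sign_distance_m=rng.uniform(5, 60),
        sign_yaw_deg=rng.uniform(-45, 45),
        label="stop_sign",
    )


def generate_dataset(n: int, seed: int = 42) -> list:
    """Generate n labeled scene descriptions, reproducible for a given seed."""
    rng = random.Random(seed)
    return [asdict(random_scene(rng)) for _ in range(n)]


dataset = generate_dataset(3)
for row in dataset:
    print(row)
```

Because the simulation places every object itself, each sample arrives already labeled, and rare conditions (dense fog, low sun, sharp angles) can be generated as often as needed simply by adjusting the ranges.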
Digital twins: Game engines have been increasingly used to simulate the real world. For example, Unreal Engine has been used to create a range of digital twins (digital representations of physical objects or environments):
Data augmentation: Because game engines offer high fidelity and a broad range of simulation capabilities, the synthetic data they produce is increasingly being used to train AI models, either to augment existing datasets or, in some cases, to act as foundational data.
Some examples include:
Not only is this exciting because of its impact on the efficiency of training AI models, but this use case is also likely to create a feedback loop that makes Unreal Engine (and other game engines) more accurate across all of their use cases (network effects). As we mentioned earlier, engines are amalgamations of different software, and as new use cases emerge, the community builds new software to better cater to them. Imagine that a new piece of software emerges that better predicts the movement of our solar system for various scientific purposes; the same software could also be used to make more realistic space-based adventure games.
One problem with engine-generated synthetic data is the cumbersome and intense setup process. Users not only need to learn how to use a game engine, but also need to refine inputs, adjust environmental values, and test the simulation's accuracy against real-world environments. Some simulations may require bespoke software to improve performance or to allow the engine to properly replicate a real-world scenario. This can be a large challenge, especially for users outside of games who are not familiar with game engines. We are excited to see technologies such as AI-enabled or low-code game engines emerge that could help automate and democratize these simulations, allowing more companies to access synthetic datasets to improve their products and services.
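To give a flavor of what testing a simulation against the real world can involve, here is a deliberately simple sketch that compares the mean and spread of one measured quantity in real and simulated samples. The numbers, the quantity (braking distance), and the 10% tolerance are all made up for illustration; a real validation process would compare many more properties.

```python
import statistics


def relative_gap(real, synthetic):
    """Relative difference in mean and in standard deviation between two samples."""
    real_mean, real_sd = statistics.mean(real), statistics.stdev(real)
    mean_gap = abs(statistics.mean(synthetic) - real_mean) / abs(real_mean)
    sd_gap = abs(statistics.stdev(synthetic) - real_sd) / real_sd
    return mean_gap, sd_gap


def passes(real, synthetic, tolerance=0.10):
    """True only if both the mean and the spread are within the tolerance."""
    return all(gap <= tolerance for gap in relative_gap(real, synthetic))


# Hypothetical braking distances in meters; invented values, not real measurements.
real_braking_m = [38.1, 40.2, 39.5, 41.0, 37.8]
sim_braking_m = [38.5, 39.9, 40.1, 40.6, 38.2]

mean_gap, sd_gap = relative_gap(real_braking_m, sim_braking_m)
print(f"mean gap: {mean_gap:.1%}, spread gap: {sd_gap:.1%}")
print("within tolerance:", passes(real_braking_m, sim_braking_m))
```

In this made-up example the averages agree closely, but the simulated values vary noticeably less than the real ones, so the check fails on spread. That is the kind of gap that sends users back to refine inputs and adjust environmental values.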
Takeaway: Game engines like Unreal and Unity, originally designed to power immersive gaming experiences, have expanded their utility into filmmaking, digital twins, and synthetic data creation. Synthetic data, generated by computers rather than real-world events, addresses the growing scarcity and limitations of real-world data for AI model training. It offers cost-effective, customizable, and readily labeled datasets while helping to avoid privacy and ethical concerns. Game engines excel in creating realistic, high-fidelity simulations, which are helpful for training computer vision models and making dynamic simulations. This evolution not only improves AI models but also fuels a flywheel for the engines themselves, creating new tools and capabilities that enhance their use across both the real world and gaming.