Once considered less desirable than real data, synthetic data is now seen by some as a panacea. Real data is messy and riddled with bias. New data privacy regulations make it harder to collect. By contrast, synthetic data is pristine and can be used to build more diverse data sets. You can produce perfectly labeled faces of different ages, shapes, and ethnicities, for example, to build a facial recognition system that works across demographics.

But synthetic data has its limitations. If it fails to reflect reality, it can end up producing even worse artificial intelligence than messy, biased real-world data, or it can simply inherit the same problems. “What I don’t want to do is give a thumbs up to this paradigm and say, ‘Oh, this solves so many problems,’” says Cathy O’Neil, a data scientist and founder of the algorithmic auditing firm ORCAA. “Because it also ignores many things.”

Realistic, not real

Deep learning has always been about data. But in recent years, the AI community has learned that good data is more important than big data. Even small amounts of the right, cleanly labeled data can do more to improve the performance of an artificial intelligence system than ten times as much uncurated data, or even a more advanced algorithm.

This changes the way companies should approach developing their artificial intelligence models, says Ofir Chakon, CEO and founder of Datagen. Today, they start by acquiring as much data as possible and then tweak their algorithms to improve performance. Instead, they should do the opposite: use the same algorithm while improving the composition of their data.

Datagen also generates fake furniture and indoor environments to put its fake humans in context.


However, gathering real-world data to run this kind of iterative experimentation is too expensive and time-consuming. This is where Datagen comes in. With a synthetic data generator, teams can create and test dozens of new data sets a day to identify which one maximizes a model’s performance.

To ensure the realism of its data, Datagen gives its vendors detailed instructions on how many people to scan in each age bracket, BMI range, and ethnicity, as well as a list of actions for them to perform, such as walking around a room or drinking a soda. The vendors send back both high-fidelity static images and motion-capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. The synthesized data is sometimes then checked again. For example, fake faces are compared against real faces to see if they look realistic.

Datagen now generates facial expressions to monitor driver alertness in smart cars, body movements to track customers in checkout-free stores, and irises and hand movements to improve eye and hand tracking in VR headsets. According to the company, its data has already been used to develop computer vision systems that serve tens of millions of users.

It is not only synthetic humans that are being mass-produced. Click-Ins is a startup that uses synthetic data to train artificial intelligence that performs automated vehicle inspections. Using design software, it recreates all the makes and models its AI needs to recognize, and then renders them with different colors, damage, and deformations under different lighting conditions, against different backgrounds. This lets the company update its AI as carmakers introduce new models, and helps it avoid data privacy violations in countries where license plates are considered private data and so cannot appear in photographs used to train AI.

Click-Ins renders cars of different makes and models against various backgrounds.


Mostly.ai works with financial, telecommunications, and insurance companies to provide spreadsheets of fake customer data that let companies share their customer databases with external vendors in a legally compliant way. Anonymization can reduce the richness of a data set yet still fail to adequately protect people’s privacy. Synthetic data, however, can be used to generate detailed fake data sets with the same statistical properties as a company’s real data. It can also be used to simulate data that a company does not yet have, including a more diverse customer base or scenarios such as fraudulent activity.
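To make "same statistical properties" concrete, here is a minimal sketch in Python. It is not Mostly.ai's method, which the article does not describe. It assumes a made-up two-column customer table (age and monthly spend) and a deliberately simple model, a two-variable Gaussian fitted to the table's means, spreads, and correlation. All names and numbers in it are hypothetical.

```python
import math
import random
import statistics


def fit_two_columns(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n fake rows from a bivariate Gaussian with the fitted statistics."""
    mx, my, sx, sy, rho = params
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Mixing z1 into the second column reproduces the fitted correlation.
        rows.append((mx + sx * z1, my + sy * (rho * z1 + residual * z2)))
    return rows


rng = random.Random(0)

# Stand-in for a "real" customer table: (age, monthly spend). Entirely made up.
real = []
for _ in range(5000):
    age = rng.gauss(40, 10)
    real.append((age, 20 + 1.5 * age + rng.gauss(0, 12)))

params = fit_two_columns(real)
synthetic = sample_synthetic(params, 5000, rng)
synth_params = fit_two_columns(synthetic)

print("real     :", [round(p, 2) for p in params])
print("synthetic:", [round(p, 2) for p in synth_params])
```

No synthetic row copies a real one, yet the two tables produce nearly the same summary statistics. Production generators typically rely on far richer models, and matching statistics alone says nothing about how much a model has memorized about individual records.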

Proponents of synthetic data say it can also help evaluate artificial intelligence. In a recent paper published at an AI conference, Suchi Saria, an assistant professor of machine learning and health care at Johns Hopkins University, and her coauthors showed how data generation techniques could be used to extrapolate different patient populations from a single data set. This can be useful if, for example, a company only has data on New York’s younger population but would like to understand how its AI performs on an aging population with a higher prevalence of diabetes. She is now setting up her own company, Bayesian Health, to use this technique to test medical AI systems.

The limits of faking it

But is synthetic data overhyped?

When it comes to privacy, “the fact that the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it does not encode sensitive information about real people,” says Aaron Roth, a professor of computer science at the University of Pennsylvania. Some data generation techniques have been shown to closely reproduce images or text found in the training data, for example, while others are vulnerable to attacks that make them regurgitate that data.

This might be fine for a company like Datagen, whose synthetic data is not meant to conceal the identities of the individuals who consented to be scanned. But it would be bad news for companies that offer their solutions as a way to protect sensitive financial or patient data.

Research suggests that the combination of two synthetic data techniques in particular, differential privacy and generative adversarial networks, can provide the strongest privacy protections, says Bernease Herman, a data scientist at the University of Washington eScience Institute. But skeptics worry that this nuance can be lost in the marketing language of synthetic data vendors, which are not always forthcoming about the techniques they use.
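For readers unfamiliar with differential privacy, the sketch below shows its simplest building block, the Laplace mechanism, applied to a single count. It is an illustration only, not the GAN-based approach Herman refers to, where noise is usually added during model training rather than to a query result. The records, the `private_count` helper, and the choice of epsilon are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    true_count = sum(1 for r in records if predicate(r))
    # A count changes by at most 1 when one person is added or removed,
    # so its sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [rng.randint(18, 90) for _ in range(1000)]  # made-up patient ages

true_over_65 = sum(1 for a in ages if a > 65)
noisy = [
    private_count(ages, lambda a: a > 65, epsilon=1.0, rng=rng)
    for _ in range(5000)
]
mean_noisy = sum(noisy) / len(noisy)

print("true count      :", true_over_65)
print("one noisy answer:", round(noisy[0], 2))
print("mean of answers :", round(mean_noisy, 2))
```

Each individual answer is blurred enough that one person's presence or absence is hard to infer, while the answers stay centered on the true count. A smaller epsilon means more noise and stronger privacy, at the cost of accuracy.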

