
Synthetic data, real harm

Why is synthetic data used and what challenges does it raise for AI assurance mechanisms?

Dani Shanley

18 September 2025

Reading time: 11 minutes


Whether it’s the viral trend ‘100x ChatGPT’ or John Oliver’s recent special on ‘AI slop’, discussions about AI feedback loops are difficult to ignore. When AI-generated content is fed back into an AI model, it doesn’t take long before we end up drowning in nonsense – say, an abstracted image of Dwayne ‘the Rock’ Johnson, which becomes so basic and devoid of character that it would have the Cubists spinning in their graves.

‘Habsburg AI’ – Jathan Sadowski’s neologism comparing AI systems trained on other AIs’ outputs to ‘inbred mutant[s]’ – and MAD AI models – where MAD stands for ‘model autophagy disorder’, an AI mad-cow disease of sorts affecting models feeding off their own work – both describe problems of synthetic content. Synthetic data is a slightly different thing, but its use leads to comparable issues.

Instead of producing output that people can view, synthetic data is ‘generated using a purpose-built mathematical model or algorithm, with the aim of solving a (set of) data science tasks.’ Programmers deploy it to train AI models mostly to address three sticky problems within machine learning: data scarcity, data fairness and data privacy.

Synthetic data promises to generate unlimited training examples to fill gaps in underrepresented scenarios or edge cases, to produce more representative datasets that correct for historical biases, and to preserve the statistical properties of original datasets while stripping away personally identifiable information.

Besides AI developers, political economists mobilise the notion of synthetic data in conversations around democratising AI production. Here there are clear pros and cons to consider. On the one hand, using synthetic data might reduce dependence on tech giants’ massive datasets. On the other hand, it could further insulate data-making processes from public scrutiny and obfuscate the design choices embedded in data generation systems.

The problem – regardless of the discursive domain – is this: just as generative AI models can produce plausible (but false) text or images, synthetic data generators may create datasets that appear statistically valid while introducing subtle, hard-to-catch distortions and artificial patterns, or missing crucial real-world complexities.

Data pollution becomes insidious when synthetic datasets contaminate training pipelines, creating feedback loops in which models learn from increasingly artificial representations of reality. Model collapse (when models trained on data generated by other AI models degrade in quality and diversity, until they produce distorted or nonsensical outputs) provides a cautionary tale for how synthetic data contamination could undermine the reliability of AI systems across entire domains. Put another way, synthetic data pollution and contamination share a striking family resemblance to AI hallucinations and slop.

As such, synthetic data demands specialised approaches to oversight and quality control that current AI governance and assurance frameworks aren’t fully equipped to handle.

If so-called ‘real’ data is never objective – but rather the product of subjective decisions about what to measure, when to collect, whom to include and how to categorise – synthetic data, devoid of any concrete referents, makes this subjectivity more concentrated and less visible. This places unprecedented power in the hands of developers, who can shape reality through data design while potentially obscuring the constructed nature of these representations behind claims of algorithmic objectivity.

Data (synthetic or not) only ever creates a representation of reality, introducing what researchers call a ‘simulation-to-reality gap’ – the inevitable disconnect between how datasets behave and how the real world functions – which presents both opportunities and risks.

As the importance of synthetic data in AI development grows, we should examine not just how it is being used, but also when and why, and move beyond traditional auditing approaches. It is critical to develop comprehensive assurance systems that can address synthetic data’s distinctive challenges and prevent its promise from becoming a pathway to new forms of harm.

The promises and challenges of synthetic data

Data scarcity

Given the quantities of labelled data needed to develop effective AI, data scarcity remains a fundamental bottleneck. This is evident in domains like healthcare and finance, where collecting sufficient real-world examples can be difficult, risky, labour intensive or just too expensive.

Synthetic data appears to offer an elegant solution: augmenting sparse datasets with artificially generated examples that preserve statistical properties while enabling unlimited scaling at low cost.

So far, so good. However, these promises are predicated upon problematic assumptions. First, they imply that synthetic data can accurately capture the distribution of real data. If a hospital’s patient records show that 60 per cent of heart-attack patients are over 65 and mostly male, synthetic data should reproduce these same patterns. However, creating data that maintains real-world, complex relationships between age, gender, medical history and outcomes is extremely difficult.
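A minimal sketch (with entirely made-up numbers) shows why this first assumption is fragile: resampling each variable’s marginal distribution independently reproduces the headline statistics while erasing the relationships between them.

```python
# A toy sketch, with hypothetical data: naive column-by-column resampling
# preserves marginal statistics but destroys the joint relationships
# (here, between age and outcome) that make the data clinically useful.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 'Real' records: adverse outcomes become more likely with age by construction.
age = rng.normal(70, 10, 1_000).clip(30, 95)
outcome = (rng.random(1_000) < (age - 30) / 100).astype(int)
real = pd.DataFrame({"age": age, "adverse_outcome": outcome})

# Naive 'synthetic' data: each column resampled independently of the others.
synthetic = pd.DataFrame({
    col: rng.choice(real[col].to_numpy(), size=len(real), replace=True)
    for col in real.columns
})

# The marginals match almost perfectly...
print(real.mean().round(2), synthetic.mean().round(2), sep="\n")
# ...but the age-outcome correlation has vanished.
print("real corr:     ", round(real["age"].corr(real["adverse_outcome"]), 2))
print("synthetic corr:", round(synthetic["age"].corr(synthetic["adverse_outcome"]), 2))
```

Real generators are far more sophisticated than this, but the failure mode – matching summary statistics while missing structure – is the same in kind.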

The second assumption is that generated samples will include the same edge cases and anomalies that models will encounter in deployment. A fraud detection system trained on synthetic credit card transactions that captures normal spending patterns but misses unusual scenarios, like someone buying petrol at 3 AM while abroad, is useless at best. Even if they are rare in datasets, edge cases are exactly what a system needs to recognise.

Finally, proponents of synthetic data use assume that its quality can be validated without extensive real-world testing. But to check if synthetic data is good enough, you need real data to compare it against. You can’t verify a translation without knowing the original language. However, if you already have enough real data to validate the synthetic version, why use synthetic data at all?

Data fairness

When data is scarce, certain groups tend to be at best underrepresented and at worst absent from training datasets. And in AI data processing, ‘garbage in, garbage out’ applies to social equity and not just data quality.

Herein lies another opportunity: using synthetic data to create fair and representative datasets. Developers can oversample underrepresented populations and remove discriminatory patterns from training data. If a facial recognition dataset contains predominantly white faces, developers could generate synthetic data of underrepresented racial groups with the same image quality and ‘engineer’ fairness into the training data, rather than correcting the model for bias after training.
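In its crudest form, this kind of ‘engineering’ amounts to rebalancing, as in the toy sketch below. The groups and proportions are hypothetical, and a real pipeline would generate new images rather than resample existing records.

```python
# A toy sketch of rebalancing training data by oversampling an
# underrepresented group. Groups and proportions are hypothetical; a real
# pipeline would synthesise new examples rather than duplicate records.
import pandas as pd

faces = pd.DataFrame({"group": ["A"] * 900 + ["B"] * 100})  # 90/10 imbalance
print(faces["group"].value_counts(normalize=True))

minority = faces[faces["group"] == "B"]
extra = minority.sample(n=800, replace=True, random_state=0)  # synthetic stand-ins
balanced = pd.concat([faces, extra], ignore_index=True)
print(balanced["group"].value_counts(normalize=True))  # now 50/50
```

Every step here embeds a choice: which groups count, what ratio counts as ‘balanced’, and what quality the generated examples must meet.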

Now, you don’t need to read a whole lot of AI criticism for this ‘fix’ to register as problematic.

First off, when datasets become polluted or contaminated, they are not just unreliable, but unrepresentative of the populations and phenomena they claim to model. Synthetic datasets will still be biased and incomplete, rendering AI systems not simply inaccurate but unfair, systematically disadvantaging certain groups and encoding and amplifying existing inequalities in ever more sophisticated ways.

Second, asking developers to decide how to ‘balance’ datasets or which fairness metrics to optimise for means asking an awful lot. We are tasking them with making value-laden choices that many trained ethicists would struggle with. And their choices will always reflect their own backgrounds, assumptions, blind spots and ideas about what constitutes fair representation.

Data privacy

Organisations struggle to balance the need for rich training data with data protection responsibilities.

Privacy-preserving machine learning approaches involve trade-offs. Differential privacy (adding calibrated noise to datasets or model outputs to prevent individual records from being identified) or federated learning (training models across local, decentralised devices or institutions and then only sharing model updates, omitting the raw data) improve privacy, but often at the cost of model performance or development complexity. At the same time, regulations like the GDPR can make collecting, storing and processing personal data for AI training difficult and expensive.
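For a feel of the first trade-off, here is a minimal sketch of the Laplace mechanism that underpins differential privacy (the records and epsilon values are illustrative only): stronger privacy budgets visibly degrade the accuracy of even a simple query.

```python
# A minimal sketch of the Laplace mechanism behind differential privacy:
# calibrated noise is added to a query result so that any one individual's
# record has a provably limited effect on the output. Values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=10, sigma=0.5, size=10_000)  # hypothetical records

def dp_mean(values: np.ndarray, epsilon: float, upper_bound: float) -> float:
    """Differentially private mean: clip, then add Laplace noise scaled to
    the query's sensitivity (upper_bound / n) over the privacy budget epsilon."""
    clipped = np.clip(values, 0, upper_bound)
    sensitivity = upper_bound / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

print("true mean:", incomes.mean().round(2))
for eps in (0.01, 0.1, 1.0):  # smaller epsilon = stronger privacy, more noise
    print(f"epsilon={eps}: {dp_mean(incomes, eps, upper_bound=200_000):.2f}")
```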

Once again, synthetic data appears to provide a solution. By training AI systems on it, a programmer can preserve the statistical properties of real data without using personal information or worrying about regulatory compliance. Synthetic data seems to resolve the tension between data utility and privacy protection, turning the simulation-to-reality gap into a privacy feature.

However, the quest for high fidelity – that is, achieving a close statistical match between synthetic and real-world data – creates quite the paradox. The more realistic and useful synthetic data becomes, the greater the risk that it inadvertently reveals private information. If synthetic medical data captures the rare combination of a 45-year-old with a specific genetic condition living in a small town, it might recreate enough detail to reidentify the original patient. As the simulation-to-reality gap narrows and synthetic data becomes more faithful to real-world patterns, privacy protection is eroded.

In fact, the concentrated information that synthetic-data generation systems hold about the populations they model could also make them attractive targets for cyberattacks.

Quality assurance mechanisms for synthetic data

What is ‘good’ synthetic data, and how should we use it?

Despite these challenges – which arguably lend weight to the argument that using synthetic data can never be ethical – Apple, Microsoft, Google, Meta, OpenAI and IBM (among others) are already doing it.

Questions about what constitutes ‘good’ synthetic data show a troubling gap between aspiration and practice. Quality assurance practitioners acknowledge the critical importance of data validation (checking that data meets expected quality, format, and accuracy standards), but struggle to define what makes synthetic data useful and trustworthy. Current validation practices often amount to informal ‘spot-checking’ or ‘eyeballing’ instead of systematic evaluation, which reflects the immaturity of standards in this rapidly evolving field.
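What more systematic evaluation could look like is sketched below: distribution-level tests comparing each synthetic column against its real counterpart. The data and the significance threshold are placeholders, not an endorsed standard.

```python
# A sketch of one systematic check that goes beyond 'eyeballing': compare
# each synthetic column against the real one with a two-sample
# Kolmogorov-Smirnov test. Thresholds and data are placeholders, not a standard.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def validate_columns(real: pd.DataFrame, synthetic: pd.DataFrame,
                     alpha: float = 0.05) -> pd.DataFrame:
    """Flag numeric columns whose synthetic distribution differs
    detectably from the real one."""
    rows = []
    for col in real.select_dtypes("number").columns:
        stat, p_value = ks_2samp(real[col], synthetic[col])
        rows.append({"column": col, "ks_stat": stat,
                     "p_value": p_value, "flagged": p_value < alpha})
    return pd.DataFrame(rows)

rng = np.random.default_rng(2)
real = pd.DataFrame({"amount": rng.gamma(2.0, 50.0, 5_000)})
synthetic = pd.DataFrame({"amount": rng.gamma(2.2, 48.0, 5_000)})  # slightly off
print(validate_columns(real, synthetic))
```

Passing such checks is necessary rather than sufficient: column-level tests say nothing about joint relationships, rare edge cases or downstream model behaviour.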

Like any other dataset, synthetic data requires thorough review, careful curation and systematic filtering. It also requires additional validation layers, ideally paired with fresh real-world data to confirm synthetic assumptions and fill gaps that artificial generation might miss.

Without quality assurance frameworks specifically designed for synthetic data’s unique challenges, organisations risk deploying models trained on flawed datasets, undermining both performance and fairness objectives.

Much of the work required has well-established precedents within discussions about AI ethics, quality assurance mechanisms, regulation and governance.

Transparency standards

Policymakers working on data standards will need to mandate clear documentation of how synthetic data is generated, what synthetic-specific validation it has undergone, and how it will be used.

Synthetic data transparency requirements must also be tailored to address areas such as: the use of synthetic validation data, which creates a ‘hall of mirrors’ problem where synthetic data is checked against other synthetic data; potential circular validation problems; privacy protection measures; re-identification risks; and provenance tracking that shows which real datasets were used to create the synthetic version, ensuring the entire process can be audited long after the model is adopted.
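What such documentation might minimally record is sketched below. The fields are illustrative suggestions, not an existing standard or mandated schema.

```python
# An illustrative provenance record for a synthetic dataset. The fields are
# suggestions sketching what transparency requirements might mandate,
# not an existing standard or schema.
from dataclasses import asdict, dataclass, field
import json

@dataclass
class SyntheticDataProvenance:
    dataset_id: str
    generator: str                      # model/algorithm used for generation
    source_datasets: list[str]          # real datasets the generator was fit on
    validation_data_origin: str         # 'real' or 'synthetic' (hall-of-mirrors risk)
    privacy_measures: list[str]         # e.g. differential privacy parameters
    known_limitations: list[str] = field(default_factory=list)

record = SyntheticDataProvenance(
    dataset_id="claims-synth-v3",
    generator="tabular GAN, vendor X",
    source_datasets=["claims-2019-2023 (de-identified)"],
    validation_data_origin="real",
    privacy_measures=["differential privacy, epsilon=1.0"],
    known_limitations=["rural postcodes underrepresented"],
)
print(json.dumps(asdict(record), indent=2))
```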

Independent auditing frameworks

Auditors will need to create frameworks focusing on the data generation algorithms themselves, rather than just the final datasets. Independent real-world testing – using the model on actual data – will be necessary to establish whether performance drops and to detect circular validation, where both training and test data share synthetic origins. Adversarial techniques will help uncover hidden biases or privacy vulnerabilities that traditional statistical validation might miss.
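One way an auditor might operationalise this is sketched below: train one model on synthetic data and one on real data, then score both on genuinely real held-out records; a large gap flags a simulation-to-reality problem. The data and models here are stand-ins.

```python
# A sketch of an auditor's real-world test: compare a synthetic-trained model
# against a real-trained baseline on held-out real records. Data and model
# choices are stand-ins for whatever the audited system actually uses.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 'Real' data, with a test set never shown to any generator or model.
X_real, y_real = make_classification(n_samples=4_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.25, random_state=0)

# Stand-in 'synthetic' data: an imperfect, noisy copy of the real training set.
rng = np.random.default_rng(1)
X_synth = X_train + rng.normal(0.0, 1.0, X_train.shape)
y_synth = y_train

for name, (X, y) in {"real-trained": (X_train, y_train),
                     "synthetic-trained": (X_synth, y_synth)}.items():
    model = LogisticRegression(max_iter=1_000).fit(X, y)
    print(f"{name}: accuracy on real held-out data = "
          f"{accuracy_score(y_test, model.predict(X_test)):.3f}")
```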

Unlike standard data audits that focus on collection and curation processes, synthetic data auditing demands multidisciplinary expertise in generation algorithms, privacy-preserving techniques, and the unique failure modes of synthetic data systems.

Ethical training

Ethical training in this context requires specialised focus on the unprecedented power developers wield to construct reality through data design. This means moving beyond data ethics’ traditional focus on data collection and use.

Interdisciplinary support must include not just ethicists and domain experts, but also specialists in synthetic data generation and privacy-preserving techniques.

Public engagement

Researchers and public organisations will need to engage communities on how they are represented in synthetic datasets, including the algorithmic choices about what constitutes a ‘fair’ and ‘accurate’ representation of their experiences.

Public organisations will have to grapple with whether synthetic data can indeed democratise AI development, bypassing large companies’ data resources, or will just replace real-world data relationships with ever more obscure algorithmic intermediaries. This will require new forms of participation and engagement.

Effective governance

Unlike traditional data governance that focuses on the collection, storage and use of existing information, synthetic data governance must address the power to create new ‘data realities’.

Regulators will need to ensure that laws apply not only to the use of data but to the algorithmic construction of reality through synthetic data. This requires overseeing the design choices embedded in data generation systems.

Finally, policymakers should establish governance mechanisms to ensure synthetic data serves broader social interests rather than simply enabling more efficient value extraction from limited real-world information.

Synthetic data is no panacea. Its most promising applications may be those that acknowledge the simulation-to-reality gap as a feature rather than a bug, using synthetic data for stress-testing, exploration and augmentation while requiring explicit disclosure, real-world validation and ongoing monitoring of performance gaps.

At the same time, for each instance of use, we shouldn’t forget to ask: what is the problem that synthetic data is intended to solve? Technical questions about data quality quickly become questions about justice, fairness and human rights. We need to remain attentive to synthetic data’s potential to obscure rather than solve underlying data quality problems, the roots of which are rarely technical. They are social and political all the way down.
