The convergence of ground-breaking advances in genomics and artificial intelligence with the widespread adoption of electronic health records over the last few decades has ushered in an era of ‘big data’ in medicine. A complex web of patient information, genetic insights and computational capabilities, alongside new forms of analysis, is now tasked with revolutionising the way we understand, diagnose and treat a range of diseases and conditions.
At the same time, ongoing debates about how to harness, digest and make sense of increasingly large and complex data sets highlight the importance of critically assessing how this relatively new era of medicine is shifting the ways we uncover evidence and assess the efficacy of new interventions.
This article considers the social implications and risks of a notable shift in current approaches to medical data collection and analysis: from hypothesis-driven data collection to data-first study design. In simple terms, this shift means that researchers now often focus on generating the largest datasets possible, rather than building datasets based on specific, testable hypotheses.
When data collection is driven by a hypothesis, scientists seek to enrol trial participants who meet certain inclusion and exclusion criteria. These studies are designed to test a single hypothesis, or a small set of hypotheses, about a condition or intervention. The data-first method, by contrast, requires researchers to collect large amounts of biometric and demographic information from as many people as possible, without specific conditions, treatments or hypotheses in mind when the data is originally collected (e.g. the All of Us Program).
This change in approach has real consequences, including new forms of bias and discrimination, new ways of assessing the usefulness of an observation or intervention and new health policy priorities. Understanding and analysing these consequences is critical to responsibly ushering in and managing this new era of data-driven medicine.
Bias and discrimination are certainly not new problems in healthcare – standard models of clinical evidence, such as randomised controlled trials, have long been criticised for over-enrolling white, male participants from urban centres, leading to less effective care for underrepresented, minoritised populations. At the same time, the turn towards data-first evidence and knowledge creation introduces new forms of bias and discrimination that require ethical review and mitigation. Take, for example, one of the most common uses of AI-powered genomic prediction: polygenic risk scoring.
Polygenic risk scores use data on genotype (the genetic make-up of a person, which may be recorded in a genetic database) and phenotype (the observable physical traits of a person, which can be recorded in electronic health records data) to estimate (‘score’) an individual’s relative genetic predisposition to developing a disease compared to others in a population.
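At its core, a polygenic risk score is a weighted sum: each genetic variant a person carries is multiplied by an effect size estimated from association studies, and an individual’s total is then compared against a reference population. A minimal sketch of that logic follows; the variant IDs and effect sizes are illustrative placeholders, not real associations.

```python
# Minimal sketch of polygenic risk scoring: a weighted sum of risk-allele
# counts (0, 1 or 2 copies per variant) times effect sizes estimated from
# genome-wide association studies. All variant IDs and weights below are
# illustrative placeholders, not real genetic associations.

effect_sizes = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}

def polygenic_risk_score(genotype):
    """genotype maps variant ID -> risk-allele count (0, 1 or 2)."""
    return sum(effect_sizes[v] * genotype.get(v, 0) for v in effect_sizes)

def percentile_rank(score, population_scores):
    """Relative standing: fraction of a reference population scoring lower."""
    below = sum(1 for s in population_scores if s < score)
    return below / len(population_scores)
```

Note that the score is only meaningful relative to a reference population, which is precisely where the representation problem bites: if the reference cohort is ancestrally unrepresentative, the percentile ranks it produces are miscalibrated for other groups.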
While AI-driven polygenic risk scoring faces old issues of bias due to a lack of ancestral diversity or representation in genetics databases, it also incorporates known biases in clinical care itself. These biases can be inscribed in the electronic health records data that is then used to power prediction models. Existing discrimination can even be exacerbated by AI models. For example, researchers analysing chest X-ray diagnosis models found that underdiagnosis was higher in already underserved populations because models can ‘amplify a known source of error in the process of data generation or data distribution’.
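One common way to surface this kind of amplified bias is a simple subgroup audit: comparing a model’s miss rate (false negatives, i.e. underdiagnosis) across demographic groups. The sketch below illustrates the idea with made-up records; the group labels and data structure are assumptions for illustration, not the cited study’s method.

```python
# Sketch of a subgroup underdiagnosis audit: compare a model's
# false-negative rate (missed diagnoses) across demographic groups.
# Records are (group, true_label, predicted_label) tuples; labels are
# 1 (disease present) or 0 (absent). All data here is illustrative.

from collections import defaultdict

def underdiagnosis_rates(records):
    """False-negative rate per group: P(predicted negative | truly positive)."""
    positives = defaultdict(int)   # truly-positive cases per group
    missed = defaultdict(int)      # of those, how many the model missed
    for group, truth, pred in records:
        if truth == 1:
            positives[group] += 1
            if pred == 0:
                missed[group] += 1
    return {g: missed[g] / positives[g] for g in positives}
```

A large gap between groups in this metric is the pattern the chest X-ray researchers reported: the model misses disease more often in populations that were already underserved in the data used to train it.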
Thus, the data-first approach to medical study design must grapple not only with the inclusion of representative populations, but also the potential to exacerbate existing inequities by relying on data collected from discriminatory systems.
Data-first approaches to medical evidence also shift the way researchers make sense of and value new information emerging from large datasets.
As the National Academies of Sciences, Engineering, and Medicine reported in 2013, ‘Vast computational power (with associated sophistication of information technology) has become affordable and widely available. This capability makes it possible to harvest useful information from actual patient care (as opposed to one-time studies), something that previously was impossible.’ When researchers are tasked with sorting through ‘real-world evidence’ from continuously updated records of thousands of clinical encounters, there is no obvious beginning or endpoint to the task of gathering evidence, as there would be in hypothesis-driven clinical trials. An immediate question then becomes: how useful is a newly found piece of information, and when should it be communicated to patients or clinicians?
To make sense of this, a new concept has proliferated in biomedical literature over the past 10 years: actionability. Actionability is a term particularly popular in clinical genomics, where it helps assess whether new data or technological innovation warrant action, and reflects an urgency towards communicating information that may provide clinicians or patients with immediate or future benefit. This effectively means that researchers face a choice: share information that may be helpful to patients or clinicians now, based on initial evidence – inferred, for instance, from studying a dataset – or wait for expensive and time-consuming clinical trials to generate standard outcomes data.
The trade-off in focusing on actionability in fields like genomics is that it is often one step removed from the patient-centred outcomes that matter most. Actionability directs our attention primarily towards whether a test or piece of information can lead to action, not whether that action has proven benefits. Compared with other ways of evaluating the importance of a new intervention, like clinical validity (whether the intervention provides accurate information) or utility (whether the intervention improves care), actionability may steer us towards overuse of interventions that have only limited or untested effectiveness.
Finally, the data-first approach to evidence generation in biomedicine leads to changing health policy priorities and goals.
As James Tabery has outlined in his recent book, Tyranny of the Gene, efforts to collect genetic and other biometric data from millions of research participants have trumped less technologically advanced or flashy approaches to public health that focus on environmental or social causes of illness and disease.
Tabery focuses on the particularly American obsession with individualism and precision medicine over collective solutions to improving health inequities and outcomes. However, his larger argument about the dangers of over-investing in genomic or other advanced health technologies is applicable across a range of geopolitical contexts. We do not need genomics or AI to address the most fundamental causes of poor health outcomes, and our over-emphasis on these arenas is likely to increase the costs of care for everyone.
New opportunities emerge, alongside real consequences and potential harms, when large, evolving datasets designed for general research purposes become a key source of new medical evidence. In contrast to formally designed clinical trials built around specific hypotheses, this approach requires additional scrutiny. Researchers face an intensified version of a long-known problem in medical research: the streetlight effect, whereby they are more likely to search for new evidence and knowledge only where it is easiest to look.
The data in these existing databases is easy to access and analyse, but may not be best suited to the specific questions we want to answer, or may hold a range of biases that make its generalisability more tenuous. This approach may also shift how we understand the urgency of health data, and what levels of evidence are required before it is appropriate to offer new information or interventions to patients.
While the potential benefits of advances in genomics, AI and health information technologies are vast, we must continue to critically appraise the ways these technologies shift approaches to medical evidence. This is necessary to ensure that our health systems see as many of the benefits as possible, while minimising the potential for discrimination, unnecessary care or poor health policy.
This joint project with the Nuffield Council on Bioethics explores how AI is transforming the capabilities and practice of genomic science.