Cancer Data Science Pulse
Trusting the Data—A Look at Data Bias
From molecular signatures and clinical characteristics to nationwide statistics, data are helping to frame new research questions, train algorithms, pinpoint diagnoses, and make cancer treatment more precise.
Yet, with this increasing reliance on data, there’s also growing skepticism. More and more, studies uncover bias (often unintentional) in these data, prompting the question: Can we truly trust the data?
Vivian Ota Wang, Ph.D., lead of the Policy, Ethics, and COVID Activities Unit in NIH’s Office of Data Science Strategy, has nearly 30 years of experience in bringing ethics and equity to NCI and NIH in the research, policy, and workforce arenas.
In this Q&A, she shares her perspectives on data bias and outlines ideas for making data more equitable, fair, and useful to the greatest number of people.
We hear a lot about data bias, but not everyone may know what that means. How would you define data bias?
A top-level definition is data that aren’t a true reflection of what we’re measuring. We might have omitted certain variables or included data that contains human bias (e.g., against subgroups of people). Within a population, we can find bias in ancestry (as in genomic data); demographics (e.g., sex, gender, age, language, disability status, region of the country, rural versus urban); socioeconomics (e.g., income, educational attainment, access to healthcare); or methodological issues, such as how a disease is measured, or how treatment outcome is defined.
With the advent of artificial intelligence (AI), this definition gets even more complicated. Take, for example, large language models such as ChatGPT. Scientists initially trained this program using massive amounts of internet data, both correct and incorrect. But the data are only part of the issue. Another problem is how the model handles incomplete or ambiguous questions. Rather than asking for clarification, ChatGPT simply “guesses” what the question means, which often leads to unintended responses. Limiting bias in these high-impact technologies enough to make them safe for use in the biomedical field is a huge hurdle, and one that companies around the world are working to address.
What causes data bias?
We need data that’s inclusive and balanced if we’re to draw assumptions and conclusions that reflect real-life experiences. Otherwise, bias can result in a range of downstream consequences, from minor inconveniences, such as an inability to open your cell phone, to serious health risks, such as when disease is misdiagnosed.
For example, in one study using data from 3,618 people in NCI’s The Cancer Genome Atlas Program, the researchers found that, on average, genetic tests to predict the efficacy of certain cancer treatments weren’t as effective for people of African or Asian ancestry compared with those of European ancestry. Similarly, Hsu and colleagues tested the performance of an AI model designed to screen for breast cancer. That model, trained using a large data set of routine screenings from predominantly White populations, wasn’t nearly as robust in women with a prior history of breast cancer and Hispanic women.
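One way to surface the kind of gap described above is to report a model's performance separately for each subgroup rather than as a single overall number. The following is a minimal sketch in Python, not the method used in either study; the function name `accuracy_by_group`, the group labels, and the toy values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return the fraction of correct predictions within each subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical toy labels: group "A" is well represented, group "B" is not.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)

print(f"overall accuracy: {overall:.2f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.2f}")
```

In this toy case the overall accuracy looks acceptable while the smaller group fares much worse, which is the pattern a single summary number can hide.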
Clearly genetic data need more diversity to be most accurate. Are there other types of data that tend toward bias?
Really, no type of data is immune to bias. For example, Buolamwini and Gebru found that AI facial recognition applications were more likely to produce errors when “reading” women compared with men and performed better with lighter- versus darker-skinned individuals. Researchers unwittingly introduced systematic errors when collecting the data. They used default camera settings for the photographs, leading to better exposures for those with lighter versus darker pigmented skin. The way they posed the source subjects, along with image resolution and illumination, also helped skew the results.
Although not intentional, the researchers had a procedural bias that was deeply rooted in their model.
So, researchers may create bias without even knowing it. Are there other ways bias can inadvertently creep into research or health care?
Time is another consideration because data can change over time. Take, for instance, a tool that was developed to measure a patient’s risk for sepsis. That tool allows physicians to intervene early in the course of disease, before serious complications develop. In assessing the tool, however, Henry and colleagues found that not all data pertaining to sepsis were created equal. Earlier coding practices tended toward severe cases. This, in turn, led to bias in the data.
We need to think about how we catalog data and what ontologies we’re going to use. We need to consider how different staff (data scientists and biomedical researchers versus clinicians) approach the collection, processing, and use of data. All these factors contribute to data bias.
What are the implications of data bias?
Data bias has deep human implications and ramifications for both research and clinical care. Consider AI/machine-learning (ML) applications, such as those designed to identify skin cancer, which were trained using data only from lighter-skinned melanoma patients. Those applications often miss or misdiagnose cases in patients with darker complexions. The FDA is taking steps to eliminate bias like this, including developing an action plan to address AI and ML issues. We need to ensure these tools lead to accurate diagnostics. In terms of drug development, it’s clear that some people respond better to certain medications. It’s vital our clinical trials include the full spectrum of people so we can identify which medications work best for each person. This is the cornerstone of precision medicine.
What’s the first step in eliminating data bias?
The first step is to recognize that large-scale data are not created equal, as noted by Tasci and colleagues. Rather than assuming there’s no bias in the data, we need to be critical of data and its origins. Once we recognize bias exists, we can identify data sets with known issues, alert those using them, and take steps to mitigate the risks.
Of course, it would be ideal to eliminate the potential for bias right from the start. We can do that by recruiting a diverse population into research studies and clinical trials. But this is easier said than done. For decades, there’s been a general mistrust of science among some members of the population, and the recent COVID-19 pandemic has further increased skepticism and distrust in some communities. People also may have privacy concerns about how their data are used and shared. Unfortunately, these issues often impact disadvantaged or marginalized groups the most.
Diversity in our data is a well-recognized need, but the solutions have been slow to come to fruition. That said, there are successful programs that are helping to recruit women, minorities, and older populations into clinical trials. This includes efforts at NIH and NCI.
Regulatory agencies like the FDA are also coming on board. For example, researchers and companies seeking approval for late-stage clinical trials are now required to submit a plan demonstrating diversity among their trial participants. This effort is one way to make data better for everyone.
As researchers, we need to be thoughtful and intentional about how we recruit and include diversity. There are solutions, such as retooling informed consent, engaging the community in outreach, and attending to algorithm and data collection biases. Improving diversity will take time, but it’s time we don’t have, given the urgency of the research and lives at stake.
Are there other ways to mitigate bias in data?
To identify personal or societal biases (even if unintended), we need to understand how researchers collect and generate data. If we can de-bias data, it will improve data fairness all the way down the line, including how we train and interpret algorithms.
Likewise, as described in this study, a thorough evaluation of the research design and data-level methods can help offset bias. In data collection, this might mean oversampling and/or designing algorithms that account for missing information.
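Oversampling, mentioned above, can be illustrated with a minimal Python sketch. This shows simple random oversampling with replacement, which is only one of several possible approaches and is not taken from the cited study; the function name `oversample`, the `group` field, and the toy records are hypothetical.

```python
import random
from collections import Counter

def oversample(records, group_key, seed=0):
    """Randomly duplicate records from smaller groups until every group
    is as large as the largest one (random oversampling with replacement)."""
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_group = {}
    for rec in records:
        by_group.setdefault(rec[group_key], []).append(rec)
    target = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        # Draw extra copies at random to make up the shortfall.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical toy data: 8 records from group "A", 2 from group "B".
toy = [{"group": "A", "value": i} for i in range(8)]
toy += [{"group": "B", "value": i} for i in range(2)]

balanced = oversample(toy, "group")
counts = Counter(rec["group"] for rec in balanced)
print(counts)
```

Duplicating records balances group sizes but adds no new information about the underrepresented group, so it complements, rather than replaces, more inclusive data collection.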
Researchers also can aggregate data at a population level. We learned from COVID-19 that tracking wastewater could offer clues about the prevalence of virus variants. Though purely epidemiological, these data do help form a more accurate picture of how COVID-19 spreads within a specific population, without relying solely on self-reports or interviews. In another study, Brewer and her colleagues showed that real-time data could be used to identify women at risk for ovarian cancer, based on their purchases of over-the-counter medications.
Diversity needs to factor prominently in all our work and in our workforce. It simply makes cancer research and all our biomedical research better, giving us broader perspectives, more life experiences, and data that more closely matches our populations and their lived experiences.
Brewer HR, Hirst Y, Chadeau-Hyam M, Johnson E, Sundar S, Flanagan JM. Association Between Purchase of Over-the-Counter Medications and Ovarian Cancer Diagnosis in the Cancer Loyalty Card Study (CLOCS): Observational Case-Control Study. JMIR Public Health Surveill 9:e41762; 2023.
Buolamwini J, Gebru T. Gender shades: Intersectional accuracy disparities in commercial gender classification. Proc Machine Learning Res 81:77–91; 2018.
Giovanola B, Tiribelli S. Beyond bias and discrimination: redefining the AI ethics principle of fairness in healthcare machine-learning algorithms. AI Soc 38(2):549–563; 2023.
Henry KE, Hager DN, Pronovost PJ, Saria S. A targeted real-time early warning score (TREWScore) for septic shock. Sci Transl Med 7(299):299ra122; 2015.
Hsu W, Hippe DS, Nakhaei N, et al. External Validation of an Ensemble Model for Automated Mammography Interpretation by Artificial Intelligence. JAMA Netw Open 5(11):e2242343; 2022.
Kozlov M. FDA to require diversity plan for clinical trials [published online ahead of print, 2023 Feb 16]. Nature 10.1038/d41586-023-00469-4; 2023.
Larsen DA, Wigginton KR. Tracking COVID-19 with wastewater. Nat Biotechnol 38(10):1151–1153; 2020.
Nassar AH, Adib E, Abou Alaiwi S, et al. Ancestry-driven recalibration of tumor mutational burden and disparate clinical outcomes in response to immune checkpoint inhibitors. Cancer Cell 40(10):1161–1172; 2022.
Ott T, Dabrock P. Transparent human - (non-) transparent technology? The Janus-faced call for transparency in AI-based health care technologies. Front Genet 13:902960; 2022.
Sabet C, Bajaj SS, Stanford FC. Recruitmentology and the politics of consent in clinical research. Lancet 401(10373):262–263; 2023.
Tasci E, Zhuge Y, Camphausen K, Krauze AV. Bias and Class Imbalance in Oncologic Data-Towards Inclusive and Transferrable AI in Large Scale Oncology Data Sets. Cancers (Basel) 14(12):2897; 2022.