Cancer Data Science Pulse

Why We Love Diverse Data

February 12–16, 2024, is “Love Data Week!” At CBIIT, we LOVE data. We particularly love diverse data.

We’re not alone in our love of diverse data. Each year, more and more cancer researchers are recognizing the value of generating, collecting, sharing, and using data that truly reflect members of today’s diverse population. We asked a few of those data-loving scientists to tell us why they value diverse data and what tips they have for helping others (like you) increase diversity in research data. Check out their responses below.

We’d love to hear from you too! Use the comment box at the end of this blog to tell us why you love diverse data!

First, let’s meet the scientists:

 Ajay Aggarwal, Ph.D., professor and clinical oncologist at the London School of Hygiene & Tropical Medicine, United Kingdom

 Lee A.D. Cooper, Ph.D., associate professor of pathology; director of computational pathology; director of the Center for Computational Imaging and Signal Analytics at Northwestern University Feinberg School of Medicine, Chicago, IL

 Mia M. Gaudet, Ph.D., senior scientist at NCI’s Division of Cancer Epidemiology and Genetics



 Ophira Ginsburg, M.D., senior advisor for clinical research at NCI’s Center for Global Health



 Laritza Rodriguez, M.D., Ph.D., program officer at NCI’s Center to Reduce Cancer Health Disparities



 Haoyu Zhang, Ph.D., Earl Stadtman tenure-track investigator at NCI’s Division of Cancer Epidemiology and Genetics


Now, let’s see what they say about using data with today’s high-tech tools, such as machine learning and artificial intelligence (AI).

Why is it important to use diverse data in today’s models?

Dr. Laritza Rodriguez: Our data models and applications are only as good as the data itself. The old adage, “garbage in, garbage out,” first used in 1957, remains just as true today. It’s completely irrelevant how fast or powerful a computer is. If we want robust and consistent results from our applications, we need good data. The only possible way we can understand today’s healthcare problems and solutions is to study the data from the whole population.

Where can we find diverse data?

Dr. Lee A.D. Cooper and Dr. Mia Gaudet: We recently published a study about a machine learning model (called Histomic Prognostic Signature, or “HiPS”) for predicting breast cancer prognosis. We knew it was important to use diverse data sets to increase the model’s prognostic power.

We collaborated with the American Cancer Society to use data from their Cancer Prevention Study-II Nutrition Cohort. This is a long-term prospective study of cancer incidence in the United States and Puerto Rico and includes approximately 184,000 men and women. For our study, we needed comprehensive data from a large number of breast cancer cases.

Our conclusions benefitted from diverse data that included both large and small community hospitals. The publicly available data sets today are typically from patients diagnosed and treated at large academic medical centers. Patients at these larger centers may have different demographics or circumstances (i.e., social determinants) than those treated at small, community hospitals. Furthermore, some diagnosing pathologists at community hospitals may lack subspecialty training in breast pathology, which we know improves the quality of diagnoses. Using these diverse data gave us confidence that our model would be beneficial for all patients, not just those treated at one particular place.

What about diversity in genetic data?

Dr. Haoyu Zhang: Data from genome-wide association studies (GWAS) are mostly Euro-centric (i.e., primarily from people of European ancestry). We developed a tool to help increase diversity in GWAS data, as noted in an October 2023 news article on this website.

We used GWAS data from 23andMe, Inc., the Global Lipids Genetics Consortium, All of Us, and the UK Biobank. These data included more than 5 million people of diverse ancestry, including Latino, African American, East Asian, and South Asian.

When we include genetic information from a wide array of ethnic backgrounds, we can uncover associations between genetic markers and diseases that might be specific to non-European populations or that manifest differently across ethnic groups. This gives us a more complete understanding of genetic influences on health and can drive personalized treatment plans that are effective across the entire spectrum of human diversity.

How do we generate more diverse data?

Dr. Ophira Ginsburg: It’s not a simple solution. In my view, the key to more equitable inclusion is broad, comprehensive, and sustained engagement from all the relevant stakeholders—from researchers and clinicians to patients and their families. This is critical, especially for people who are living with or recovering from cancer, who should be engaged at all phases of the clinical research life cycle.

In working on the Lancet Commission (“Women, Power, and Cancer”), we were all struck by how little we know about sex as a biological variable in clinical cancer research! We are, in many ways, just learning how sex (and related factors) influences the way a drug will work and its toxicity. I’d also add that few investigators have studied how gender, a social construct with deeply embedded power dynamics, translates to a real-world setting (e.g., influencing how a woman accesses care or her decision to join a clinical trial).

Dr. Ajay Aggarwal: We’re focusing on improving treatment care and access. We’re linking treatment-access data with travel-time information so we can flag inequalities in how people find and access treatment. In addition, we’re working on real-world data models to determine how best to deliver cancer services, in terms of how far the patient needs to travel, hospital capacity, and, most importantly, outcomes. We’re designing our model to work in different health systems across a range of cancer and modality types, with the goal of delivering better care to patients who are in greatest need.

Are we making progress?

Dr. Ophira Ginsburg: Yes! We’ve made considerable progress over the years. For example, with respect to gender parity, we now have more women enrolled in trials (as well as more female researchers leading those trials) than ever before. In the United States, we’re leading the way with significant policy changes. As recently as the 1970s, the U.S. Food and Drug Administration banned women of childbearing age from clinical trials. With the NIH Revitalization Act of 1993, we started actively recruiting women of all ages and backgrounds to take part in research. In 2016, NIH implemented the “Sex as a Biological Variable” policy for research and, more recently, the policy for inclusion across the lifespan, which further enhances this mandate.

We still have a long way to go to realize the full range of benefits, in terms of the science, and its potential impact on population health. But, at NCI we’re committed to advancing the best science to reduce suffering and death from cancer for all people—around the globe.

Dr. Ajay Aggarwal: I’ve seen similar progress in the United Kingdom. I serve as director of 10 national audits of cancer care in the UK. With these audits, which include cancers of the prostate, breast, and bowel, we’re able to link data from diagnostic, surgical, radiotherapy, and systemic treatment for all patients treated in any hospital in the National Health Service (which covers 95% of the population’s cancer care). We’re using these diverse data sets to develop the first public outcome reporting programs for radiotherapy and (soon) systemic therapy. Our aim is to drive improvements in cancer care and to give patients greater transparency for making decisions about their treatment.

Are there other ways researchers might address a lack of diversity in their data sets?

Dr. Lee A.D. Cooper and Dr. Mia Gaudet: We have a great opportunity to use AI to reduce health disparities, for example by using tools that offer specialty-level diagnoses in places where specialists aren’t available. However, we want to point out that there are risks if we don’t consider the differences in populations. If the data we use to develop and validate models doesn’t match the population we’re applying that model to, then AI may actually increase disparities.

Most importantly, data diversity is something we need to consider right from the start, before generating data, and not after the fact.

Dr. Laritza Rodriguez: It’s ideal to start with good-quality data right from the beginning. A number of proven methods can help us achieve statistical power. We can optimize machine learning models using algorithms to help balance data sets (e.g., undersampling the majority class or oversampling the minority class).

Likewise, we can address bias through random oversampling of the minority class. We need to apply these methods carefully, because they can also lead to overfitting, where the model is so specific to the training data that it no longer generalizes to new data. To avoid overfitting, we can use “random oversampling with noise,” where we introduce noise/artifacts (i.e., irrelevant information) to each newly created sample, so that the new samples are not exact copies of existing data points.
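As a rough illustration of random oversampling with noise, here is a minimal Python sketch. It is not taken from any NCI tool. The function name `oversample_with_noise`, the `noise_scale` parameter, and the toy numbers are all assumptions made for this example. The idea is to pick minority-class rows at random and add a small amount of Gaussian noise to each copy.

```python
import random
import statistics

def oversample_with_noise(rows, n_new, noise_scale=0.05, seed=0):
    """Return n_new jittered copies of randomly chosen minority-class rows.

    noise_scale is a hypothetical tuning knob: the noise standard deviation
    is noise_scale times each feature's own standard deviation.
    """
    rng = random.Random(seed)
    # Per-feature spread, so the noise is proportional to each feature's scale.
    spreads = [statistics.pstdev(column) for column in zip(*rows)]
    new_rows = []
    for _ in range(n_new):
        base = rng.choice(rows)  # random oversampling: pick an existing row
        new_rows.append([
            value + rng.gauss(0.0, noise_scale * spread)  # add the noise
            for value, spread in zip(base, spreads)
        ])
    return new_rows

# Toy minority class with two features (made-up values).
minority = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5]]
majority_size = 8  # assumed size of the majority class

synthetic = oversample_with_noise(minority, majority_size - len(minority))
balanced_minority = minority + synthetic
print(len(balanced_minority))
```

Because each new row is a perturbed copy and not a duplicate, a model is less able to memorize individual minority samples. Too much noise can blur the class boundary, though, so the scale is something to validate and not assume.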

With very imbalanced data sets, we can use the Synthetic Minority Oversampling Technique (SMOTE). Put simply, we create synthetic samples whose features closely mimic the original data, based on each minority sample’s closest matches in the data (i.e., the k-nearest neighbors). This technique isn’t an exact stand-in for missing minority class data, but instead generates data points that have features that are “close” to what we’d see in the actual minority class. We use these techniques to manage extremely imbalanced data sets, such as with oil spills, labor and delivery data, and diabetes, among others. SMOTE does have some drawbacks. We can inadvertently introduce noise, depending on how we select the k-nearest neighbors and how many we include in the model. We can overcome some of these issues by using other techniques, such as SVM-SMOTE, Borderline-SMOTE, and Adaptive Synthetic Sampling (ADASYN). Regardless of the technique we use, it’s always imperative to validate the model.
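To make the core SMOTE idea concrete, here is a simplified Python sketch. It is an illustration under assumptions, not a reference implementation. The function name `smote_sketch`, the choice of `k`, and the toy data are hypothetical. Each synthetic point is placed at a random position on the line between a minority sample and one of its k-nearest minority neighbors.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest minority-class neighbors of the chosen row.
        others = [row for row in minority if row is not base]
        neighbors = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point somewhere on the segment from base to neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority class with two features (made-up values).
minority = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [1.1, 10.5]]
synthetic = smote_sketch(minority, n_new=6)
print(len(synthetic))
```

In practice, many teams would likely reach for a maintained library instead of hand-rolled code. For example, the imbalanced-learn package provides `SMOTE`, `BorderlineSMOTE`, `SVMSMOTE`, and `ADASYN` classes. Whichever route is taken, the synthetic rows should only be added to the training split, so the validation data still reflects real patients.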

At NCI, we’re working hard to increase diversity across our scientific community, from the people who plan and conduct the research to encouraging patient engagement and participation. We know there are challenges in recruiting and retaining participants, especially in long-term studies. Alternatively, we have data management techniques (described above) that can increase signals, offering useful information from populations of interest and giving us better insight into cancer in all populations.


Leave a Reply


Diverse data allows me to study health disparities in real time!
Thank you for sharing, Kimberly! There's great value in having access to data that reflects today's diverse population, and we're glad to hear how you're making use of it!