New Approach Helps Address Lack of Diversity in GWAS Cancer Data
In a new article published in Nature Genetics, Dr. Haoyu Zhang, from NCI’s Division of Cancer Epidemiology and Genetics, seeks to solve a key issue related to research using genome-wide association studies (GWAS).
Researchers have been using GWAS data for decades as a way of teasing out genetic influences on health. But up until now, most of the GWAS data have been very Euro-centric (i.e., the data are primarily from people of European ancestry). This disparity in the data is akin to the “algorithmic bias” seen in some machine learning applications (e.g., in facial recognition software).
To help address GWAS data disparities, Dr. Zhang and his colleagues developed CT-SLEB—a powerful and computationally scalable method for generating more precise polygenic risk scores across a range of ancestral groups, including Latino, African American, East Asian, and South Asian.
CT-SLEB harnesses the power of multiple statistical techniques, letting researchers, like you, make better predictions using genetic data from diverse populations. Starting from GWAS summary statistics, you can use CT-SLEB to generate polygenic risk scores through the following three-step process.
- Step 1 applies a two-dimensional clumping-and-thresholding (CT) method to select genetic variants for specific ancestral groups.
- Step 2 uses an empirical Bayes (EB) approach to estimate the effect sizes of those variants.
- Step 3 uses a super learning (SL) model (i.e., a type of machine learning model) to combine the multiple genetic risk scores that result.
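To make the three steps concrete, here is a minimal Python sketch of the general idea. It is not the CT-SLEB software or its API. Every function name is hypothetical, the data are simulated, clumping is left out, the empirical Bayes step is reduced to a simple shrinkage toward zero within one ancestry, and ordinary least squares stands in for the super learning model.

```python
import numpy as np

def select_snps(p_target, p_european, cut_target, cut_european):
    """Step 1 stand-in: keep a SNP if it passes either ancestry's p-value cutoff.
    Clumping (pruning correlated SNPs) is omitted in this sketch."""
    return (p_target < cut_target) | (p_european < cut_european)

def shrink_effects(beta, se):
    """Step 2 stand-in: normal-prior empirical Bayes shrinkage toward zero."""
    # Method-of-moments estimate of the prior variance, floored at zero.
    prior_var = max(float(np.mean(beta**2 - se**2)), 0.0)
    return beta * prior_var / (prior_var + se**2)

def risk_score(genotypes, beta, keep):
    """Polygenic risk score: weighted sum of allele counts over selected SNPs."""
    return genotypes[:, keep] @ beta[keep]

def combine_scores(scores, phenotype):
    """Step 3 stand-in: least-squares weights (intercept first), used here
    in place of a super learner."""
    design = np.column_stack([np.ones(scores.shape[0]), scores])
    weights, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return weights

# Simulated toy inputs (assumed for illustration; not real GWAS data).
rng = np.random.default_rng(0)
n_people, n_snps = 200, 50
genotypes = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)
true_beta = rng.normal(0.0, 0.1, n_snps)
phenotype = genotypes @ true_beta + rng.normal(0.0, 1.0, n_people)
se = np.full(n_snps, 0.05)
beta_hat = true_beta + rng.normal(0.0, 0.05, n_snps)
p_target = rng.uniform(0.0, 1.0, n_snps)
p_european = rng.uniform(0.0, 1.0, n_snps)

# One score per pair of cutoffs, then a weighted combination of the scores.
shrunk = shrink_effects(beta_hat, se)
cutoffs = [(0.05, 0.05), (0.5, 0.5), (1.0, 1.0)]
scores = np.column_stack([
    risk_score(genotypes, shrunk, select_snps(p_target, p_european, ct, ce))
    for ct, ce in cutoffs
])
weights = combine_scores(scores, phenotype)
combined = np.column_stack([np.ones(n_people), scores]) @ weights
```

In practice, the combining weights would be fit on a tuning set and checked on a separate validation set; this sketch fits and applies them on the same simulated sample only to stay short.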
To test CT-SLEB, Dr. Zhang and his team examined adult height, a model trait because Euro-centric studies have identified it as a known risk factor for a range of complex diseases, including heart disease and cancer. The researchers used GWAS data from 23andMe, Inc., the Global Lipids Genetics Consortium, All of Us, and the UK Biobank. In all, these data included more than 5 million people of diverse ancestry.
The researchers found that although the link between height and disease was significant, its prominence varied depending on ancestry, and prediction wasn’t nearly as accurate in non-European populations. The pattern found with height proved similar with other traits, including those used to predict disease outcomes. As noted by Dr. Zhang, “Predicting outcomes of complex diseases, such as cancer, is an area of keen research interest. Our approach was markedly better at predicting disease in people of African origin, where polygenic prediction has been most challenging.”
Dr. Zhang credits the team’s SL model with helping improve CT-SLEB’s prediction performance. Using this model, the team was able to combine multiple genetic risk scores from numerous sources, but not without challenges. The researchers needed to obtain joint data agreements for accessing the data. It also took time to pre-process the data and perform quality control.
In all, the researchers spent more than 4 years and logged over 1,000 “commits” (or records of changes) across two GitHub repositories in developing CT-SLEB. Said Dr. Zhang, “Clearly, a lot of hard work, time, and dedication went into this research, but we’re very pleased with the result.”