Cancer Data Science Pulse

Synthetic Data Helps Counter Lack of Diversity in Data

Over the last decade, NCI’s made significant strides in bringing more diverse populations into clinical trials. In time, these and other efforts will help generate research data that’s more representative of today’s society. But this work takes time, and technology won’t wait. Researchers need representative data now if they are to continue to make significant advances in cancer research.

One computational tool—synthetic data—could help. When faced with very imbalanced data sets, you can use computer-generated data that closely matches understudied members of society.

In this blog post, Laritza Rodriguez, M.D., Ph.D., program director at NCI’s Center for Cancer Health Equity, describes the Synthetic Minority Oversampling Technique (SMOTE) and how it can help address imbalanced biomedical data sets.

What is SMOTE?

SMOTE uses the features of a sample’s K-nearest neighbors[1] to create synthetic data samples of the minority class, letting you bring more balance to an imbalanced data set. SMOTE is available in Python (for example, through the imbalanced-learn library) and in other data processing libraries.

Unlike other oversampling methods, which rely on copying existing samples, SMOTE does not replicate the minority class samples that already exist in the data set. Instead, it creates new samples by combining feature values from existing minority class samples, so every synthetic case is derived solely from real ones.

Using SMOTE, you have less risk of overfitting (that is, building a model that fits the training data so closely that it performs poorly on new data), which is a known drawback of other oversampling techniques that duplicate existing samples.

You also can incrementally adjust parameters, such as the number of nearest neighbors to use and the percent increase in the total number of synthetic cases. This gives you flexibility as you intermittently test and validate the new data sets, helping detect potential overfitting of the model.

How is the synthetic data created?

You begin by selecting a minority class sample that you want to boost in the existing data set. Once you launch the SMOTE algorithm, it finds that sample’s K-nearest neighbors within the minority class, randomly selects one of them, computes the difference between the feature values of the two samples, multiplies that difference by a random number between 0 and 1, and adds the result to the original sample’s feature values. The new feature values fall in between the two samples selected.
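The interpolation step can be sketched in plain Python. This is a minimal illustration of the commonly described SMOTE procedure, not the code of any particular library; the function name `smote_sketch`, its parameters (`k`, `n_per_sample`, `seed`), and the toy data are all assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, k=5, n_per_sample=1, seed=0):
    """Create synthetic minority samples by interpolating toward neighbors.

    minority: list of feature vectors (lists of floats) from the minority class.
    k: number of nearest minority-class neighbors to choose from (illustrative).
    n_per_sample: synthetic samples to create for each original sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbors than exist
    synthetic = []
    for i, sample in enumerate(minority):
        # Rank the other minority samples by Euclidean distance to this one.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(sample, m))
        neighbors = others[:k]
        for _ in range(n_per_sample):
            neighbor = rng.choice(neighbors)
            gap = rng.random()  # random number in [0, 1)
            # The new point lies on the segment between sample and neighbor.
            synthetic.append(
                [s + gap * (n - s) for s, n in zip(sample, neighbor)]
            )
    return synthetic


# Hypothetical two-feature minority class with four samples.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_samples = smote_sketch(minority, k=2, n_per_sample=2)
print(len(new_samples))  # 8 synthetic samples: 2 for each of the 4 originals
```

Raising `n_per_sample` increases the number of synthetic cases, and changing `k` widens or narrows the pool of neighbors each new sample can be drawn toward. These correspond to the two parameters the post describes adjusting incrementally.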

Why use this technique?

If you’re creating predictive models, you need to ensure that the approach applies to everyone, including typically underrepresented populations. Unfortunately, data are not all created equal, and it’s a common challenge to find sufficient representation of the target group.

For example, in one study, an AI model trained on data from predominantly White populations wasn’t nearly as robust in women with a prior history of breast cancer or in Hispanic women. Oversampling techniques such as SMOTE can help offset a lack of diversity in the data so you can gain better insight into important features in these groups.

When you increase the number of minority class samples and balance the class distribution, you’re able to improve the performance of the predictive model. Increasing the number of minority class samples also lets you detect invisible signals in extremely sparse data sets where prediction without oversampling is not possible (e.g., detection of gene markers only present in a small number of members of a population). 

And finally, as mentioned earlier, generating synthetic samples lowers the risk of overfitting, because the new samples are similar to the original data without being exact copies of it.

What are some of the drawbacks of using SMOTE?

SMOTE has potential limitations. There is currently no reliable way to evaluate the synthetic samples. Although the newly derived samples come from the original data, they may not be an accurate representation of that data, which could have a significant impact on your model’s performance.

You also need to take care when applying the technique to closely overlapping classes, where there’s very little distinction between the minority and the majority class. In this case, you run the risk of creating synthetic samples that result in noise, which may negatively impact the machine learning model’s performance.

When working with large data sets, the computational cost of the technique can be high, especially if you’re making a lot of adjustments to the parameters. An incremental approach is most advisable, as it can save time and resources over the long run.

Other ways to improve the model’s performance include combining techniques. For example, you can oversample the minority class and undersample the majority class. Ensemble techniques such as stacking and AdaBoost also can help improve the accuracy of the predictive models.
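As a rough sketch of combining the two directions, the example below shrinks the majority class by random undersampling and grows the minority class by interpolation. The helper names, the target sizes (40 and 20), and the toy data are hypothetical. The minority helper is a simplified stand-in for SMOTE that interpolates between random pairs and skips the nearest-neighbor search.

```python
import random


def undersample_majority(majority, target_size, seed=0):
    """Randomly keep target_size majority samples, without replacement."""
    rng = random.Random(seed)
    if target_size >= len(majority):
        return list(majority)
    return rng.sample(majority, target_size)


def interpolate_minority(minority, target_size, seed=0):
    """Grow the minority class to target_size with interpolated points.

    Simplified stand-in for SMOTE: pairs are chosen at random rather than
    from each sample's nearest neighbors. Needs at least two samples.
    """
    rng = random.Random(seed)
    grown = list(minority)
    while len(grown) < target_size:
        a, b = rng.sample(minority, 2)
        gap = rng.random()  # random number in [0, 1)
        grown.append([x + gap * (y - x) for x, y in zip(a, b)])
    return grown


# Hypothetical imbalanced data: 100 majority samples, 5 minority samples.
majority = [[float(i), float(i % 7)] for i in range(100)]
minority = [[50.0, 1.0], [52.0, 2.0], [51.0, 3.0], [53.0, 1.5], [49.0, 2.5]]

# Meet in the middle: shrink the majority to 40 and grow the minority to 20.
balanced_majority = undersample_majority(majority, 40)
balanced_minority = interpolate_minority(minority, 20)
print(len(balanced_majority), len(balanced_minority))  # 40 20
```

Meeting in the middle means fewer synthetic samples are needed than if the minority class had to match the full majority class, at the cost of discarding some real majority samples.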

How can I trust the results?

As with any other approach, it’s imperative that you validate the model at different stages of the experiments.
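One common validation practice, offered here as an assumption rather than something the post prescribes, is to set aside a held-out portion of the real data before any oversampling, so that the model is always scored on samples that were never used to create synthetic cases. The function name, the 20% holdout fraction, and the toy data below are hypothetical.

```python
import random


def split_before_oversampling(samples, labels, holdout_fraction=0.2, seed=0):
    """Shuffle, then split into a training set and a held-out set of real data."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_holdout = max(1, int(len(indices) * holdout_fraction))
    holdout_idx, train_idx = indices[:n_holdout], indices[n_holdout:]
    train = [(samples[i], labels[i]) for i in train_idx]
    holdout = [(samples[i], labels[i]) for i in holdout_idx]
    return train, holdout


# Hypothetical data set: 50 samples, 1 in 10 belongs to the minority class.
samples = [[float(i), float(i * 2)] for i in range(50)]
labels = [1 if i % 10 == 0 else 0 for i in range(50)]

train, holdout = split_before_oversampling(samples, labels)
# Oversample only `train` from here on; `holdout` stays untouched real data.
print(len(train), len(holdout))  # 40 10
```

With very few minority samples, a purely random split can leave none of them in the held-out set. A stratified split, which preserves the class proportions in both parts, is a common remedy.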

Keeping the human in the loop is the best way to ensure you’ve created a true gold standard, although this often is the most expensive of the validation options. The result, however, is well worth the effort, as you’ll be confident your new data set aligns more closely with the general population.

[1] K-nearest neighbor (KNN) is an algorithm that lets you classify or predict an unknown data point by comparing it to the closest known data points in a given set.

Laritza Rodriguez, M.D., Ph.D.
Program Director, NCI Center for Cancer Health Equity (formerly known as the “Center to Reduce Cancer Health Disparities”)