Keeping Big Data Fraud at Bay With Machine Learning
A new blog from NCI’s Office of Cancer Clinical Proteomics Research takes up the problem of exposing fraud in large biological data sets, highlighting recent findings from scientists in the Clinical Proteomic Tumor Analysis Consortium (CPTAC).
The blog is based on a study by Michael S. Bradshaw and Samuel H. Payne, “Detecting fabrication in large-scale molecular omics data,” published in PLOS ONE. The study offers a proof-of-concept approach to identifying fraudulent data.
The investigators acknowledged that data errors and inaccuracies can be introduced in a number of ways, but noted, “Data in biological sciences is particularly vulnerable to fraud given its size.” In a very large data set, it’s simply much easier to hide manipulated values. Moreover, the normal channels for identifying problems with data haven’t kept pace with the potential for fraud.
The authors believe the solution rests in technology. Their approach was grounded in Benford’s first digit law, which describes how often each leading digit (1 through 9) is expected to appear in many naturally occurring sets of numbers. Data sets whose digit frequencies don’t fit that pattern can be flagged for closer scrutiny. That method is frequently used to identify fraud in the financial field but hasn’t yet been applied to biological data.
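As a rough illustration of the first digit law, the Python sketch below computes the expected Benford frequencies, P(d) = log10(1 + 1/d), and a chi-square statistic measuring how far a set of numbers departs from them. The function names are hypothetical, and this is the classic first-digit check rather than the expanded method developed in the study.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit probabilities under Benford's law: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Return the leading nonzero digit of a nonzero number."""
    # Scientific notation puts the leading digit first, e.g. '3.140000000000000e+02'.
    return int(f"{abs(x):.15e}"[0])

def chi_square_vs_benford(values):
    """Pearson chi-square statistic of observed first digits against Benford's law."""
    digits = [first_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one nonzero value")
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )

# Powers of two are a textbook Benford-like sequence; evenly spread
# three-digit numbers are not, so their statistic should be far larger.
benford_like = [2.0 ** k for k in range(1, 200)]
evenly_spread = [float(v) for v in range(100, 1000)]
print(chi_square_vs_benford(benford_like))
print(chi_square_vs_benford(evenly_spread))
```

A larger statistic means a poorer fit to the Benford pattern. With eight degrees of freedom, a value above roughly 15.5 would conventionally be read as a significant departure at the 5% level.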
The investigators expanded the Benford technique to better match values found in a biological data set. They developed a machine learning model to look for specific digit-frequency patterns, training it on real data from CPTAC. The model was then tested on a data set that included data fabricated in several ways (e.g., random number generation, resampling with replacement, and imputation errors). Their model proved to be highly successful, achieving nearly 100% accuracy, on average, in detecting fabricated data.
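To make the idea of digit-frequency patterns more concrete, here is a minimal Python sketch of the kind of features such a model might consume, along with two simple ways of producing fabricated values for testing. Everything here is an assumption for illustration: the function names, the choice of digit positions, and the Gaussian stand-in for random number generation are not taken from the study, and the classifier itself is omitted.

```python
import random
import statistics

def digit_at(x, position):
    """Digit at a 0-based position among the significant digits of x."""
    mantissa = f"{abs(x):.15e}".split("e")[0]  # e.g. '2.500000000000000'
    return int(mantissa.replace(".", "")[position])

def digit_frequency_features(values, positions=(0, 1)):
    """Concatenate the frequencies of digits 0-9 at each chosen position."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        raise ValueError("need at least one nonzero value")
    features = []
    for pos in positions:
        counts = [0] * 10
        for v in nonzero:
            counts[digit_at(v, pos)] += 1
        features.extend(c / len(nonzero) for c in counts)
    return features

def fabricate_random(values, rng):
    """Illustrative fabrication: Gaussian draws matching the sample mean and spread."""
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in values]

def fabricate_resample(values, rng):
    """Illustrative fabrication: resampling with replacement from observed values."""
    return rng.choices(values, k=len(values))

rng = random.Random(0)  # seeded so the sketch is reproducible
observed = [12.0, 15.0, 25.0, 31.5, 8.25]  # made-up numbers, not CPTAC data
for sample in (observed, fabricate_random(observed, rng), fabricate_resample(observed, rng)):
    print(digit_frequency_features(sample))
```

Each sample becomes a 20-element vector, with ten frequencies for the first significant digit and ten for the second. A classifier could in principle be trained on such vectors labeled as real or fabricated.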
Further research will be needed to see if this digit-frequency-pattern approach has real-world applications, and whether it can be used to uncover data fraud in other types of biological data.