Cancer Data Science Pulse

Data Set 411: The National Lung Screening Trial

In honor of National Lung Cancer Awareness month, we’re highlighting the “data deets” (or details) for the National Lung Screening Trial (NLST). This large-scale effort collected imaging data for more than 53,000 heavy smokers. With much of the data now publicly available through NCI resources, we’ll cover the research that generated this data, its metrics, how to access it, and some of the exciting data science activities using it to improve lung cancer diagnosis and screening.

The Study—How Was Data Collected?

With recommendations from NCI’s Lung Screening Study group, the American College of Radiology Imaging Network conducted the trial to compare how well spiral computed tomography (CT) and standard chest X-rays detected lung cancer. Starting in 2002, NLST enrolled 53,454 heavy smokers between the ages of 55 and 74. The participants had a smoking history of at least 30 pack-years (for example, a person who smoked one pack of cigarettes a day for 30 years) and no symptoms, signs, or history of lung cancer. Over the next seven years, participants received three annual screenings with either spiral CT or X-ray. Researchers reported new cases and deaths from lung cancer to trial coordinators. A publication in 2011 showed that participants who received the spiral CTs had a 15-20% lower risk of dying from lung cancer than those who had the chest X-rays.
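The pack-year arithmetic behind that enrollment threshold is easy to check by hand, and a few lines of Python make it concrete. This is only a minimal sketch: the function names are hypothetical, and the check covers just the two numeric criteria stated above (age 55 to 74 and at least 30 pack-years), not the full trial protocol.

```python
MIN_AGE = 55          # lower age bound stated in the post
MAX_AGE = 74          # upper age bound stated in the post
MIN_PACK_YEARS = 30   # smoking-history threshold stated in the post


def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Pack-years = packs smoked per day multiplied by years of smoking."""
    if packs_per_day < 0 or years_smoked < 0:
        raise ValueError("packs_per_day and years_smoked must be non-negative")
    return packs_per_day * years_smoked


def meets_stated_criteria(age: int, packs_per_day: float, years_smoked: float) -> bool:
    """Check only the age range and pack-year threshold described in the post."""
    in_age_range = MIN_AGE <= age <= MAX_AGE
    heavy_smoker = pack_years(packs_per_day, years_smoked) >= MIN_PACK_YEARS
    return in_age_range and heavy_smoker


if __name__ == "__main__":
    print(pack_years(1.0, 30.0))                  # 30.0
    print(pack_years(2.0, 15.0))                  # 30.0
    print(meets_stated_criteria(60, 0.5, 40.0))   # False (only 20 pack-years)
```

Note that two packs a day for 15 years reaches the same 30 pack-years as one pack a day for 30 years, which is why the one-pack example is an illustration rather than the definition.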

The Stats—What Should You Know and How Do You Get the Data

The seven-year NLST generated 200 million files (10 terabytes) of medical imaging, pathology, clinical, and protocol data. The X-ray images from this study are not currently available. However, the following table shows the breadth of data available through NCI resources such as NCI’s Imaging Data Commons (IDC), The Cancer Imaging Archive (TCIA), and NCI’s Cancer Data Access System (CDAS). Much of the NLST data set is open access and can be analyzed in the browser, downloaded, or imported through the IDC or TCIA APIs.

Available Data: CT Scan Images
Description: Only images from CT screening exams are available from the study. The interface separates the files into:
  • screen-detected lung cancer with image in same study year.
  • lung cancer with image.
  • nodule positive or screen positive with image in same study year.
  • three negative screens with three images.
  • other participants.
The images are in DICOM format. Explore the objects and relationships in the DICOM data model.
Data Set Size: 25,000
File Types: DICOM

Available Data: Pathology Images
Description: Hematoxylin and eosin (H&E)-stained images from the smaller Lung Screening Study of the NLST are provided. Though the images aren’t annotated, researchers can request to review the case report forms where regions of interest were recorded.
Data Set Size: 450
File Types: SVS

Available Data: Tissue Samples
Description: Twice a year, you can request tissue samples through NCI’s Prostate, Lung, Colorectal, Ovarian Etiologic and Early Marker Studies Program for a subset of the NLST participants who developed lung cancer during the trial.
Data Set Size: 438
File Types: N/A
Where to Access: CDAS

Available Data: Clinical Data
Description: NLST provides 15 clinical data sets, including demographic, mortality, abnormality, and other clinical information, upon request. Data dictionaries for these data sets are listed alongside the data.
Data Set Size: 54,454
File Types: SAS or CSV
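For readers who want to try the programmatic route mentioned above, here is a hedged sketch of one way to list NLST CT series. It only builds a SQL string; it does not contact any service. The table name `bigquery-public-data.idc_current.dicom_all`, the collection identifier `nlst`, and the column names are assumptions based on IDC’s public BigQuery tables and should be checked against the current IDC documentation before use. The helper name is hypothetical.

```python
def build_nlst_series_query(modality: str = "CT", limit: int = 10) -> str:
    """Build a BigQuery SQL string listing distinct NLST series of one modality.

    Assumed (verify against IDC documentation): the public table
    `bigquery-public-data.idc_current.dicom_all` and collection_id 'nlst'.
    """
    # Keep the interpolated values simple so the query text stays safe.
    if not modality.isalnum():
        raise ValueError("modality must be alphanumeric, for example 'CT'")
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT DISTINCT PatientID, StudyInstanceUID, SeriesInstanceUID\n"
        "FROM `bigquery-public-data.idc_current.dicom_all`\n"
        "WHERE collection_id = 'nlst'\n"
        f"  AND Modality = '{modality}'\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Print the query so it can be pasted into a BigQuery console or client.
    print(build_nlst_series_query())
```

The resulting text can be run in any BigQuery client you already use. Running queries requires a Google Cloud project, which is outside the scope of this sketch.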

The Science—How Are Others Using the Data?

So, what has us most excited about this data? It’s the number of data science projects using it to better detect and screen for lung cancer!

Machine Learning Project

With their plethora of annotated imaging and clinical data, large data sets like NLST are ideal candidates for training machine learning models. Dr. Ronald Summers, the principal investigator within the NIH Clinical Center’s Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, is using NLST data to develop machine learning algorithms that localize nodules, masses, and other abnormalities in chest X-rays. In the future, these algorithms could become the cornerstone for automated tools that help clinicians detect lung cancer.

Statistical Model Project

The NLST data can also be used to validate statistical models for risk prediction. Dr. Stuart Baker, a mathematical statistician in the NCI Division of Cancer Prevention’s Biometry Research Group, is evaluating how well a six-step parsimonious quadratic logistic regression model can predict the risk of lung cancer in the X-rays of NLST participants. Better risk prediction models can open the door to improved tools for cancer screening and prevention.
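To give a feel for what “quadratic logistic regression” means, the sketch below shows the general shape of a logistic model with a squared term for a single predictor. It is not Dr. Baker’s model: the post does not describe the predictors, the coefficients, or the six-step procedure, so the function name and every number here are hypothetical.

```python
import math


def quadratic_logistic_risk(x: float, intercept: float, linear: float, quadratic: float) -> float:
    """Return a probability from a logistic model with a squared term.

    The linear predictor is z = intercept + linear * x + quadratic * x**2,
    and the predicted risk is 1 / (1 + exp(-z)).
    """
    z = intercept + linear * x + quadratic * x * x
    # Numerically stable logistic function for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


if __name__ == "__main__":
    # Hypothetical coefficients, chosen only to show the call; not fitted to any data.
    risk = quadratic_logistic_risk(x=30.0, intercept=-3.0, linear=0.05, quadratic=0.001)
    print(f"illustrative risk: {risk:.3f}")
```

The squared term lets the predicted risk curve bend rather than rise at a constant rate on the log-odds scale, which is the main reason to add it to an otherwise small model.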

Computational Models to Reduce Health Disparities

Additionally, investigators from NCI’s Division of Cancer Epidemiology and Genetics (DCEG), Drs. Hormuzd Katki, Li Cheung, Anil Chaturvedi, and Rebecca Landy, use NLST data to develop statistical prediction models that identify who is at the highest risk of lung cancer and who stands to benefit most from lung cancer screening (as measured by life-years gainable), with the goal of reducing disparities in lung cancer incidence and mortality. The Lung Cancer Risk Assessment Tool (LCRAT) estimates lung cancer risk before screening. LCRAT+CT then updates that pre-screening risk with features from the CT image, including AI algorithm scores. The Life-years From Screening-CT (LYFS-CT) model estimates the increase in life expectancy a person could gain from screening. Used clinically, these models could help clinicians balance the benefits of screening against its potential harms, improve the effectiveness and efficiency of screening, and may also help reduce the disparity in screening eligibility between White and African American/Black individuals. They do so by identifying individuals who have the greatest likelihood of benefiting from screening but are missed by national preventive screening criteria. The models are publicly available as part of a clinical decision support tool for the Epic EHR called Decision Precision+.

Software Project

Finally, NLST data have contributed to a tool that helps minimize a patient’s exposure to radiation. Dr. Choonsik Lee, an investigator in NCI DCEG’s Radiation Epidemiology Branch, integrated NLST data to further develop NCI’s dosimetry system for Computed Tomography. Clinicians and radiation technologists can use this free, downloadable tool to monitor a patient’s exposure to radiation from a CT scan, thereby reducing potentially harmful side effects from regular screenings.

These are just a few projects leveraging NLST data. We hope these examples can inspire your own project to help the community better learn from, detect, diagnose, and treat cancer.

