Cancer Data Science Pulse
In The Year 2030—Looking at How Genomic Data Might Evolve
There’s a seemingly insatiable desire for data that’s shared across the scientific spectrum. And rightfully so, as more and better data are essential for making informed, evidence-based decisions to advance research and improve public health.
Data are the driving force behind the march toward precision medicine. Imagine a day when your healthcare is so personalized that there’s no guessing as to what medication will work best for you or whether you are at risk for a particular disease. The answers will be right at your and your physician’s fingertips.
This is a bold prediction. A lot has to happen to move research forward and to make this a common occurrence. This is the premise behind a new seminar series being hosted by the National Human Genome Research Institute (NHGRI), called “Bold Predictions for Human Genomics by 2030.” This unique series sets the bar for what might come. It includes 10 lofty but, for the most part, attainable goals.
NHGRI’s first bold prediction is one that dovetails particularly well with data science: “Bold Prediction #1: Generating and analyzing a complete human genome sequence will be routine for any research laboratory, becoming as straightforward as carrying out a DNA purification.”
If it comes to fruition, this bold prediction represents a quantum leap for data scientists. Much of today’s genomic and other data relies on filling in missing information using predictive modeling. That is, we develop models to infer what’s missing. Such models can come very close to real-life scenarios. But what if scientists didn’t have to infer to find solutions? Observing the data directly could lead to an abundance of highly accurate data.
If we could make generating and analyzing the complete human genome sequence routine, we would have a way to directly observe every form of variation. This would shift the research emphasis from primarily developing methods to help fill in data, to actually answering important biological and biomedical questions.
The Current State of the Field
Scientists first unraveled large portions of the human genome in 2001, setting in motion a new way of thinking about disease risk and potential treatments. It’s been a game changer, especially in the cancer field, helping to identify and treat people with breast cancer, prostate cancer, and more.
Unfortunately, scientists have been stymied in their efforts to sequence a person’s “complete” genome. One issue has been that we’ve had to rely on relatively short “reads” of the sequences that make up chromosomes. Scientists capture these short reads and align them to a reference. But the reference remains incomplete. For example, it’s been difficult to resolve highly repetitive sections (such as heterochromatin, which consists of tightly packed regions of DNA) and gene arrays (where genes are repeated multiple times). The content of centromeres (chromosome constrictions that are important for separating the strands during cell division) and telomeres (the “caps” of non-coding material at the ends of each chromosome) also makes deciphering the genome difficult.
Although elusive, these areas can make up as much as 11 percent of a chromosome and are vital for cell reproduction and survival. Identifying and understanding these missing pieces will give us even greater genomic power.
Fortunately, this short-read liability is changing. Scientists now are able to use existing technologies such as nanopore and single-molecule real-time (SMRT) sequencing to generate “ultra-long” reads that can routinely span more than 100,000 base pairs. Read lengths of greater than 1 million base pairs are even possible, as is automated assembly. In addition to very long reads, we’re making rapid progress in charting complete telomere-to-telomere chromosome assemblies using extremely accurate high-fidelity (HiFi) reads and combining them with the ultra-long reads. These two technologies allow us to map sequences to specific locations and soon may make it possible to complete the entire human genome using currently available technologies.
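To make the read-length problem concrete, here is a minimal Python sketch. It is a toy illustration, not how production aligners work: the sequences are made up, the helper `find_all` is hypothetical, and exact string matching stands in for real alignment. It shows a short read that falls entirely inside a repeat matching several places in the reference, while a longer read that spans the repeat and reaches the unique sequence on either side matches only one.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


# Toy reference (invented sequences): unique flank, tandem repeat, unique flank.
LEFT_FLANK = "ACGTTGCA"
REPEAT_UNIT = "GATTACA"
RIGHT_FLANK = "TTCCGGAA"
reference = LEFT_FLANK + REPEAT_UNIT * 5 + RIGHT_FLANK

# A short read that lies entirely inside the repeat.
short_read = REPEAT_UNIT

# A long read that spans the whole repeat plus 4 bases of each flank.
long_start = len(LEFT_FLANK) - 4
long_end = len(LEFT_FLANK) + len(REPEAT_UNIT) * 5 + 4
long_read = reference[long_start:long_end]

short_hits = find_all(reference, short_read)
long_hits = find_all(reference, long_read)

print(f"short read ({len(short_read)} bases) matches at: {short_hits}")
print(f"long read ({len(long_read)} bases) matches at: {long_hits}")
```

In this toy case the short read has five equally good placements, so its true origin cannot be determined, whereas the long read has exactly one. Real repeats in centromeres and heterochromatin are far longer and are not perfect copies, which is part of why both very long and very accurate reads matter.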
What's to Come?
The significant advances in understanding the genome set up NHGRI’s first bold prediction. Next, we’ll be able to track how genes are passed from one generation to the next and distinguish exactly what is inherited from each parent. We’ll see how individuals differ from one another and what they have in common. In short, we’ll be able to complete the blueprint that makes us uniquely human.
With a fully deciphered genome, coupled with a lowered cost for conducting genomic research and the advent of CRISPR, scientists will soon approach a point where routine genomic testing will be possible in a variety of settings, and all currently unsolved Mendelian disease questions will be answered.
It’s fast-moving and exciting, as it opens up new, even bolder, predictions. We’re not one genome. Our genomes change as we age and as we’re exposed to different environmental influences. Future research will be able to shift from looking at a full genome to looking at data from a single cell and how it changes over time.
Perhaps someday, a clinician will be able to sequence a person’s complete genome at birth as part of a standard medical record. That data then could be used over the person’s lifetime to prevent diseases, predict outcomes, and target treatment as problems arise.
We could practice preventive medicine instead of reactive medicine. It will change everything. It’s an ambitious prediction, but with the advances taking place in genomics data, the future may be closer than we think.
Learn more about the Bold Prediction seminar series. The next presentation will take place on March 8, 2021, 3:00 pm–4:30 pm ET.
Resources
Porubsky, D., Ebert, P., Audano, P.A. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0719-5
Miga, K.H., Koren, S., Rhie, A. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).