Cancer Data Science Pulse

Striking a Balance Between Open Data and Individual Privacy

Today, data define our lives. The devices we use to connect to the internet leave traces of us behind every time we use an online service, such as making a doctor’s appointment, renewing a driver’s license, or ordering a pizza. Research shows that 2.5 quintillion bytes of data are generated each day.1

Our digital footprint can be “active” or “passive” depending on how we interact with our online world. When we create an account with a username, email address, and password, information about us is actively captured. When we browse a website, the site may store “cookies” that passively record our browsing interests to tailor future content. These interactions generate personal data that are sometimes shared in ways we don’t intend.

While these are examples of commercial data collection, data pertaining to medical records and care have similar privacy issues. 

On the one hand, these data have the potential to propel science forward at an unprecedented pace. Using data-driven research to inform disease treatment is not a new concept. Scientists have been mining data for this purpose since the human genome was first mapped nearly 2 decades ago. Those studies include all of the ‘omics investigations—genomics, proteomics, transcriptomics, metabolomics, and microbiomics. In fact, multi-omics are at the heart of precision medicine—enabling us to better tailor a medication or therapy to an individual based on their molecular makeup. 

On the other hand, sharing patient data, including information about diagnosis, treatments, and outcomes, is drawing more attention than ever to the topic of privacy.

Safer Than You May Think

The federal government and National Institutes of Health (NIH) have been at the forefront of adopting rules to ensure that everyone’s right to privacy is protected, beginning with the Privacy Act of 1974 (amended in January 2019).2 Four key types of sensitive information are recognized: 

  1. Personally Identifiable Information (PII) (e.g., name, address, email address, etc.). 
  2. Sensitive PII (e.g., social security number, driver’s license, etc.). 
  3. Protected Health Information (PHI) (e.g., patient’s medical records, insurance claims, payment history, etc.).
  4. Other information, such as financial records, grant applications, proprietary data, etc.

The Privacy Act governs how these categories of sensitive information are collected, used, and maintained.

Additional protections were introduced in the Health Insurance Portability and Accountability Act of 1996 (HIPAA).3 HIPAA requires healthcare providers and organizations to develop procedures to ensure the security of PHI, no matter how it is collected or used.  

Informed Consent and Data Access

In 2008, Homer and his colleagues4 demonstrated how individuals might be identified using previously generated genotype data. That discovery began to put new pressure on old laws to better regulate today’s data privacy. 

One step toward safeguarding data privacy is the inclusion of “Informed Consent,” which helps study subjects understand the benefits and risks of participating in a clinical trial and sharing data. To fully consent to data use, the participant must approve every potential use of data—both current and future. 

Yet, with an evolving data ecosystem, it simply may not be possible to consent for every potential situation. This is an area that continues to receive attention and requires further scrutiny to ensure the benefits of open science are balanced against the privacy needs of the individual. 

NCI shares data generated through its programs with the research community in accordance with NIH policies. Data are available in open- and controlled-access tiers as required by informed consent to protect patient privacy and confidentiality.

In addition to clearly defining how data will be used and obtaining patient consent, the use of data is reviewed by members of Institutional Review Boards (IRBs). IRB committees are formed at each data-collecting organization and are composed of physicians, researchers, and community members who review and approve (or deny) study proposals.

Access to the controlled data and metadata files requires authorization through the National Center for Biotechnology Information’s (NCBI’s) database of Genotypes and Phenotypes (dbGaP). Data stored in dbGaP have varying degrees of confidentiality and therefore require strict access management.

Following the findings of Homer and his colleagues, and per the Genetic Information Nondiscrimination Act (GINA),5 NCI moved genotype data to dbGaP. This centralized access gives NCI greater control over the secondary use of data6 as users must apply for Data Use Certification through dbGaP and must agree to data use limitations. NCI’s Data Access Committee members review and approve (or disapprove) the Data Access Requests to the controlled-access data sets for which the committee is responsible. Such controlled data access enables NIH to share restricted-use data files under highly controlled conditions.
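The access logic described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the class names, field names, and the `may_access` function are hypothetical and are not part of dbGaP or any NIH system. It shows the general idea that open-tier data need no approval, while controlled-tier data require agreement to the data use limitations, committee approval, and a proposed use that the consent allows.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    tier: str  # "open" or "controlled" (hypothetical labels)
    allowed_uses: set = field(default_factory=set)  # consent-based data use limitations


@dataclass
class AccessRequest:
    researcher: str
    dataset: Dataset
    proposed_use: str
    agreed_to_limitations: bool = False  # researcher accepted the data use terms
    committee_approved: bool = False     # Data Access Committee decision


def may_access(request: AccessRequest) -> bool:
    """Return True if the request satisfies this simplified access policy."""
    if request.dataset.tier == "open":
        return True
    # Controlled tier: all three conditions must hold.
    return (
        request.agreed_to_limitations
        and request.committee_approved
        and request.proposed_use in request.dataset.allowed_uses
    )


# Made-up example data.
genotypes = Dataset("example-genotypes", "controlled", {"cancer research"})
summary = Dataset("example-summary-statistics", "open")

approved = AccessRequest("researcher-a", genotypes, "cancer research",
                         agreed_to_limitations=True, committee_approved=True)
out_of_scope = AccessRequest("researcher-b", genotypes, "marketing",
                             agreed_to_limitations=True, committee_approved=True)

print(may_access(approved))
print(may_access(out_of_scope))
print(may_access(AccessRequest("researcher-c", summary, "any use")))
```

In a real system the committee review is a human decision recorded over time, not a single flag. The sketch only shows how the tier and the consent limitations combine.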

Additionally, NIH has other ways of allowing data to be shared with researchers around the world without needing to restrict access. This involves data that have been “scrubbed” of personal information so they can be shared on a much broader scale. 

Defining De-Identification and Privacy Protection 

Current laws view patient data in two ways: identifiable data (those that would allow for identification of a patient, which retain all legal protections) and non-identifiable data (those that are made anonymous at the time of collection and do not retain legal protections).

This ambiguity around legal protections led many researchers to opt out of data sharing, noting they didn’t want to risk divulging PHI. To alleviate these concerns, a process called de-identification is employed. Through this process, identifying details are removed so the data can’t readily be traced to a specific individual.

De-identification also gives researchers a way to assign surrogates to the data (that are not linked to personal identifiers) in place of PII. These surrogates, or global unique identifiers (GUIDs), allow researchers to use an automated system to assign unique identities, or codes, to an individual’s data without resorting to personal information. The identifiers can help connect different data types (e.g., multi-omics, clinical/phenotypic data) across the repositories and registries without ever divulging personal information.

GUIDs facilitate secondary data use, allowing deeper analyses into research questions not related to the initial research project. Using GUIDs also aids in future discoveries, such as more effective and personalized therapies for cancer.
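As a rough illustration of how a surrogate identifier can link records without exposing PII, here is a minimal Python sketch. It assumes a keyed hash (HMAC-SHA256) computed by a trusted party. This is a stand-in technique chosen for the example, not the algorithm used by any NIH GUID tool, and the function name, key, and record fields are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held only by the trusted party that issues identifiers.
SECRET_KEY = b"replace-with-a-securely-stored-secret"


def make_surrogate_id(first_name: str, last_name: str, birth_date: str) -> str:
    """Derive a stable surrogate code from PII; the PII itself is not stored."""
    # Normalize so trivial differences in case or spacing give the same code.
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, birth_date)
    )
    digest = hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256)
    return "GUID-" + digest.hexdigest()[:16].upper()


# Two repositories hold different data types for the same made-up participant.
genomic_record = {
    "id": make_surrogate_id("Jane", "Doe", "1970-01-01"),
    "data_type": "genomic",
}
clinical_record = {
    "id": make_surrogate_id(" jane", "DOE ", "1970-01-01"),
    "data_type": "clinical",
}

# The records can be linked on the surrogate alone.
linked = genomic_record["id"] == clinical_record["id"]
print(linked)
```

Because the code is derived with a secret key, someone who sees only the surrogate cannot recompute it from a guessed name and birth date without that key. A production system would also need key management and collision handling, which this sketch omits.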

At NIH, the issue of data privacy has been formally addressed in the NIH Data Sharing Policy.7 A two-step de-identification process is used for all data submitted to dbGaP. The data must be stripped of PII to meet the HIPAA de-identification safe harbor standards, commonly known as the “18 identifiers.”8 HIPAA allows the disclosure of PHI for research as long as the data are de-identified.3
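To make the Safe Harbor idea more concrete, the following Python sketch drops a few direct identifiers from a record and coarsens some remaining fields (dates reduced to the year, ages over 89 grouped together, ZIP codes cut to three digits). The field names and the `deidentify` function are hypothetical, only a handful of the 18 identifier categories are covered, and the sketch is not a compliance tool.

```python
# Hypothetical field names standing in for a few of the 18 identifier categories.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "street_address", "phone", "email", "ssn", "medical_record_number",
}


def deidentify(record: dict) -> dict:
    """Return a copy of the record with direct identifiers dropped and
    some quasi-identifiers coarsened."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIER_FIELDS}
    # Keep only the year of a date (assumes ISO "YYYY-MM-DD" strings).
    if "diagnosis_date" in cleaned:
        cleaned["diagnosis_year"] = cleaned.pop("diagnosis_date")[:4]
    # Group ages over 89 into a single category.
    if "age" in cleaned and cleaned["age"] > 89:
        cleaned["age"] = "90+"
    # Keep only the first three ZIP digits. Safe Harbor also requires the
    # three-digit area to be sufficiently populous; that check is omitted here.
    if "zip" in cleaned:
        cleaned["zip3"] = cleaned.pop("zip")[:3]
    return cleaned


# Made-up example record.
example = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "zip": "20892",
    "age": 93,
    "diagnosis_date": "2019-03-14",
    "diagnosis": "melanoma",
}
cleaned = deidentify(example)
print(cleaned)
```

The original record is left unchanged, and the function returns a new dictionary, so the identified copy can stay behind stricter controls while only the cleaned copy is shared.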

Big Data Balancing Act 

Advances in machine learning and artificial intelligence have greatly increased the complexity of how data are shared and combined. These new technologies have the potential to complicate privacy issues and may increase the possibility of reidentifying a study participant.

Education will be key. Efforts to educate researchers and physicians on issues related to data privacy should begin early, even at the undergraduate and graduate levels. This will help cultivate good habits around data usage and aid in protecting individual privacy.

What works to maintain the balance between privacy and data usability likely will change over time. This calls for a greater emphasis on data privacy education and training at all levels. For example:

  • Study participants who consented to share their data broadly must understand that use of the data may extend well beyond the immediate study and that other researchers could use the data to study new facets of the disease or its treatment.
  • Computational and non-computational scientists should be careful when handling participants’ data as some algorithms could pose a risk of identification.

Educating the public is equally important. Information on how data are stored, accessed, and used should be made widely available. Steps are taken now to educate study participants at the time of data collection, through the informed consent process. However, more steps need to be taken to ensure the community, as a whole, understands the complete data life cycle. Efforts to broadly inform the public also will help build trust that researchers are taking the proper measures to safeguard everyone’s information. 

As we move into the future, new privacy policies and processes need to be developed, and existing policies updated, to adapt to the ever-changing “big data” landscape. Ensuring data can be safely and securely shared will maximize taxpayers’ investment in research. Broad and open data sharing undoubtedly will lead to better treatments and cures, not just for cancer but for all diseases and disorders that threaten public health. 



1 Data Never Sleeps 5.0.

2 The Privacy Act of 1974. Pub. L. 93-579, 88 Stat. 1896, enacted December 31, 1974, 5 U.S.C. ch. 5 § 552a.

3 Health Insurance Portability and Accountability Act of 1996, Pub. L. No. 104-191, 110 Stat. 1936 (Aug. 21, 1996). 45 C.F.R. § 164.02(b)(1).

4 Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, et al. (2008) Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genet 4(8): e1000167.

5 H.R. 493: The Genetic Information Nondiscrimination Act of 2008 (accessed April 29, 2021). Public Law No: 110-233.

6 P3G Consortium, Church G, Heeney C, Hawkins N, de Vries J, Boddington P, et al. (2009) Public Access to Genome-Wide Data: Five Views on Balancing Research with Privacy and Protection. PLoS Genet 5(10): e1000665.

7 Final NIH Statement on Sharing Research Data. NOT-OD-03-032 (Feb. 26, 2003) (accessed November 23, 2019).

8 Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.

AAAS Science and Technology Policy Fellow, NCI Office of Data Sharing
Health Science Administrator, NCI Office of Data Sharing
Vivian Ota Wang, Ph.D.
Policy and COVID Activities LEAD Division of Program Coordination, Planning, and Strategic Initiatives Office of the Director, NIH