Sep 02 2022
Data Analytics

What Is the National COVID Cohort Collaborative (N3C) Data Enclave?

N3C gives researchers access to a large enclave of COVID-19 data related to the entire patient journey, including symptoms and outcomes.

In many ways, the pandemic has left scientists and the public with more questions than answers. However, a new data enclave is helping researchers learn more about the coronavirus, long COVID-19 and treatments. With 17.6 billion rows of data from 15 million patients at 75 institutions, researchers have access to a growing collection of COVID-19-related data.

The enclave, called the National COVID Cohort Collaborative (N3C), is led by the National Center for Advancing Translational Sciences (NCATS) at the National Institutes of Health under the U.S. Department of Health & Human Services. NCATS worked with academic and community health centers as well as federally qualified health centers and Clinical and Translational Science Awards (CTSA) Program organizations to build the database, which is currently being used by at least 312 institutions for 375 research projects.

The creation of N3C began prior to the pandemic. However, the public health crisis became the catalyst for a large-scale healthcare data initiative.


What Is N3C, and What’s the Technology Behind It?

Dr. Kenneth Gersing, director of informatics at NCATS, says the organization gets groups to work together and turns data silos into a network through shared services. To make a program such as N3C possible, it’s important for NCATS to find economies of scale.

The organization also leads the Rare Diseases Clinical Research Network, a precursor of N3C, and it was able to use some of the same technology and processes around data cleanup, analysis and output to support the N3C initiative. In 2017, NCATS started piloting Palantir instances in the cloud using Amazon Web Services to create a secure analytic environment. The organization deployed Google Workspace in 2019 to allow the research community to easily share findings.

The organization also harmonizes all the COVID-19 data collected from healthcare institutions, which use a variety of common data models. NCATS provides all the tools researchers need to access and analyze the data and ensures the security of the enclave and protection of patient privacy.


Gersing explains that NCATS handles identity authentication, cloud deployment, security, Software as a Service support, single sign-on, ticketing and compliance concerns so that researchers can focus on science.

The goal of N3C is to share information with the community. If a researcher brings in an algorithm to run against the enclave, it becomes part of the assets available to the community. The algorithm would need to be evaluated for security purposes prior to approved use.

How Is N3C Data Accessed by Researchers?

N3C’s enclave includes data from at least 5.9 million COVID-19-positive patients, plus data from two controls for every positive patient. New data is collected, harmonized to the Observational Medical Outcomes Partnership common data model and released weekly.

“We have almost 4,000 volunteers. There is no way this would be possible without a community coming together and helping,” says Gersing.

Some of the patient data collected dates back to January 1, 2018, offering researchers a fuller picture of patient journeys.

N3C uses a centralized rather than federated model. In federated models, researchers can ask a question such as, “How many of the female patients over 60 have hypertension?” They would receive a number but wouldn’t have access to row-level data. Using a centralized model removes that limitation.

“We wanted researchers to be able to reach the data directly and iterate over it,” says Gersing. “Particularly, we wanted to be able to use technologies like machine learning, which is difficult to do in a federated model.”
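The difference between the two models can be illustrated with a minimal Python sketch. It is not N3C code: the `Patient` record, the two sample sites and both query functions are hypothetical and only show the contrast between receiving a count and working with row-level data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    # Hypothetical, simplified record used only for this illustration.
    patient_id: str
    sex: str
    age: int
    has_hypertension: bool


# Made-up sample data standing in for two contributing sites.
SITE_A = [
    Patient("a1", "F", 67, True),
    Patient("a2", "F", 72, False),
    Patient("a3", "M", 65, True),
]
SITE_B = [
    Patient("b1", "F", 61, True),
    Patient("b2", "F", 45, True),
]


def is_match(p: Patient) -> bool:
    """Female patients over 60 with hypertension."""
    return p.sex == "F" and p.age > 60 and p.has_hypertension


def federated_count(sites: list[list[Patient]]) -> int:
    """Federated style: each site answers locally and shares only a count."""
    return sum(sum(1 for p in site if is_match(p)) for site in sites)


def centralized_rows(sites: list[list[Patient]]) -> list[Patient]:
    """Centralized style: rows are pooled, so the researcher can iterate over them."""
    pooled = [p for site in sites for p in site]
    return [p for p in pooled if is_match(p)]


if __name__ == "__main__":
    print(federated_count([SITE_A, SITE_B]))
    for row in centralized_rows([SITE_A, SITE_B]):
        print(row)
```

In the federated function the caller never sees a `Patient` object, only a number. In the centralized function the matching rows come back, which is what makes iterative analysis and model training practical.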

N3C also ensures that definitions are consistent, which is important in cross-model data harmonization. Different organizations must agree on the definition of a visit, for instance, Gersing explains.
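As a rough sketch of what agreeing on a definition involves, the Python below maps two differently shaped source records onto one shared visit structure. Both site formats and all source field names are hypothetical, and the target fields are only loosely modeled on a common data model rather than taken from N3C's actual pipeline.

```python
from datetime import date, timedelta
from typing import NamedTuple


class Visit(NamedTuple):
    # Shared target definition of a visit (simplified, illustrative).
    person_id: str
    visit_start_date: date
    visit_end_date: date


def from_site_a(row: dict) -> Visit:
    """Hypothetical site A stores an encounter with explicit start and end dates."""
    return Visit(
        person_id=row["patient"],
        visit_start_date=date.fromisoformat(row["enc_start"]),
        visit_end_date=date.fromisoformat(row["enc_end"]),
    )


def from_site_b(row: dict) -> Visit:
    """Hypothetical site B stores an admission date plus length of stay in days."""
    start = date.fromisoformat(row["admit_date"])
    return Visit(
        person_id=row["pid"],
        visit_start_date=start,
        visit_end_date=start + timedelta(days=row["los_days"]),
    )


if __name__ == "__main__":
    a = from_site_a({"patient": "p1", "enc_start": "2021-03-01", "enc_end": "2021-03-04"})
    b = from_site_b({"pid": "p1", "admit_date": "2021-03-01", "los_days": 3})
    print(a == b)
```

Once both sites' records are expressed as the same `Visit` shape, a query about visits means the same thing regardless of where the data originated.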

A privacy-preserving record linkage (PPRL) process keeps data secure. The enclave is HIPAA-compliant, and none of the data accessible through N3C contains identifying patient information. The N3C PPRL also enables deduplication, linkage of multiple data sets and cohort discovery. A PPRL linkage honest broker holds de-identified tokens and, in specific use cases, matches tokens generated across disparate data sets to create a single match ID. Regenstrief Institute is the linkage honest broker for N3C.
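The token-matching idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method N3C or its honest broker uses: the keyed hash, the choice of identifying fields, the shared key and the `match-N` ID format are all hypothetical, and production PPRL systems are considerably more elaborate.

```python
import hashlib
import hmac
from collections import defaultdict

# Assumption: contributing sites share a secret key that the researcher never sees.
SHARED_KEY = b"hypothetical-shared-secret"


def make_token(first: str, last: str, dob: str) -> str:
    """Turn normalized identifiers into a keyed hash so raw identifiers are not shared."""
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(SHARED_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def assign_match_ids(datasets: dict[str, list[tuple[str, str]]]) -> dict[str, list[tuple[str, str]]]:
    """Honest-broker step: group (dataset, local_id) pairs that carry the same token.

    `datasets` maps a data set name to a list of (local_record_id, token) pairs.
    """
    by_token: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for name, records in datasets.items():
        for local_id, token in records:
            by_token[token].append((name, local_id))
    # The broker hands back match IDs, not the tokens themselves.
    return {f"match-{i}": members for i, members in enumerate(by_token.values(), start=1)}


if __name__ == "__main__":
    t_ehr = make_token("Ada", "Lovelace", "1815-12-10")
    t_claims = make_token(" ada ", "LOVELACE", "1815-12-10")
    t_other = make_token("Grace", "Hopper", "1906-12-09")
    print(assign_match_ids({"ehr": [("r1", t_ehr)], "claims": [("c9", t_claims), ("c10", t_other)]}))
```

Because the broker works only with tokens, it can link or deduplicate records across data sets without handling names or birth dates directly.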

N3C users must agree to the data use agreement, promising not to abuse, download or re-identify data. Results may be downloaded, but not the original data. Researchers who register for N3C and request data must take a code of conduct exam in addition to an IT security training class. The request goes to the data access committee, which determines the minimum necessary level of access.

The released data can be accessed at three levels: synthetic data, de-identified data and limited data set. Synthetic data is a probabilistically generated data set rather than real patient data. NCATS has concluded that the synthetic data is secure enough and of high enough quality to be opened to everyone, and the organization is taking small steps toward that goal. Most citizen scientists will gain access at the synthetic data level, or level one. NCATS also gives synthetic data back to the 75 sites inputting data into the enclave so they can use it for their communities, according to Gersing.

“NCATS is committed to being the steward of data, not the owner of data, and to providing the resources people need to do the work,” he adds. “We take that on as our job so researchers can focus on science.”


The Future of Healthcare Research Depends on Collaboration

Gersing says that funding is a major obstacle to the development of other data enclaves. The costs for central processing units and continued harmonization efforts can be high. The public health crisis created by the pandemic made funding for research resources available in this specific instance. Whether funding for other projects becomes available or not, Gersing hopes more organizations will feel comfortable working together.

“Before COVID, people said they couldn’t risk sharing data. There was resistance. N3C has shown we can do this,” he says. “These sites all generously allowed it, with conditions, but that is still a big deal. We can do things from different electronic health records and different common data models. That’s why public health hasn’t been working — data was so diffuse. However, we can do this.”

