De-Identified and Anonymized Patient Data: Understanding the Difference and How They Drive Clinical Trials

The two data types differ in how they protect patient identity, but both can be used to expedite clinical trial efforts.

Doug Bonderud

Doug Bonderud is an award-winning writer capable of bridging the gap between complex and conversational across technology, innovation and the human condition.

The pandemic has been hard on healthcare. Shifting to remote work introduced a host of new challenges for IT teams, while evolving restrictions meant a backlog in critical patient procedures. This new normal has also impacted clinical trials. As noted by Center Watch, while more than 5,500 trials were planned in 2021 — 14 percent more than in 2020 and 19 percent more than in 2019 — the sheer number of these trials combined with ongoing staff shortages has created a logjam in trial efforts.

De-identified and anonymized data offer a way for healthcare organizations to expedite clinical trial efforts — if they’re properly implemented and applied. The caveat? They’re not identical. It’s important to understand the key data differences, use cases and best practices for improving clinical trial outcomes.

What Is De-Identified Data?

While de-identification and anonymization processes both look to remove key identifiers from data, they take different approaches that result in differing outcomes.

According to Victor Lee, vice president of machine learning and AI at TigerGraph, “de-identification is an important capability. It looks at a single item and removes sensitive information such as the person’s name or social security number, so outsiders can’t tell who it is. What’s considered sensitive depends on the use case. In the case of clinical trials, it could be a patient’s current health information or medical history.”
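As a rough illustration of the record-level removal Lee describes, the sketch below drops direct identifiers from a single patient record. The field names and the choice of which fields count as sensitive are assumptions made for this example; in practice they depend on the use case.

```python
# Hypothetical field names; what counts as sensitive depends on the use case.
DIRECT_IDENTIFIERS = {"name", "ssn", "email"}

def deidentify(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed direct identifiers."""
    return {key: value for key, value in record.items() if key not in identifiers}

# Made-up record for illustration only.
patient = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "email": "jane@example.com",
    "age": 47,
    "zip": "60601",
    "diagnosis": "type 2 diabetes",
}

deidentified = deidentify(patient)
print(deidentified)
# {'age': 47, 'zip': '60601', 'diagnosis': 'type 2 diabetes'}
```

Note that age and ZIP code survive this step. Those remaining fields are what make the re-identification problem possible.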

De-identified data may pose several problems for aggregate studies, however. First is the removal of key demographic information that could be used to pinpoint statistical significance. Second is the issue of re-identification. With access to large amounts of de-identified data and publicly available data sets, it may be possible to match de-identified records with their respective owners.

Lee highlights the example of a Netflix contest from over a decade ago. “The company had a contest to improve its algorithm and provided de-identified data about user viewing habits,” he says. “A researcher paired this data with external information — such as social media comments about rarely watched movies — to re-identify viewers.”
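The matching step behind this kind of re-identification can be sketched in a few lines. The example below is a simplified illustration, not a reconstruction of the Netflix case: the records, the field names and the `link_records` helper are all hypothetical. It joins de-identified rows to a public data set on shared attributes and treats a unique match as a re-identification.

```python
def link_records(deidentified, public, keys):
    """Match de-identified rows to public rows that share the same attribute values."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = {}
    for row in deidentified:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one candidate: the row is re-identified
            matches[row["record_id"]] = candidates[0]
    return matches

# Made-up data for illustration only.
trial_rows = [
    {"record_id": 1, "age": 47, "zip": "60601", "diagnosis": "type 2 diabetes"},
    {"record_id": 2, "age": 33, "zip": "60614", "diagnosis": "asthma"},
]
public_rows = [
    {"name": "Jane Doe", "age": 47, "zip": "60601"},
    {"name": "John Roe", "age": 33, "zip": "60614"},
    {"name": "Sam Poe", "age": 33, "zip": "60614"},
]

reidentified = link_records(trial_rows, public_rows, keys=("age", "zip"))
print(reidentified)
# {1: 'Jane Doe'}
```

Record 2 stays unmatched because two people in the public data share its age and ZIP code, which is the intuition behind grouping-based protections.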

What Is Anonymized Data?

Anonymization takes this identity obfuscation a step further without sacrificing statistical meaning.

“Techniques include converting specific record values to ranges which help generalize information or intentionally introducing dummy records,” says Lee. “Organizations need to identify sensitive parameters and which data is significant, and then anonymize data such that it retains statistical significance.”
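The range conversion Lee mentions might look like the following minimal sketch. The bucket width, the number of ZIP digits kept and both helper names are assumptions chosen for illustration, not values prescribed by any regulation.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=3):
    """Keep the leading characters of a ZIP code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

print(generalize_age(47))       # 40-49
print(generalize_zip("60601"))  # 606**
```

Wider ranges hide individuals more effectively but leave less detail for analysis, so the width is a tuning decision rather than a fixed rule.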

Another technique is k-anonymity, where the value of “k” is the minimum number of records that share the same values across a set of potentially identifying variables. For example, if k equals five, every combination of those variables that appears in the data set is shared by at least five records, which makes it harder to de-anonymize patient data.
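To make the definition concrete, the sketch below computes k for a small table by grouping rows on their potentially identifying variables and taking the size of the smallest group. The rows and column names are made up for illustration.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same values
    for the given potentially identifying variables (0 for an empty table)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values(), default=0)

# Made-up, already generalized rows for illustration only.
rows = [
    {"age": "40-49", "zip": "606**", "diagnosis": "type 2 diabetes"},
    {"age": "40-49", "zip": "606**", "diagnosis": "hypertension"},
    {"age": "40-49", "zip": "606**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "606**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "606**", "diagnosis": "migraine"},
]

k = k_anonymity(rows, quasi_identifiers=("age", "zip"))
print(k)  # 2
```

Here the smallest group holds two records, so the table is 2-anonymous for age and ZIP code. Reaching a higher k would require coarser ranges or withholding the rows in the smallest groups.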

How Do De-Identified and Anonymized Data Drive Clinical Trials?

These data sets make it possible for researchers to perform relevant analysis and comparison without compromising patient privacy. What’s more, this process can happen quickly since data can be fed into purpose-built trial algorithms rather than being scrutinized by staff at every step of the trial process. Leveraging de-identification and anonymization techniques can also help healthcare organizations ensure compliance with evolving regulations such as the General Data Protection Regulation, the California Consumer Privacy Act and HIPAA.

For CCPA, companies must take steps to de-identify information such that it cannot “reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer.” HIPAA also requires de-identification using one of two methods: safe harbor or expert determination.

Under safe harbor methods, organizations must remove a host of potential identifiers, including names, email addresses, IP addresses, social security numbers, account numbers and any biometric identifiers. Expert determination, meanwhile, requires the evaluation of de-identifying techniques by someone with knowledge and experience in this area to verify that the overall risk of re-identification is small. In the case of GDPR, data anonymization is required to ensure that an individual’s personal data cannot be reconstructed and used.

In practice, both de-identification and anonymization help clear the way for improved clinical trial speed without sacrificing patient privacy. It’s worth noting, however, that regulatory obligations are a moving target: While CCPA and HIPAA currently require de-identification, this may change as larger and larger data sets are leveraged to inform new healthcare efforts.

As Lee notes, de-identification is now considered the “base expectation” for data handling in clinical trials. And while it’s possible to strike a balance between privacy and clinical trial procedures using this method, statistical analysis at scale typically leans on anonymization.

What Are Best Practices for Applying These Data Frameworks?

When it comes to anonymizing or de-identifying data, Lee suggests starting with an evaluation of the use case: How many records are required, what types of sensitive data are being measured and which regulations apply?

Next, he recommends identifying the best-fit techniques for removing key identifiers, such as generalizing data sets, inserting dummy records or introducing random “noise” that limits re-identification risk without disrupting key data values. Here, the use of pre-built, pre-programmed algorithms can help ensure the number of people who know how data is being transformed remains as low as possible.
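The noise technique can be sketched as follows. This is a deliberately simple illustration that adds bounded uniform noise to each value; it is not a formal privacy mechanism, and schemes such as differential privacy calibrate their noise differently. The `add_noise` helper, the spread and the synthetic ages are all assumptions for this example.

```python
import random
from statistics import mean

def add_noise(values, spread=3.0, seed=None):
    """Add independent uniform noise in [-spread, spread] to each value."""
    rng = random.Random(seed)  # a seed makes the illustration repeatable
    return [value + rng.uniform(-spread, spread) for value in values]

# Synthetic ages for illustration only, not real patient data.
ages = [20 + (i % 60) for i in range(1000)]
noisy_ages = add_noise(ages, spread=3.0, seed=42)

# Individual values shift, but the average across many records stays close.
print(round(mean(ages), 2), round(mean(noisy_ages), 2))
```

Each individual value moves by up to three years, while the average across the whole set barely changes. A larger spread protects individuals more but blurs the statistics more as well.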

Finally, Lee makes it clear that “anonymity is not black and white. It’s a scaled measure that comes with a trade-off between precision and privacy.”

Keep Patient Data Significant and Safe

Data de-identification and anonymization make it possible for healthcare organizations to expedite clinical trial analysis without putting patient information at risk. Which approach best fits trial efforts depends on the nature of data collected and used, any applicable compliance legislation, and the volume of records required to achieve statistical significance.

HealthTech Magazine
