What Is De-Identified Data?
While de-identification and anonymization processes both look to remove key identifiers from data, they take different approaches that result in differing outcomes.
According to Victor Lee, vice president of machine learning and AI at TigerGraph, “de-identification is an important capability. It looks at a single item and removes sensitive information such as the person’s name or social security number, so outsiders can’t tell who it is. What’s considered sensitive depends on the use case. In the case of clinical trials, it could be a patient’s current health information or medical history.”
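As a rough illustration of the single-record approach Lee describes, the sketch below drops direct identifiers from one record. The field names and the `deidentify` helper are hypothetical, chosen only for this example; which fields count as sensitive depends on the use case.

```python
# Hypothetical set of fields treated as direct identifiers in this example.
DIRECT_IDENTIFIERS = {"name", "ssn"}


def deidentify(record, sensitive_fields=DIRECT_IDENTIFIERS):
    """Return a copy of the record with the sensitive fields removed."""
    return {key: value for key, value in record.items()
            if key not in sensitive_fields}


# Invented sample record, not real patient data.
patient = {"name": "Jane Doe", "ssn": "000-00-0000", "age": 47, "diagnosis": "asthma"}
deidentified = deidentify(patient)
print(deidentified)
```

The original record is left unchanged, and fields such as age and diagnosis remain in the output, which is why de-identification alone may not prevent a record from being traced back to a person.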
De-identified data may pose several problems for aggregate studies, however. First, stripping identifiers can also remove key demographic information that researchers need to establish statistical significance. Second is the issue of re-identification. With access to large amounts of de-identified data and publicly available data sets, it may be possible to match de-identified records to the people they describe.
Lee highlights the example of a Netflix contest from over a decade ago. “The company had a contest to improve its algorithm and provided de-identified data about user viewing habits,” he says. “A researcher paired this data with external information, such as social media comments about rarely watched movies, to re-identify viewers.”
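The matching step behind this kind of re-identification can be sketched as a simple join on shared attributes. This is a simplified illustration with invented data, not the method used in the Netflix case; the field names, the `link` helper and the choice of ZIP code, birth year and sex as linking attributes are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Invented "de-identified" release: names removed, other attributes kept.
deidentified_records = [
    {"zip": "02139", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "94105", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Invented public data set that includes names.
public_records = [
    {"name": "Jane Doe", "zip": "02139", "birth_year": 1975, "sex": "F"},
    {"name": "John Roe", "zip": "60601", "birth_year": 1990, "sex": "M"},
]

# Attributes assumed to appear in both data sets.
LINK_KEYS = ("zip", "birth_year", "sex")


def link(deidentified, public, keys=LINK_KEYS):
    """Return (name, diagnosis) pairs for records that match unambiguously."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[key_of(person)].append(person["name"])

    release_counts = Counter(key_of(record) for record in deidentified)

    matches = []
    for record in deidentified:
        key = key_of(record)
        names = names_by_key.get(key, [])
        # Only count a match when it is unique on both sides.
        if len(names) == 1 and release_counts[key] == 1:
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(deidentified_records, public_records)
print(reidentified)
```

In this toy data, only one released record shares its combination of attributes with exactly one named person, so only that record is re-identified. The more attributes a release retains, the more likely such unique combinations become.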
What Is Anonymized Data?
Anonymization takes this identity obfuscation a step further while aiming to preserve the statistical meaning of the data.
“Techniques include converting specific record values to ranges which help generalize information or intentionally introducing dummy records,” says Lee. “Organizations need to identify sensitive parameters and which data is significant, and then anonymize data such that it retains statistical significance.”
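The range conversion Lee mentions might look something like the following minimal sketch. The helper names, the 10-year age bands and the three-digit ZIP prefix are assumptions chosen for illustration; an appropriate level of generalization depends on the data and the study.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters of a ZIP code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


# Invented sample records, not real patient data.
records = [
    {"age": 47, "zip": "02139", "diagnosis": "asthma"},
    {"age": 42, "zip": "02141", "diagnosis": "diabetes"},
]

generalized = [
    {**record,
     "age": generalize_age(record["age"]),
     "zip": generalize_zip(record["zip"])}
    for record in records
]
print(generalized)
```

After generalization, both sample records share the same age band and ZIP prefix, so they can no longer be told apart on those fields, while the diagnosis values used for analysis are untouched.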
Another technique is what’s known as k-anonymity, where the value of “k” indicates the minimum number of records that share the same values for a set of potentially identifying variables. For example, if k equals five, every combination of those variables that appears in the data set is shared by at least five records. No patient can then be narrowed down to a group smaller than five on those variables alone, which makes it harder to de-anonymize patient data.
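To make the definition concrete, the sketch below computes k for a small invented data set by grouping records on their potentially identifying variables and taking the size of the smallest group. The `k_anonymity` function and the field names are hypothetical, written only for this illustration.

```python
from collections import Counter


def k_anonymity(records, identifying_fields):
    """Return the size of the smallest group of records that share
    the same values across the given fields (0 for an empty data set)."""
    groups = Counter(
        tuple(record[field] for field in identifying_fields)
        for record in records
    )
    return min(groups.values()) if groups else 0


# Invented, already-generalized sample records.
records = [
    {"age": "40-49", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "021**", "diagnosis": "migraine"},
    {"age": "50-59", "zip": "021**", "diagnosis": "asthma"},
    {"age": "50-59", "zip": "021**", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age", "zip"])
print(k)  # prints 2: the smallest group (ages 50-59) has two records
```

Here the data set is 2-anonymous with respect to age band and ZIP prefix. Reaching k equals five would require broader ranges, more records or suppression of the smaller group.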
How Do De-Identified and Anonymized Data Drive Clinical Trials?
These data sets make it possible for researchers to perform relevant analysis and comparison without compromising patient privacy. What’s more, this process can happen quickly since data can be fed into purpose-built trial algorithms rather than being scrutinized by staff at every step of the trial process. Leveraging de-identification and anonymization techniques can also help healthcare organizations ensure compliance with evolving regulations such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA) and HIPAA.
Under the CCPA, information counts as de-identified only if it cannot “reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer.” HIPAA, for its part, recognizes two methods of de-identification: safe harbor, which removes 18 specified categories of identifiers, and expert determination, in which a qualified expert concludes that the risk of re-identification is very small.