How Real a Threat Is “De-Anonymization”?

“De-anonymization” is an ugly word and a scary concept. The idea is that if enough anonymous information about an individual is collected, mining the data can create a profile of a unique person, which can then be linked to other public information to attach an identity to the data. This is clearly possible in theory and has been demonstrated in research. But how much of a problem is it in the real world? And is the threat great enough to justify restrictions on the collection of “non-personally identifiable information”?

The question is not academic. It plays an important part in a case now before the U.S. Supreme Court, Sorrell v. IMS Health Inc. The case involves a challenge to a Vermont law that prohibits the sale of prescription information identifying the prescriber without the doctor’s consent. On its face, the case is about doctors and commercial free speech, not patients and privacy. But amicus curiae briefs filed by the Electronic Privacy Information Center, the Electronic Frontier Foundation, and others argue that the real issue is patient privacy.

As EPIC argues in its brief:

The “deidentification” technique adopted by the Respondents in this matter does not adequately safeguard the medical privacy of Vermont residents or the residents of other states whose personal prescribing information could be obtained by data-mining firms and subsequently sold to pharmaceutical companies. These records include the prescriber’s name and address; the name, dosage, and quantity of the drug prescribed; the date and location at which the prescription was filled; and the patient’s age and gender. The only missing element – the patient’s actual name – is concealed by a weak cryptographic technique* that does not actually prevent reidentification of the patient by Respondent. In such circumstance, the Vermont law, and the many other similar state confidentiality laws, seek to safeguard personal information that is without question among the most sensitive and most deserving of protection.

Jane Yakowitz, a Brooklyn Law School professor and advocate of the notion of a “data commons,” challenges this premise in a new paper, “The Illusory Privacy Problem in Sorrell v. IMS Health,” written with Daniel Barth-Jones, a Columbia University epidemiologist. The paper concedes the possibility that information in prescription records could be linked back to individual patients, but argues that the probability of that happening is vanishingly small. While research by Latanya Sweeney of Carnegie Mellon University has shown that a five-digit zip code, date of birth, and gender are sufficient to identify an individual uniquely about 63% of the time, the IMS Health data contain only a three-digit zip code and the year, not day, of birth. That reduces the likelihood of identification by several orders of magnitude.
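
To see why coarser fields matter, here is a minimal Python sketch that counts how many records are unique on a given combination of fields. The six records are invented for illustration and are not drawn from the IMS Health data or from any study. The function names (`fine_key`, `coarse_key`, `unique_fraction`) are likewise hypothetical.

```python
from collections import Counter

# Hypothetical toy records: (five-digit zip, date of birth, gender).
# These values are made up for illustration only.
RECORDS = [
    ("05401", "1970-03-14", "F"),
    ("05401", "1970-08-02", "F"),
    ("05403", "1970-03-14", "F"),
    ("05446", "1982-11-30", "M"),
    ("05452", "1982-01-09", "M"),
    ("05601", "1955-06-21", "M"),
]


def fine_key(record):
    """Full five-digit zip, full date of birth, gender."""
    zip5, dob, gender = record
    return (zip5, dob, gender)


def coarse_key(record):
    """Three-digit zip prefix, year of birth only, gender."""
    zip5, dob, gender = record
    return (zip5[:3], dob[:4], gender)


def unique_fraction(records, key):
    """Share of records whose key is shared with no other record."""
    counts = Counter(key(r) for r in records)
    unique = sum(1 for r in records if counts[key(r)] == 1)
    return unique / len(records)


if __name__ == "__main__":
    print("fine fields:  ", unique_fraction(RECORDS, fine_key))
    print("coarse fields:", unique_fraction(RECORDS, coarse_key))
```

In this toy set every record is unique on the fine fields, while only one of the six remains unique once the zip code is truncated and the birth date is reduced to a year. Real populations are far larger, so the actual proportions will differ, but the direction of the effect is the same.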

The issue raised by Yakowitz, especially in her data commons paper, is that the risk of de-anonymization has to be weighed against the benefits of amassing the data. Admittedly, there’s not much loss to society if IMS Health can’t sell prescription data to marketers. But there could be a considerable loss if researchers lose access to great masses of aggregated data. We are just at the point where the collection and analysis of vast amounts of data is becoming routinely practical. While there may be considerable risks in assembling that data, there is also a wealth of information about ourselves and our society that could be obtained from it. The debate must weigh both benefits and risks.

*–EPIC’s complaint about the weakness of the anonymization of patient identifiers is a red herring. IMS Health uses the MD5 algorithm to create a “hash” of the patient identifier. It is true that there are significant weaknesses in MD5 and that it has been widely replaced by newer hash algorithms. But the concerns have focused mainly on the use of MD5 in creating digital signatures. Going from an MD5 hash back to an individual patient identifier remains extremely difficult, to the point where the weakness of MD5 has virtually no effect on the possibility of re-linking.
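
For readers unfamiliar with hashing, the following Python sketch shows the general idea of replacing an identifier with its MD5 hash. It is an illustration only: the exact input IMS Health hashes, and whether any salt or key is mixed in, are not described here, so the function `pseudonymize` and the sample identifiers are assumptions.

```python
import hashlib


def pseudonymize(patient_id: str) -> str:
    """Return the MD5 hex digest of an identifier (hypothetical helper).

    Assumes the identifier is hashed as plain UTF-8 text with no salt or
    key; the real scheme may differ.
    """
    return hashlib.md5(patient_id.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Made-up identifiers, not real patient data.
    for patient_id in ["patient-0001", "patient-0002", "patient-0001"]:
        print(patient_id, "->", pseudonymize(patient_id))
```

The same identifier always produces the same 32-character token, so prescriptions for one patient can be grouped together without the name appearing in the records, while different identifiers produce different tokens.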

 

 
