In today’s world, more and more applications and their associated data are being migrated to the cloud. In Canada, for instance, the government has announced plans to consolidate hundreds of data centers into a government-run cloud (http://news.gc.ca/web/article-eng.do?nid=614499). One of the major questions consumers ask is: how private is their data?
Most data centers have several layers of physical as well as logical security built into them, and stored data can even be encrypted. But how do you protect an individual’s privacy when the data is used for research? This question keeps coming up in various scenarios. I am going to talk about one such situation that recently came up in a meeting between Riverbed Technology and one of our major Systems Integrators.
I learned about a very interesting HIPAA compliance challenge. Our Security Architect, from our CTO office, is on the privacy board of a health care research program. He was talking to our partner about a specific “privacy” challenge involved in de-identifying clinical records so that researchers can use the data without discovering who the individual patients are. On the back of a napkin (literally), he created a small record set:
| Name | Zip | DoB | Pain Medication | Treatment |
|---|---|---|---|---|
| Alice | 98761 | 09/01/1960 | Aspirin | Leg Pain |
| Bob | 98712 | 01/01/1958 | Ibuprofen | Leg Pain |
| Joe | 98743 | 12/15/1955 | Motrin | Phantom leg pain |
| Smith | 98785 | 01/01/1955 | Tylenol | Phantom leg pain |
| Walker | 98735 | 01/01/1971 | Motrin | Pain in amputated leg |
Obviously this data is identifiable! Easy algorithms exist to obfuscate the most obvious elements. Such algorithms would produce this record set:
| Name | Zip | YoB | Pain Medication | Treatment |
|---|---|---|---|---|
| ****** | 987** | 1960 | Aspirin | Leg Pain |
| ****** | 987** | 1958 | Ibuprofen | Leg Pain |
| ****** | 987** | 1955 | Motrin | Phantom leg pain |
| ****** | 987** | 1955 | Tylenol | Phantom leg pain |
| ****** | 987** | 1971 | Motrin | Pain in amputated leg |
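As a rough illustration of the kind of obfuscation described above, here is a minimal Python sketch that masks the name, keeps only the first three digits of the zip code, and reduces the date of birth to a year. The `Record` class, the `obfuscate` function, and the dictionary keys are hypothetical names chosen for this example; this is not the algorithm used by any particular product.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One clinical record, mirroring the columns of the napkin example."""
    name: str
    zip_code: str
    dob: str  # assumed to be MM/DD/YYYY, as in the example table
    medication: str
    treatment: str


def obfuscate(record):
    """Mask the name, keep a 3-digit zip prefix, and reduce DoB to a year."""
    return {
        "Name": "******",
        "Zip": record.zip_code[:3] + "**",
        "YoB": record.dob.split("/")[-1],
        "Pain Medication": record.medication,
        "Treatment": record.treatment,
    }


records = [
    Record("Alice", "98761", "09/01/1960", "Aspirin", "Leg Pain"),
    Record("Bob", "98712", "01/01/1958", "Ibuprofen", "Leg Pain"),
    Record("Joe", "98743", "12/15/1955", "Motrin", "Phantom leg pain"),
    Record("Smith", "98785", "01/01/1955", "Tylenol", "Phantom leg pain"),
    Record("Walker", "98735", "01/01/1971", "Motrin", "Pain in amputated leg"),
]

masked = [obfuscate(r) for r in records]
for row in masked:
    print(row)
```

Note that the medication and treatment columns pass through untouched, which is exactly where the remaining risk lies.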
For many kinds of analysis, broad geographic area and birth year are important. You’d guess that even a researcher who resided in a location with a 987 zip code would be unable to identify individual records. Lots of people experience pain, and it’s reasonable to assume that a large subset of patients might opt for a common medication such as Motrin.
But note that the two people who take Motrin take it for “Phantom leg pain” and “Pain in amputated leg”.
Phantom leg pain is pretty rare, occurring only in people with missing limbs, a condition that has the unfortunate characteristic of being very obvious. It is essentially the same as the pain described as “Pain in amputated leg”. Do you think a researcher living in a 987 zip code might know who this amputee is? Probably. Does that researcher now know more than he or she needs to know about this person? Definitely. In this case, the de-identification requirement failed, and HIPAA compliance is at risk.
Stingray TrafficScript to the rescue
With a simple TrafficScript rule in Stingray Traffic Manager, you can block the transmission of data containing combinations of elements that would, in this case, carry the risk of uniquely identifying specific individuals. For example, from a list of diagnoses and a list of treatments, you could create a dataset of potentially identifying pairs. Then, whenever one of these pairs appears in the output of an application, your rule can follow whatever policy you’ve decided is appropriate. Your policy could require, for instance:
· Deleting the entire record
· Removing the pair
· Removing just the diagnosis
· Removing just the treatment
Each of these choices requires making some kind of tradeoff that lowers the total value of the dataset to the researcher. The overarching point to remember is that security decisions are best made in consultation with business units and other stakeholders. Once those decisions are made, then you can configure Stingray Traffic Manager to enforce your policy.
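To make the four policy options concrete, here is a small Python sketch of the pair-matching idea. It is only an illustration of the logic, not TrafficScript and not the rule used later in this post. The `IDENTIFYING_PAIRS` set, the `apply_policy` function, and the policy names are all hypothetical, and the sketch assumes that the table’s “Treatment” column holds the diagnosis text while “Pain Medication” holds the treatment.

```python
# Hypothetical (diagnosis, treatment) pairs judged to be identifying.
# Assumption: "Treatment" holds the diagnosis, "Pain Medication" the treatment.
IDENTIFYING_PAIRS = {
    ("phantom leg pain", "motrin"),
    ("pain in amputated leg", "motrin"),
}

POLICIES = {"delete_record", "remove_pair", "remove_diagnosis", "remove_treatment"}


def apply_policy(rows, policy):
    """Return new rows with the chosen policy applied to identifying pairs."""
    if policy not in POLICIES:
        raise ValueError("unknown policy: " + policy)
    result = []
    for row in rows:
        pair = (row["Treatment"].lower(), row["Pain Medication"].lower())
        if pair not in IDENTIFYING_PAIRS:
            result.append(dict(row))  # harmless record, pass it through
            continue
        if policy == "delete_record":
            continue  # drop the whole record
        cleaned = dict(row)
        if policy in ("remove_pair", "remove_diagnosis"):
            cleaned["Treatment"] = ""
        if policy in ("remove_pair", "remove_treatment"):
            cleaned["Pain Medication"] = ""
        result.append(cleaned)
    return result


rows = [
    {"YoB": "1960", "Pain Medication": "Aspirin", "Treatment": "Leg Pain"},
    {"YoB": "1958", "Pain Medication": "Ibuprofen", "Treatment": "Leg Pain"},
    {"YoB": "1955", "Pain Medication": "Motrin", "Treatment": "Phantom leg pain"},
    {"YoB": "1955", "Pain Medication": "Tylenol", "Treatment": "Phantom leg pain"},
    {"YoB": "1971", "Pain Medication": "Motrin", "Treatment": "Pain in amputated leg"},
]

for policy in sorted(POLICIES):
    print(policy, apply_policy(rows, policy))
```

Running each policy against the same rows makes the tradeoff visible: deleting records shrinks the dataset, while blanking fields keeps the row count but loses detail.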
Here’s an example TrafficScript that would provide the data to the researcher but would still protect the individuals.
    $keywords = data.get( "runonce" );                   # retrieve keywords from memory
    if( !$keywords ) {
       $keywords = resource.getLines( "HipaaWords" );    # read keywords from disk
       data.set( "runonce", $keywords );                 # save keywords in memory
    }

    $response = http.getResponseBody();                  # get the response
    foreach( $word in $keywords ) {
       $response = string.replaceAllI( $response, $word, "" );   # remove keyword
    }
    http.setResponseBody( $response );                   # update the modified response
The end result of running this script is a results table as follows:
| Name | Zip | YoB | Pain Medication | Treatment |
|---|---|---|---|---|
| ****** | 987** | 1960 | Aspirin | Leg Pain |
| ****** | 987** | 1958 | Ibuprofen | Leg Pain |
| ****** | 987** | 1955 | Motrin | leg pain |
| ****** | 987** | 1955 | Tylenol | leg pain |
| ****** | 987** | 1971 | Motrin | Pain in leg |
The first thing we do is create a simple list of all the words and phrases that we don’t want to display in the end result. I call this list “HipaaWords”. For our example, it is simply a text file containing the following two lines.
Hipaa Words text file:
Phantom
amputated
The advantages of keeping this in a file are:
- A text file can be easily loaded into the reference section of the Stingray Traffic Manager.
- As our selection criteria change, it is easy to change the contents of the file.
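For readers who want to try the keyword-file idea outside of Stingray, the following Python sketch mimics the same logic: read one keyword per line, then remove each keyword case-insensitively from the text. The function names and the in-memory `HIPAA_WORDS_FILE` string are assumptions made for this example, and the `tidy` helper that collapses leftover whitespace is an extra step that is not part of the TrafficScript rule.

```python
import re

# Stand-in for the contents of the HipaaWords text file (one keyword per line).
HIPAA_WORDS_FILE = "Phantom\namputated\n"


def load_keywords(text):
    """Split the file contents into keywords, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def remove_keywords(body, keywords):
    """Remove every keyword from the body, ignoring case."""
    for word in keywords:
        body = re.sub(re.escape(word), "", body, flags=re.IGNORECASE)
    return body


def tidy(value):
    """Collapse the whitespace left behind once a keyword is removed."""
    return " ".join(value.split())


keywords = load_keywords(HIPAA_WORDS_FILE)
for treatment in ["Leg Pain", "Phantom leg pain", "Pain in amputated leg"]:
    print(repr(tidy(remove_keywords(treatment, keywords))))
```

Skipping blank lines matters here: an empty keyword would match everywhere, so a stray empty line in the file should not be treated as a word to remove.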
Authors: Raja Srinivasan, Jim Young