Sweeney Latanya, Yoo Ji Su, Perovich Laura, Boronow Katherine E, Brown Phil, Brody Julia Green
Harvard University, Cambridge, MA.
MIT Media Lab, Cambridge, MA.
Technol Sci. 2017;2017. Epub 2017 Aug 28.
Researchers are increasingly asked to share research data as part of publication and funding processes and to maximize the benefits of publicly funded research. The Safe Harbor provision of the U.S. Health Insurance Portability and Accountability Act (HIPAA) offers guidance to researchers by prescribing how to redact data for public sharing. For example, the provision requires removing explicit identifiers (such as name, address, and other personally identifiable information), reporting dates in years, and reducing some or all digits of a postal (or ZIP) code. Is this sufficient? Can research participants still be re-identified in research data that adhere to the HIPAA Safe Harbor standard? In 2006, researchers collected air and dust samples and interviewed residents of 50 homes from Bolinas and Richmond (Atchison Village and Liberty Village), California, to analyze the residents' exposure to pollutants. The study, known as the Northern California Household Exposure Study [1], led to publications that have been cited hundreds of times. We conducted experiments with separate "attacker" and "scorer" teams to see whether we could identify study participants from two versions of the data redacted beyond the HIPAA standard: one in which all dates were reported in ranges of 10 or 20 years, and another in which a study participant's birth year was reported exactly. The attackers were blinded to the names and addresses of the participants, and the scorers were blinded to the attackers' strategy.
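The three redaction steps named above (dropping explicit identifiers, reporting dates as years, and shortening the ZIP code) can be illustrated with a minimal sketch. This is not the study's procedure or a complete Safe Harbor implementation: the field names, the `redact_record` helper, and the sample record are all hypothetical, and keeping three ZIP digits is an assumption made for the example.

```python
from datetime import date

# Hypothetical field names, chosen only for illustration.
EXPLICIT_IDENTIFIERS = {"name", "address", "phone", "email"}


def redact_record(record, zip_digits=3):
    """Return a redacted copy of a record (illustrative sketch only)."""
    redacted = {}
    for field, value in record.items():
        if field in EXPLICIT_IDENTIFIERS:
            continue  # drop explicit identifiers entirely
        if isinstance(value, date):
            redacted[field] = value.year  # report dates as years only
        elif field == "zip":
            # keep only the leading digits (assumed here to be three)
            redacted[field] = str(value)[:zip_digits]
        else:
            redacted[field] = value  # other fields pass through unchanged
    return redacted


# Made-up record, not taken from the study data.
example = {
    "name": "Jane Doe",
    "address": "1 Example St",
    "zip": "94801",
    "birth_date": date(1960, 5, 17),
    "dust_sample_id": "S-001",
}
print(redact_record(example))
```

The fields that survive redaction (here a birth year, a ZIP prefix, and the study measurements) are the kind of residual information whose re-identification risk the experiments examine.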