Center for Clinical Informatics, Stanford University, Stanford, California 94305, USA.
J Am Med Inform Assoc. 2012 Jun;19(e1):e157-61. doi: 10.1136/amiajnl-2011-000329. Epub 2012 Feb 1.
This study addresses the challenge of balancing privacy with the need to create cross-site research registry records on individual patients, while still matching the data for a given patient as he or she moves between participating sites. We evaluated a strategy of generating anonymous identifiers from real identifiers such that the chance of a shared patient being accurately identified was maximized and the chance of incorrectly joining two records belonging to different people was minimized.
Our hypothesis was that most variation in names occurs after the first two letters, and that date of birth is highly reliable, so a single match variable consisting of a hashed string built from the first two letters of the patient's first and last names plus their date of birth would have the desired characteristics. We compared the match characteristics (false-positive rate versus false-negative rate) of our chosen variable against those of both Social Security Numbers (SSNs) and full names.
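The match variable described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the hash function, the name normalization, or the date format, so SHA-256, upper-casing with non-letters removed, and ISO 8601 dates are choices made here, and `match_key` is a hypothetical name.

```python
import hashlib
from datetime import date


def match_key(first_name: str, last_name: str, dob: date) -> str:
    """Hash the 2-letter prefixes of both names plus the date of birth.

    Assumptions (not stated in the abstract): SHA-256 as the hash,
    upper-cased letters only for the prefixes, ISO 8601 for the date.
    """

    def prefix(name: str) -> str:
        # Keep alphabetic characters only, upper-case, first two letters.
        letters = [c for c in name.upper() if c.isalpha()]
        return "".join(letters[:2])

    raw = f"{prefix(first_name)}{prefix(last_name)}{dob.isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Spelling variation after the first two letters yields the same key.
key_site_a = match_key("Jonathan", "Smith", date(1970, 1, 2))
key_site_b = match_key("Jon", "Smyth", date(1970, 1, 2))
print(key_site_a == key_site_b)
```

Under these assumptions, two sites that apply the same normalization and hash produce identical keys for the same patient without exchanging the underlying names or date of birth.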
In a data set of 19 000 records, a derived match variable consisting of a 2-character prefix from both first and last names combined with date of birth had a sensitivity of 97%; by contrast, an anonymized identifier based on the patient's full names and date of birth had a sensitivity of only 87%, and the SSN had a sensitivity of 86%.
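For readers unfamiliar with the metric, sensitivity here is the fraction of true cross-site matches that a given identifier recovers. The sketch below shows one way to compute it, along with a count of false-positive links, from a set of known true pairs and a set of pairs proposed by a match variable. The function names and the toy record IDs are hypothetical and are not taken from the study data.

```python
def sensitivity(true_pairs, predicted_pairs) -> float:
    """Fraction of true record pairs that the match variable also linked."""
    # Treat pairs as unordered so (a, b) and (b, a) count as the same link.
    true_set = {frozenset(p) for p in true_pairs}
    pred_set = {frozenset(p) for p in predicted_pairs}
    if not true_set:
        raise ValueError("at least one true pair is required")
    return len(true_set & pred_set) / len(true_set)


def false_positive_links(true_pairs, predicted_pairs) -> int:
    """Number of proposed links that join records of different people."""
    true_set = {frozenset(p) for p in true_pairs}
    pred_set = {frozenset(p) for p in predicted_pairs}
    return len(pred_set - true_set)


# Toy example with made-up record IDs from two sites.
truth = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
proposed = [("b1", "a1"), ("a2", "b2"), ("a4", "b9")]
print(sensitivity(truth, proposed))           # 2 of 3 true pairs found
print(false_positive_links(truth, proposed))  # 1 incorrect link
```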
The approach we describe is most useful in situations where privacy policies preclude the full exchange of the identifiers required by more sophisticated and sensitive linkage algorithms. For data sets of sufficiently high quality, this effective approach, while producing a lower rate of matching than more complex algorithms, has the merit of being easy to explain to institutional review boards, adheres to the minimum necessary standard of the HIPAA Privacy Rule, and is faster and less cumbersome to implement than a full probabilistic linkage.