Sokolova Marina, El Emam Khaled, Arbuckle Luk, Neri Emilio, Rose Sean, Jonker Elizabeth
Electronic Health Information Laboratory, CHEO Research Institute, Ottawa, ON, Canada.
J Med Internet Res. 2012 Jul 9;14(4):e95. doi: 10.2196/jmir.1898.
Users of peer-to-peer (P2P) file-sharing networks risk the inadvertent disclosure of personal health information (PHI). In addition to potentially causing harm to the affected individuals, this can heighten the risk of data breaches for health information custodians. Automated PHI detection tools that crawl the P2P networks can identify PHI and alert custodians. While there has been previous work on the detection of personal information in electronic health records, there has been a dearth of research on the automated detection of PHI in heterogeneous user files.
Our objective was to build a system that accurately detects PHI in files sent through P2P file-sharing networks. The system, which we call P2P Watch, uses a pipeline of text processing techniques to detect PHI automatically, and it processes unstructured text regardless of file format, document type, or content.
We developed P2P Watch to extract and analyze PHI in text files exchanged on P2P networks. We labeled texts as PHI if they contained identifiable information about a person (eg, name and date of birth) and specifics of the person's health (eg, diagnosis, prescriptions, and medical procedures). We evaluated the system's performance through its efficiency and effectiveness on 3924 files gathered from three P2P networks.
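The labeling rule above is a conjunction: a text counts as PHI only when it carries both a personal identifier and a health detail. A minimal sketch of that rule follows. The regular expressions, the keyword list, and the function names are all hypothetical stand-ins chosen for illustration; the abstract does not describe the actual techniques in the P2P Watch pipeline, and this sketch accepts either a name or a date of birth as an identifier, which is an assumption.

```python
import re

# Hypothetical patterns for illustration only; NOT the rules used by P2P Watch.
# A titled name such as "Mr. John Smith" or "Dr Lee".
NAME_RE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)?")
# A date of birth written as "DOB: 1970-01-02" or "date of birth 1970-01-02".
DOB_RE = re.compile(
    r"\b(?:DOB|date of birth)\b\s*:?\s*\d{4}-\d{2}-\d{2}", re.IGNORECASE
)
# Assumed keyword list standing in for diagnosis, prescription, and procedure terms.
HEALTH_TERMS = (
    "diagnosis", "diagnosed", "prescription", "prescribed", "surgery", "procedure",
)


def has_identifier(text: str) -> bool:
    """True if the text contains a titled name or a date of birth."""
    return bool(NAME_RE.search(text) or DOB_RE.search(text))


def has_health_detail(text: str) -> bool:
    """True if the text mentions any of the assumed health keywords."""
    lowered = text.lower()
    return any(term in lowered for term in HEALTH_TERMS)


def is_phi_candidate(text: str) -> bool:
    """Apply the conjunction: identifier AND health detail."""
    return has_identifier(text) and has_health_detail(text)
```

A rule this simple would also flag documents such as job applications by medical professionals, which name a person and mention health terms without disclosing anyone's health record.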
P2P Watch successfully processed 3924 P2P files of unknown content. A manual examination of 1578 randomly selected files that the system had marked as non-PHI confirmed that none of them contained PHI, giving a false-negative rate of zero in that sample. All 57 files that the system marked as PHI contained both personally identifiable information and health information: 11 files were PHI disclosures, and 46 files contained organizational materials such as unfilled insurance forms, job applications by medical professionals, and essays.
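The counts reported above can be turned into proportions with a few lines of arithmetic. The sketch below is an illustration, not a metric reported in the abstract: it assumes that only the 11 disclosure files count as true positives, and the variable names are hypothetical.

```python
# Counts taken from the results above.
flagged_as_phi = 57        # files the system marked as PHI
true_disclosures = 11      # flagged files that were actual PHI disclosures
organizational = 46        # flagged files with organizational materials
sampled_non_phi = 1578     # manually checked files marked as non-PHI
missed_phi_in_sample = 0   # PHI files found among the checked sample

# The two categories of flagged files account for every flagged file.
assert true_disclosures + organizational == flagged_as_phi

# Assumption: only the 11 disclosure files are counted as true positives.
disclosure_share = true_disclosures / flagged_as_phi
organizational_share = organizational / flagged_as_phi
false_negative_share = missed_phi_in_sample / sampled_non_phi

print(f"Disclosures among flagged files: {disclosure_share:.1%}")
print(f"Organizational materials among flagged files: {organizational_share:.1%}")
print(f"PHI found in the checked non-PHI sample: {false_negative_share:.1%}")
```

Under that assumption, roughly one flagged file in five was an actual disclosure, while the checked sample of files marked as non-PHI contained no missed PHI.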
PHI can be successfully detected in free-form textual files exchanged through P2P networks. Once the files with PHI are detected, affected individuals or data custodians can be alerted to take remedial action.