Harvard Medical School Center for Biomedical Informatics, Boston, MA USA.
BMC Med Inform Decis Mak. 2013 Oct 2;13:112. doi: 10.1186/1472-6947-13-112.
Physician notes routinely recorded during patient care represent a vast and underutilized resource for human disease studies on a population scale. Their use in research is primarily limited by the need to separate confidential patient information from clinical annotations, a process that is resource-intensive when performed manually. This study seeks to create an automated method for de-identifying physician notes that does not require large amounts of private information: in addition to training a model to recognize Protected Health Information (PHI) within private physician notes, we reverse the problem and train a model to recognize non-PHI words and phrases that appear in public medical texts.
Public and private medical text sources were analyzed to distinguish common medical words and phrases from Protected Health Information. Patient identifiers are generally nouns and numbers that appear infrequently in medical literature. To quantify this relationship, term frequencies and part-of-speech tags were compared between journal publications and physician notes. Standard medical concepts and phrases were then examined across ten medical dictionaries. Lists and rules were drawn from the US census database and previously published studies. In total, 28 features were used to train decision tree classifiers.
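The frequency comparison described above can be illustrated with a minimal sketch. It assumes a simple log ratio of relative token frequencies between a private and a public corpus; the function names, the `floor` smoothing value, and the toy corpora are hypothetical and are not taken from the study's feature definitions.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Return the relative frequency of each lowercased token."""
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def log_frequency_ratio(token, private_tf, public_tf, floor=1e-6):
    """Log ratio of private to public frequency.

    Tokens missing from a corpus are assigned `floor` (an assumed
    smoothing constant) so the ratio stays finite.
    """
    key = token.lower()
    return math.log(private_tf.get(key, floor) / public_tf.get(key, floor))


# Toy corpora, invented for illustration only.
public_tokens = "the patient had an elevated white blood cell count".split()
private_tokens = "the patient john smith had an elevated count".split()

public_tf = term_frequencies(public_tokens)
private_tf = term_frequencies(private_tokens)

for word in ("smith", "the", "white"):
    print(word, round(log_frequency_ratio(word, private_tf, public_tf), 2))
```

In this toy setting a name such as "smith" scores strongly positive because it never appears in the public text, while "white" scores negative because it appears only there. A real feature of this kind would be one input among many to the classifier, not a decision rule on its own.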
The model successfully recalled 98% of PHI tokens from 220 discharge summaries. Cost-sensitive classification was used to weight recall over precision (98% F10 score, 76% F1 score). More than half of the false negatives were the word "of" appearing in a hospital name. All patient names, phone numbers, and home addresses were at least partially redacted. Medical concepts such as "elevated white blood cell count" were informative for de-identification. The results exceed the previously approved criteria established by four Institutional Review Boards.
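The gap between the F10 and F1 scores follows from the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where a larger beta gives recall more weight. The sketch below implements that formula; the precision and recall values in the example are illustrative and are not the study's measurements.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta = 1 gives the usual F1.
    """
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Illustrative values only (not taken from the study).
example_precision = 0.5
example_recall = 1.0

print("F1 :", round(f_beta(example_precision, example_recall, beta=1), 3))
print("F10:", round(f_beta(example_precision, example_recall, beta=10), 3))
```

With beta = 10, recall counts 100 times as much as precision in the denominator weighting, so a classifier with high recall and modest precision can score near its recall on F10 while scoring much lower on F1.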
The results indicate that distributional differences between private and public medical text can be used to accurately classify PHI. The data and algorithms reported here are made freely available for evaluation and improvement.