Effects of personal identifier resynthesis on clinical text de-identification.

Affiliation

Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey.

Publication information

J Am Med Inform Assoc. 2010 Mar-Apr;17(2):159-68. doi: 10.1136/jamia.2009.002212.

DOI:10.1136/jamia.2009.002212

PMID:20190058

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3000784/

Abstract

OBJECTIVE

De-identified medical records are critical to biomedical research. Text de-identification software exists, and some tools include "resynthesis" components that replace real identifiers with synthetic identifiers. The goal of this research is to evaluate the effectiveness of de-identification software and to examine the bias that resynthesis may introduce.
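As a rough illustration of what a resynthesis step does, the sketch below replaces already-tagged identifier spans with synthetic values of the same type. It is not the MITRE toolkit's implementation: the `SURROGATES` pools, the `resynthesize` function, and the span format are all hypothetical, chosen only to make the idea concrete.

```python
from itertools import cycle

# Hypothetical surrogate pools (an assumption for illustration);
# a real system would draw from much larger dictionaries.
SURROGATES = {
    "NAME": ["Jane Doe", "John Roe"],
    "DATE": ["01/02/2000", "03/04/2001"],
}


def resynthesize(text, spans):
    """Replace each (start, end, kind) span with a synthetic value of that kind."""
    pools = {kind: cycle(values) for kind, values in SURROGATES.items()}
    seen = {}  # repeated identifiers map to the same surrogate
    out, cursor = [], 0
    for start, end, kind in sorted(spans):
        key = (kind, text[start:end])
        if key not in seen:
            seen[key] = next(pools[kind])
        out.append(text[cursor:start])  # untouched text before the span
        out.append(seen[key])           # synthetic replacement
        cursor = end
    out.append(text[cursor:])           # untouched tail
    return "".join(out)


note = "Mary Smith seen on 05/06/2009. Mary Smith to return."
spans = [(0, 10, "NAME"), (19, 29, "DATE"), (31, 41, "NAME")]
print(resynthesize(note, spans))
```

Because every surrogate here comes from a small, regular pool, the output is more uniform than real clinical text, which is one way a resynthesis step could regularize the data.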

DESIGN

We evaluated the open-source MITRE Identification Scrubber Toolkit, which includes a resynthesis capability, with clinical text from Vanderbilt University Medical Center patient records. We investigated four record classes from over 500 patients' files, including laboratory reports, medication orders, discharge summaries and clinical notes. We trained and tested the de-identification tool on real and resynthesized records.

MEASUREMENTS

We measured performance in terms of precision, recall, F-measure and accuracy for the detection of protected health identifiers as designated by the HIPAA Safe Harbor Rule.
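The four measures can be sketched as follows, assuming token-level scoring (each token is either part of an identifier or not) and the balanced F-measure (equal weight on precision and recall). The abstract states neither choice, so both are assumptions, and the `score` function and example labels are illustrative only.

```python
def score(gold, predicted):
    """Compute precision, recall, F-measure and accuracy from per-token labels.

    gold and predicted are equal-length sequences of 0/1 flags, where 1 means
    the token is (or is predicted to be) part of a protected identifier.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # identifier found
    fp = sum(1 for g, p in pairs if not g and p)      # false alarm
    fn = sum(1 for g, p in pairs if g and not p)      # identifier missed
    tn = sum(1 for g, p in pairs if not g and not p)  # ordinary token kept

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, f_measure, accuracy


gold = [1, 1, 0, 0, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 0, 1, 0]
precision, recall, f_measure, accuracy = score(gold, pred)
print(precision, recall, f_measure, accuracy)
```

Under this kind of scoring, accuracy counts the many correctly ignored non-identifier tokens, so it can remain high even when the F-measure drops noticeably.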

RESULTS

The de-identification tool was trained and tested on a collection of real and resynthesized Vanderbilt records. Training and testing on the real records yielded 0.990 accuracy and 0.960 F-measure. The results improved when the tool was trained and tested on resynthesized records, with 0.998 accuracy and 0.980 F-measure, but deteriorated moderately when it was trained on real records and tested on resynthesized records, with 0.989 accuracy and 0.862 F-measure. Moreover, the results declined significantly when the tool was trained on resynthesized records and tested on real records, with 0.942 accuracy and 0.728 F-measure.

CONCLUSION

The de-identification tool achieves high accuracy when training and test sets are homogeneous (ie, both real or both resynthesized records). The resynthesis component regularizes the data to make them less "realistic," resulting in loss of performance, particularly when training on resynthesized data and testing on real data.
