A certified de-identification system for all clinical text documents for information extraction at scale.

Author information

Radhakrishnan Lakshmi, Schenk Gundolf, Muenzen Kathleen, Oskotsky Boris, Ashouri Choshali Habibeh, Plunkett Thomas, Israni Sharat, Butte Atul J

Affiliations

Academic Research Services, Information Technology, University of California, San Francisco, San Francisco, California, USA.

Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California, USA.

Publication information

JAMIA Open. 2023 Jul 4;6(3):ooad045. doi: 10.1093/jamiaopen/ooad045. eCollection 2023 Oct.

DOI:10.1093/jamiaopen/ooad045

PMID:37416449

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10320112/

Abstract

OBJECTIVES

Clinical notes are a treasure trove of information on a patient's disease progression, medical history, and treatment plans, yet they are locked in secured databases that are accessible for research only after extensive ethics review. Removing personally identifying and protected health information (PII/PHI) from the records can reduce the need for additional Institutional Review Board (IRB) reviews. In this project, our goals were to: (1) develop a robust and scalable clinical text de-identification pipeline that complies with the de-identification standards of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule and (2) share routinely updated de-identified clinical notes with researchers.

MATERIALS AND METHODS

Building on our open-source de-identification software, Philter, we added features to: (1) make the algorithm and the de-identified data HIPAA compliant, which also implies redaction free of type 2 errors (that is, no missed PHI), as certified via external audit; (2) reduce over-redaction errors; and (3) normalize and shift date PHI. We also established a streamlined de-identification pipeline using MongoDB that automatically extracts clinical notes and provides truly de-identified notes to researchers at our institution, with monthly refreshes.
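The abstract does not describe how date PHI is normalized and shifted, so the following is only a minimal sketch of one common approach, not Philter's implementation. It assumes that every date belonging to a patient is moved back by the same secret, patient-specific number of days, so that intervals between events are preserved. The function names, the two supported date formats, the hashing scheme, and the `[DATE]` placeholder are all hypothetical.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Assumption: only MM/DD/YYYY and YYYY-MM-DD dates are recognized in this sketch.
DATE_PATTERN = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")


def patient_offset_days(patient_id: str, secret: str, max_days: int = 365) -> int:
    """Derive a stable shift of 1..max_days days from a secret and the patient ID."""
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def normalize_date(raw: str) -> datetime:
    """Parse either supported surface form into a datetime."""
    fmt = "%m/%d/%Y" if "/" in raw else "%Y-%m-%d"
    return datetime.strptime(raw, fmt)


def shift_dates(text: str, patient_id: str, secret: str) -> str:
    """Rewrite each recognized date as a shifted date in ISO format."""
    offset = timedelta(days=patient_offset_days(patient_id, secret))

    def replace(match: re.Match) -> str:
        try:
            shifted = normalize_date(match.group(0)) - offset
        except (ValueError, OverflowError):
            # Date-like text that cannot be parsed or shifted is redacted, not kept.
            return "[DATE]"
        return shifted.strftime("%Y-%m-%d")

    return DATE_PATTERN.sub(replace, text)


if __name__ == "__main__":
    note = "Admitted 03/05/2020, discharged 2020-03-12."
    print(shift_dates(note, patient_id="patient-1", secret="example-secret"))
```

Because the offset depends only on the secret and the patient identifier, repeated runs shift a patient's notes consistently, and the seven-day interval between the two dates in the usage example survives the shift. A production system would need to recognize far more date formats than this sketch does.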

RESULTS

To the best of our knowledge, the Philter V1.0 pipeline is currently the first and only certified, de-identified redaction pipeline that makes clinical notes available to researchers for nonhuman subjects' research, without further IRB approval needed. To date, we have made over 130 million certified de-identified clinical notes available to over 600 UCSF researchers. These notes were collected over the past 40 years, and represent data from 2,757,016 UCSF patients.
