
Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.

Authors

Grouin Cyril, Zweigenbaum Pierre

Affiliation

LIMSI-CNRS, Orsay, France.

Publication

Stud Health Technol Inform. 2013;192:476-80.

PMID: 23920600

Abstract

In this paper, we compare two approaches to automatically de-identifying medical records written in French: a rule-based system and a machine-learning system based on conditional random fields (CRF). Both systems were designed to process nine identifiers in a corpus of cardiology medical records. We performed two evaluations: first on 62 cardiology documents, and second on 10 foetopathology documents produced by optical character recognition (OCR), the latter to evaluate the robustness of our systems. In cardiology, we achieved an exact-match overall F-measure of 0.843 (rule-based) and 0.883 (machine-learning). While the rule-based system achieved good results on nominative data (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems were not designed for it and despite OCR errors, we obtained promising results: an exact-match overall F-measure of 0.681 (rule-based) and 0.638 (machine-learning). This demonstrates that existing tools can be applied to process new documents of lower quality.
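The abstract reports exact-match F-measures for a rule-based system. The sketch below illustrates what such a setup might look like in miniature: a few regular-expression rules for numerical identifiers, scored against gold annotations with exact span matching. The patterns, the labels (`DATE`, `PHONE`, `ZIP`, `TOWN`), the helper names, and the sample sentence are all hypothetical and chosen for illustration; they are not the rules or data used by the authors.

```python
import re

# Hypothetical rules for three numerical identifier types.
# These are illustrative assumptions, not the patterns from the paper.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),   # e.g. 12/03/2010
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),   # e.g. 01 23 45 67 89
    "ZIP": re.compile(r"\b\d{5}\b"),                    # e.g. 91400
}


def tag(text):
    """Return the set of (start, end, label) spans matched by the rules."""
    spans = set()
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.add((match.start(), match.end(), label))
    return spans


def span(text, substring, label):
    """Build a (start, end, label) span for the first occurrence of substring."""
    start = text.index(substring)
    return (start, start + len(substring), label)


def exact_match_f(gold, predicted):
    """Precision, recall, and F-measure where a prediction counts as correct
    only if its start, end, and label all equal those of a gold span."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Invented sample sentence with fake identifiers.
    sample = "Vu le 12/03/2010, 91400 Orsay, tel 01 23 45 67 89."
    gold = {
        span(sample, "12/03/2010", "DATE"),
        span(sample, "91400", "ZIP"),
        span(sample, "Orsay", "TOWN"),  # no rule covers towns in this sketch
        span(sample, "01 23 45 67 89", "PHONE"),
    }
    precision, recall, f_measure = exact_match_f(gold, tag(sample))
    print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

In this toy example the rules find every numerical identifier but miss the town, which mirrors the abstract's observation that categories such as towns and hospital names are harder to capture with hand-written patterns.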

Similar articles

1
Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.
Stud Health Technol Inform. 2013;192:476-80.
2
De-identification of clinical notes in French: towards a protocol for reference corpus development.
J Biomed Inform. 2014 Aug;50:151-61. doi: 10.1016/j.jbi.2013.12.014. Epub 2013 Dec 29.
3
Automatic de-identification of electronic medical records using token-level and character-level conditional random fields.
J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S47-S52. doi: 10.1016/j.jbi.2015.06.009. Epub 2015 Jun 26.
4
Proposal and evaluation of FASDIM, a Fast And Simple De-Identification Method for unstructured free-text clinical records.
Int J Med Inform. 2014 Apr;83(4):303-12. doi: 10.1016/j.ijmedinf.2013.11.005. Epub 2013 Dec 7.
5
Automatic detection of protected health information from clinic narratives.
J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S30-S38. doi: 10.1016/j.jbi.2015.06.015. Epub 2015 Jul 29.
6
Mining fall-related information in clinical notes: Comparison of rule-based and novel word embedding-based machine learning approaches.
J Biomed Inform. 2019 Feb;90:103103. doi: 10.1016/j.jbi.2019.103103. Epub 2019 Jan 9.
7
Extracting important information from Chinese Operation Notes with natural language processing methods.
J Biomed Inform. 2014 Apr;48:130-6. doi: 10.1016/j.jbi.2013.12.017. Epub 2014 Jan 31.
8
De-identification of health records using Anonym: effectiveness and robustness across datasets.
Artif Intell Med. 2014 Jul;61(3):145-51. doi: 10.1016/j.artmed.2014.03.006. Epub 2014 Apr 3.
9
Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus.
J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S20-S29. doi: 10.1016/j.jbi.2015.07.020. Epub 2015 Aug 28.
10
CRFs based de-identification of medical records.
J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S39-S46. doi: 10.1016/j.jbi.2015.08.012. Epub 2015 Aug 24.

Cited by

1
Harnessing Moderate-Sized Language Models for Reliable Patient Data Deidentification in Emergency Department Records: Algorithm Development, Validation, and Implementation Study.
JMIR AI. 2025 Apr 1;4:e57828. doi: 10.2196/57828.
2
Salience of Medical Concepts of Inside Clinical Texts and Outside Medical Records for Referred Cardiovascular Patients.
J Healthc Inform Res. 2019 Jan 28;3(2):200-219. doi: 10.1007/s41666-019-00044-5. eCollection 2019 Jun.
3
Investigation of the Utility of Features in a Clinical De-identification Model: A Demonstration Using EHR Pathology Reports for Advanced NSCLC Patients.
Front Digit Health. 2022 Feb 16;4:728922. doi: 10.3389/fdgth.2022.728922. eCollection 2022.
4
Improving domain adaptation in de-identification of electronic health records through self-training.
J Am Med Inform Assoc. 2021 Sep 18;28(10):2093-2100. doi: 10.1093/jamia/ocab128.
5
De-identifying free text of Japanese electronic health records.
J Biomed Semantics. 2020 Sep 21;11(1):11. doi: 10.1186/s13326-020-00227-9.
6
CAS: corpus of clinical cases in French.
J Biomed Semantics. 2020 Aug 6;11(1):7. doi: 10.1186/s13326-020-00225-x.
7
Implementing a Cloud Based Method for Protected Clinical Trial Data Sharing.
Pac Symp Biocomput. 2020;25:647-658.