Grouin Cyril, Zweigenbaum Pierre
LIMSI-CNRS, Orsay, France.
Stud Health Technol Inform. 2013;192:476-80.
In this paper, we compare two approaches to automatically de-identifying medical records written in French: a rule-based system and a machine-learning-based system using conditional random fields (CRF). Both systems were designed to process nine categories of identifiers in a corpus of cardiology records. We performed two evaluations: first on 62 cardiology documents, and second on 10 foetopathology documents produced by optical character recognition (OCR), to assess the robustness of our systems. In cardiology, we achieved an exact-match overall F-measure of 0.843 (rule-based) and 0.883 (machine-learning). While the rule-based system gave good results on nominative data (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems were not designed for this corpus and despite OCR errors, we obtained promising results: an exact-match overall F-measure of 0.681 (rule-based) and 0.638 (machine-learning). This demonstrates that existing tools can be applied to process new documents of lower quality.
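For readers unfamiliar with the metric, the following is a minimal Python sketch of how an exact-match F-measure can be computed. It is an illustration under stated assumptions, not the authors' evaluation script: the span representation, the function name `exact_match_f_measure`, and the toy annotations are hypothetical, and the abstract does not say how the overall score is aggregated across categories (pooling all spans, as done here, is an assumption).

```python
from typing import Set, Tuple

# Hypothetical span representation: (document id, start offset, end offset, category).
Span = Tuple[str, int, int, str]


def exact_match_f_measure(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) under exact matching.

    A predicted span counts as correct only if its document, both
    boundaries, and its category are identical to a gold span.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy annotations (invented for illustration, not taken from the corpus).
gold = {
    ("doc1", 0, 6, "last_name"),
    ("doc1", 20, 30, "date"),
    ("doc1", 40, 45, "zip_code"),
}
predicted = {
    ("doc1", 0, 6, "last_name"),
    ("doc1", 20, 29, "date"),      # boundary off by one: not an exact match
    ("doc1", 40, 45, "zip_code"),
    ("doc1", 50, 55, "town"),      # spurious prediction
}

precision, recall, f_measure = exact_match_f_measure(gold, predicted)
```

In this toy case two of the four predictions match exactly, so the date span with a one-character boundary error is penalised both as a false positive and as a missed gold span. This strictness is one reason OCR noise can lower exact-match scores.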