Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France.
Cervico-facial Oncology Surgical Department, University Institute of Face and Neck, University of Côte d'Azur, Nice, France.
JCO Clin Cancer Inform. 2022 Jul;6:e2100199. doi: 10.1200/CCI.21.00199.
Electronic medical records are a valuable source of information about patients' clinical status, but they are often free-text documents that require laborious manual review before they can be used. Techniques from computer science have been investigated, but the literature has paid little attention to non-English texts. We developed RUBY, a tool designed in collaboration with IBM-France, to automatically structure clinical information from French medical records of patients with breast cancer.
RUBY, which combines state-of-the-art Named Entity Recognition models with keyword extraction and postprocessing rules, was applied to clinical texts. We evaluated the precision of RUBY in extracting the target information.
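To make the keyword-extraction and postprocessing steps concrete, here is a minimal Python sketch. It is not RUBY's implementation: the lexicons, field names, and the date rule are hypothetical, and the Named Entity Recognition component is left out entirely, with plain regular-expression matching standing in for it.

```python
import re
from datetime import datetime

# Hypothetical keyword lexicons: French surface form -> normalized label.
SURGERY_KEYWORDS = {"mastectomie": "mastectomy", "tumorectomie": "lumpectomy"}
LATERALITY_KEYWORDS = {"droite": "right", "droit": "right", "gauche": "left"}

# Dates written as dd/mm/yyyy, the usual layout in French documents.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def find_keyword(text, lexicon):
    """Return the label of the first lexicon term (in lexicon order) present in text."""
    lowered = text.lower()
    for term, label in lexicon.items():
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            return label
    return None


def normalize_date(text):
    """Postprocessing rule: rewrite the first dd/mm/yyyy date as ISO 8601."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month, year = (int(group) for group in match.groups())
    try:
        return datetime(year, month, day).date().isoformat()
    except ValueError:
        # Impossible calendar dates (e.g. 31/02/2019) are discarded.
        return None


def structure_report(text):
    """Turn one free-text sentence into a small structured record."""
    return {
        "procedure": find_keyword(text, SURGERY_KEYWORDS),
        "laterality": find_keyword(text, LATERALITY_KEYWORDS),
        "date": normalize_date(text),
    }


if __name__ == "__main__":
    print(structure_report("Mastectomie totale droite réalisée le 12/03/2019."))
```

A fixed lexicon of this kind only covers entities with a small, stable vocabulary. Highly variable entities are where a trained recognition model would be expected to do the work.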
RUBY has an average precision of 92.8% for the Surgery report, 92.7% for the Pathology report, 98.1% for the Biopsy report, and 81.8% for the Consultation report.
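For readers unfamiliar with the metric, the sketch below shows one way such figures could be computed. It assumes that precision is the share of extracted values that agree with manual review, and that the "average" is a plain mean over extracted fields. The abstract does not state the averaging scheme, so both points are assumptions, and the field names and values are toy data, not results from the study.

```python
from statistics import mean


def field_precision(extracted_values, reference_values):
    """Precision for one field: correct extractions / all extractions made."""
    # Documents where the tool returned nothing (None) are not predictions,
    # so they do not count for or against precision.
    pairs = [(e, r) for e, r in zip(extracted_values, reference_values) if e is not None]
    if not pairs:
        return None
    return sum(e == r for e, r in pairs) / len(pairs)


def mean_field_precision(extracted_by_field, reference_by_field):
    """Unweighted mean of per-field precision (assumed averaging scheme)."""
    scores = [
        field_precision(extracted_by_field[name], reference_by_field[name])
        for name in extracted_by_field
    ]
    scores = [score for score in scores if score is not None]
    return mean(scores) if scores else None


if __name__ == "__main__":
    # Toy example: four reports, two fields.
    extracted = {
        "procedure": ["mastectomy", "lumpectomy", "mastectomy", None],
        "laterality": ["right", "left", "left", "right"],
    }
    reference = {
        "procedure": ["mastectomy", "lumpectomy", "lumpectomy", "mastectomy"],
        "laterality": ["right", "left", "right", "right"],
    }
    print(f"{mean_field_precision(extracted, reference):.1%}")
```

Precision alone says nothing about values the tool failed to extract. In this sketch, a missed value is simply skipped rather than penalized.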
These results show that the automatic approach has the potential to effectively extract clinical knowledge from an extensive set of electronic medical records, reducing the manual effort required and saving a significant amount of time. Three changes could further improve the results: deeper semantic analysis with better understanding of context in the text; training on a larger and more recent set of reports, including those containing highly variable entities; and the use of ontologies.