Automated extraction of information from free text of Spanish oncology pathology reports.

Affiliations

Universidad Nacional de Colombia, Facultad de Medicina, Departamento de Patología, Bogotá, Colombia.

Quantil SAS, Bogotá, Colombia.

Publication information

Colomb Med (Cali). 2023 Mar 30;54(1):e2035300. doi: 10.25100/cm.v54i1.5300. eCollection 2023 Jan-Mar.

Abstract

BACKGROUND

Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text, with linguistic variability among pathologists. For this reason, extracting tumor information requires significant human effort. Recording data in an efficient, high-quality format is essential for implementing and establishing a hospital-based cancer registry.

OBJECTIVE

This study aimed to describe the implementation of a natural language processing algorithm for oncology pathology reports.

METHODS

An algorithm was developed to process oncology pathology reports in Spanish and extract 20 medical descriptors. The approach is based on successive matching of regular expressions.
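As a rough illustration of successive regular-expression matching, the sketch below tries an ordered list of patterns for each descriptor and keeps the first hit. The patterns, the `RULES` table, and the `extract_descriptors` function are hypothetical and cover only two descriptors. They are not the rules used in the study, which the abstract does not list.

```python
import re

# Hypothetical patterns for illustration only; the study's actual rules
# are not given in the abstract. Patterns are tried in order, from most
# specific to most general.
RULES = [
    ("topography", [
        re.compile(
            r"\b(?:biopsia|resecci[oó]n)\s+de\s+(?P<value>[a-záéíóúñ ]+?)(?=[.,:;]|$)",
            re.IGNORECASE,
        ),
        re.compile(
            r"\b(?P<value>mama|pr[oó]stata|colon|est[oó]mago|pulm[oó]n)\b",
            re.IGNORECASE,
        ),
    ]),
    ("morphology", [
        re.compile(
            r"\b(?P<value>(?:adeno)?carcinoma(?:\s+[a-záéíóúñ]+){0,3})",
            re.IGNORECASE,
        ),
    ]),
]


def extract_descriptors(report):
    """Return one value per descriptor, or None when no pattern matches."""
    results = {}
    for name, patterns in RULES:
        results[name] = None
        for pattern in patterns:
            match = pattern.search(report)
            if match:
                # Keep the first successful pattern and stop trying others.
                results[name] = match.group("value").strip().lower()
                break
    return results


if __name__ == "__main__":
    text = "Biopsia de mama derecha. Diagnóstico: carcinoma ductal infiltrante."
    print(extract_descriptors(text))
```

Ordering the patterns from specific to general lets a precise rule take priority, while a looser fallback still catches reports that are phrased differently.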

RESULTS

Validation was performed on 140 pathology reports. Topography was identified in all reports, both manually and by the algorithm. Morphology was identified manually in 138 reports and by the algorithm in 137. The average fuzzy matching score was 68.3 for topography and 89.5 for morphology.
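The abstract does not say which fuzzy matching function produced these scores. As one possible reading, the sketch below computes a 0 to 100 similarity between a manually extracted string and an algorithm-extracted string using Python's standard `difflib.SequenceMatcher`, then averages over pairs. The function names and the normalization step are assumptions made for illustration, so the output would not necessarily reproduce the reported values.

```python
from difflib import SequenceMatcher


def fuzzy_score(manual, automatic):
    """Similarity between two strings on a 0-100 scale.

    Assumption: the score is a normalized edit-style ratio. The study's
    exact scoring function is not stated in the abstract.
    """
    # Normalize case and whitespace so trivial differences are ignored.
    a = " ".join(manual.lower().split())
    b = " ".join(automatic.lower().split())
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def average_score(pairs):
    """Mean fuzzy score over (manual, automatic) pairs; 0.0 if empty."""
    scores = [fuzzy_score(manual, automatic) for manual, automatic in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical example pairs, not data from the study.
    pairs = [
        ("mama derecha", "Mama  derecha"),
        ("colon sigmoide", "colon"),
    ]
    print(average_score(pairs))
```

A per-descriptor average like this rewards partial agreement, which is useful when the manual and automatic extractions differ only in wording or level of detail.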

CONCLUSIONS

A preliminary validation of the algorithm against human extraction was performed on a small set of reports, with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject.

