Universidad Nacional de Colombia, Facultad de Medicina, Departamento de Patología, Bogotá, Colombia.
Quantil SAS, Bogotá, Colombia.
Colomb Med (Cali). 2023 Mar 30;54(1):e2035300. doi: 10.25100/cm.v54i1.5300. eCollection 2023 Jan-Mar.
Pathology reports are stored as unstructured, ungrammatical, fragmented, and abbreviated free text, with linguistic variability among pathologists. For this reason, extracting tumor information requires significant human effort. Recording data in an efficient, high-quality format is essential for implementing and establishing a hospital-based cancer registry.
This study aimed to describe the implementation of a natural language processing algorithm for oncology pathology reports.
An algorithm was developed to process oncology pathology reports in Spanish and extract 20 medical descriptors. The approach is based on the successive matching of regular expressions.
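As an illustration of successive regular-expression matching, the sketch below first narrows a report to its diagnosis section and then tries an ordered list of patterns for each descriptor, keeping the first hit. It is a minimal sketch, not the authors' implementation: the abstract does not give the actual expressions or the 20 descriptors, so the patterns, the `DESCRIPTORS` table, and the `extract` function are all hypothetical.

```python
import re

# Hypothetical section pattern: the real expressions are not given in the abstract.
SECTION = re.compile(r"diagn[oó]stico\s*:\s*(?P<body>.+)", re.IGNORECASE | re.DOTALL)

# Hypothetical descriptor patterns, ordered from most to least specific.
DESCRIPTORS = {
    "morphology": [
        re.compile(r"\b(adenocarcinoma|carcinoma\s+\w+|linfoma|melanoma)\b", re.IGNORECASE),
    ],
    "topography": [
        re.compile(r"\bde\s+(mama|colon|est[oó]mago|pr[oó]stata|piel)\b", re.IGNORECASE),
        re.compile(r"\b(mama|colon|est[oó]mago|pr[oó]stata|piel)\b", re.IGNORECASE),
    ],
}


def extract(report: str) -> dict:
    """Return one value per descriptor, or None when no pattern matches."""
    # Step 1: restrict the search to the diagnosis section when one is present.
    section = SECTION.search(report)
    body = section.group("body") if section else report

    # Step 2: for each descriptor, try its patterns in order and stop at the first match.
    result = {}
    for name, patterns in DESCRIPTORS.items():
        result[name] = None
        for pattern in patterns:
            hit = pattern.search(body)
            if hit:
                result[name] = hit.group(1).lower()
                break
    return result


report = (
    "Descripción macroscópica: fragmento de tejido. "
    "Diagnóstico: Adenocarcinoma de colon, moderadamente diferenciado."
)
print(extract(report))
```

Ordering the patterns from specific to general lets a strict expression take precedence while a looser one still catches abbreviated or fragmented phrasing.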
Validation was performed on 140 pathology reports. Topography was identified in all reports both by manual review and by the algorithm. Morphology was identified in 138 reports by manual review and in 137 by the algorithm. The average fuzzy matching score was 68.3 for topography and 89.5 for morphology.
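The fuzzy matching scores above compare each human-extracted value with the algorithm's output. The abstract does not say which similarity measure was used, so the sketch below is only an assumption: it uses the standard library's `difflib.SequenceMatcher` ratio, scaled to 0-100, and averages it over pairs. The names `fuzzy_score` and `average_score` and the sample pairs are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_score(human: str, algorithm: str) -> float:
    """Similarity on a 0-100 scale (assumed measure, not necessarily the paper's)."""
    a = human.strip().lower()
    b = algorithm.strip().lower()
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def average_score(pairs) -> float:
    """Mean fuzzy score over (human, algorithm) pairs; 0.0 for an empty list."""
    scores = [fuzzy_score(h, g) for h, g in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical pairs for illustration only; these are not data from the study.
pairs = [
    ("colon", "Colon"),
    ("mama derecha", "mama"),
]
print(round(average_score(pairs), 1))
```

A score of 100 means the two strings agree after trimming and lowercasing, so partial credit goes to outputs that differ only in added or missing qualifiers.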
A preliminary validation of the algorithm against human extraction was performed on a small set of reports, with satisfactory results. This shows that a regular-expression approach can accurately and precisely extract multiple specimen attributes from free-text Spanish pathology reports. Additionally, we developed a website to facilitate collaborative validation at a larger scale, which may be helpful for future research on the subject.