Analytical Epidemiology and Health Impact Unit, Fondazione IRCCS "Istituto Nazionale dei Tumori", Milan, Italy.
Institute of Electronics, Computer and Telecommunication Engineering (IEIIT), National Research Council of Italy (CNR), Milan, Italy.
J Biomed Inform. 2021 Apr;116:103712. doi: 10.1016/j.jbi.2021.103712. Epub 2021 Feb 18.
Pathology reports are a primary source of information for cancer registries. Hospitals routinely process high volumes of free-text reports, which are a valuable source of information on cancer diagnosis for improving clinical care and supporting research. Extracting and coding information from unstructured text is typically a manual, labour-intensive process, so there is a need for automated approaches that extract meaningful information from such texts reliably and accurately. In this scenario, Natural Language Processing (NLP) algorithms offer a unique opportunity to automatically encode unstructured reports into structured data, and thus represent a potentially powerful alternative to expensive manual processing. However, despite increasing interest in this area, few NLP approaches are yet available for pathology reports in languages other than English, including Italian. The aim of our work was to develop an automated, NLP-based algorithm able to identify and classify the morphological content of pathology reports in the Italian language with micro-averaged performance scores higher than 95%. Specifically, a novel, domain-specific classifier that uses linguistic rules was developed and tested on 27,239 pathology reports from a single Italian oncological centre, following the International Classification of Diseases for Oncology morphology classification standard (ICD-O-M). The proposed classification algorithm achieved a micro-F score of 98.14% on the 9,594 pathology reports in the test dataset. The algorithm relies on rules defined on data from a single hospital specifically dedicated to cancer, but it is based on general processing steps that can be applied to different datasets. Further research will be needed to demonstrate the generalizability of the proposed approach on a larger corpus from different hospitals.
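The abstract reports a micro-averaged F score but does not describe how it was computed. As an illustration only, the following Python sketch shows one common way to micro-average precision, recall and F1 by pooling true positives, false positives and false negatives across all reports before computing the ratios. It assumes each report carries a set of ICD-O-M codes; the function name `micro_prf` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-report code sets.

    Counts are pooled over all reports first, so frequent codes weigh
    more than rare ones (unlike macro-averaging).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of reports")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # codes correctly assigned
        fp += len(p - g)   # codes assigned but not in the gold standard
        fn += len(g - p)   # gold-standard codes that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy data invented for illustration; not results from the study.
gold = [{"8500/3"}, {"8140/3"}, {"8070/3"}, {"8500/3", "8140/3"}]
pred = [{"8500/3"}, {"8140/3"}, {"8140/3"}, {"8500/3"}]

precision, recall, f1 = micro_prf(gold, pred)
print(f"micro-P={precision:.4f} micro-R={recall:.4f} micro-F1={f1:.4f}")
```

When every report has exactly one gold code and exactly one predicted code, micro-averaged precision, recall and F1 all reduce to plain accuracy; they diverge only when reports can carry several codes or none.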