Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy.
Department of Electrical and Information Engineering, Politecnico of Bari, Bari, Italy.
Sci Rep. 2021 Dec 10;11(1):23823. doi: 10.1038/s41598-021-03204-z.
The unstructured nature of real-world (RW) data from onco-hematological patients and limited access to integrated systems restrict the use of RW information for research. Natural Language Processing (NLP) might help convert unstructured reports into standardized electronic health records. We used NLP to develop an automated tool, ARGO (Automatic Record Generator for Onco-hematology), that recognizes information in pathology reports and populates electronic case report forms (eCRFs) pre-implemented in REDCap. ARGO was applied to hemo-lymphopathology reports of diffuse large B-cell, follicular, and mantle cell lymphomas, and assessed for accuracy (A), precision (P), recall (R) and F1-score (F) on an internal (n = 239) and an external (n = 93) report series. Of the 332 reports, 326 (98.2%) were converted into corresponding eCRFs. Overall, ARGO showed high performance in capturing (1) report identification number (all metrics > 90%), (2) biopsy date (all metrics > 90% in both series), (3) specimen type (A of 86.6% and 91.4%, P of 98.5% and 100.0%, R of 87.2% and 91.4%, and F of 92.5% and 95.5% for the internal and external series, respectively), and (4) diagnosis (P of 100%, with A, R and F of 90% in both series). We developed and validated a generalizable tool that generates structured eCRFs from real-life pathology reports.
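For readers who want to check how the four reported metrics relate to one another, the following is a minimal sketch using the standard textbook definitions of accuracy, precision, recall and F1-score. It is not the authors' evaluation code, and the abstract does not state how true and false positives and negatives were counted per field; the function names `classification_metrics` and `f1_from_precision_recall` are illustrative assumptions. The only figures used are the specimen-type precision and recall values reported above.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1-score: harmonic mean of precision and recall (fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard definitions from confusion-matrix counts.

    Assumption: this counting scheme is generic; the abstract does not
    describe how the study assigned TP/FP/FN/TN for each extracted field.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Consistency check with the specimen-type figures from the abstract:
# precision and recall are taken from the text, F1 is recomputed.
internal_f1 = f1_from_precision_recall(0.985, 0.872)
external_f1 = f1_from_precision_recall(1.000, 0.914)
print(f"internal F1 = {internal_f1 * 100:.1f}%")
print(f"external F1 = {external_f1 * 100:.1f}%")
```

Recomputing F1 from the reported precision and recall gives values that round to the 92.5% and 95.5% stated for the internal and external series, so the specimen-type figures are internally consistent under the standard definition.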