Digital, Informatics and Technology Solutions, Memorial Sloan Kettering Cancer Center, New York, New York.
Department of Translational Informatics, Memorial Sloan Kettering Cancer Center, New York, New York.
Cancer Res Commun. 2024 Apr 9;4(4):1041-1049. doi: 10.1158/2767-9764.CRC-24-0064.
Cancer research depends on accurate and relevant information about a patient's medical journey. Data in radiology reports are extremely valuable but lack the consistent structure needed for direct use in analytics. At Memorial Sloan Kettering Cancer Center (MSKCC), radiology reports are curated using the gold-standard approach of human annotation. However, manually curating large volumes of retrospective data slows the pace of cancer research. The manual curation process is sensitive to the volume of reports, the number of data elements, and the nature of the reports, and it demands an appropriate skillset. In this work, we explore state-of-the-art methods in artificial intelligence (AI) and implement an end-to-end pipeline for fast and accurate annotation of radiology reports. Language models (LM) are trained on curated data by framing curation as a multiclass or multilabel classification problem. The classification tasks are to predict multiple imaging scan sites, the presence of cancer, and cancer status from the reports. The trained natural language processing (NLP) model classifiers achieve high weighted F1 scores and accuracy. We propose and demonstrate the use of these models to assist the manual curation process, which results in higher accuracy and F1 score with less time and cost, thus advancing cancer research.
Manually extracting structured data from radiology reports for cancer research is laborious. Extracting data elements with the assistance of AI-based NLP models is faster and more accurate.
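The weighted F1 score named in the abstract can be illustrated with a minimal sketch. It assumes the standard definition: per-class F1 averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the cancer-status labels below are hypothetical, chosen only for illustration; they are not taken from the paper, and this is not the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (multiclass)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        score += (n / total) * f1
    return score


# Hypothetical cancer-status labels, for illustration only.
y_true = ["stable", "stable", "progressing", "improving"]
y_pred = ["stable", "progressing", "progressing", "improving"]
print(round(weighted_f1(y_true, y_pred), 4))  # prints 0.75
```

Weighting by support keeps a rare class from dominating the average, which matters when report categories are imbalanced.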