Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN.
Department of Hematology/Oncology, Mayo Clinic, Scottsdale, AZ.
JCO Clin Cancer Inform. 2022 Jul;6:e2200006. doi: 10.1200/CCI.22.00006.
Advances in natural language processing (NLP) have promoted the use of detailed textual data in electronic health records (EHRs) to support cancer research and facilitate patient care. In this review, we assess the use of EHR data for cancer research and patient care against the Minimal Common Oncology Data Elements (mCODE), a community-driven effort to define a minimal set of data elements for cancer research and practice. Specifically, we assess how well NLP-extracted data elements align with mCODE and review the existing NLP methodologies used to extract those data elements.
Major literature databases were searched for cancer-related NLP articles written in English and published between January 2010 and September 2020. After retrieval, articles that used EHRs as the data source were identified manually. A charting form was developed to analyze the relevant studies and to categorize data under four main topics: metadata, EHR data and targeted cancer types, NLP methodology, and oncology data elements and standards.
A total of 123 publications were selected and included in our analysis. We found that, as expected, cancer research and patient care require some data elements beyond mCODE. The transparency and reproducibility of NLP methods are insufficient, and NLP evaluation is inconsistent.
We conducted a comprehensive review of cancer NLP for research and patient care using EHR data. Issues and barriers to the wide adoption of cancer NLP were identified and discussed.