Benedum Corey M, Sondhi Arjun, Fidyk Erin, Cohen Aaron B, Nemeth Sheila, Adamson Blythe, Estévez Melissa, Bozkurt Selen
Flatiron Health, Inc., 233 Spring Street, New York, NY 10003, USA.
Department of Medicine, NYU Grossman School of Medicine, New York, NY 10016, USA.
Cancers (Basel). 2023 Mar 20;15(6):1853. doi: 10.3390/cancers15061853.
Meaningful real-world evidence (RWE) generation requires unstructured data found in electronic health records (EHRs), which are often missing from administrative claims; however, obtaining relevant data from unstructured EHR sources is resource-intensive. In response, researchers are using natural language processing (NLP) with machine learning (ML) techniques to extract real-world data (RWD) at scale. This study assessed the quality and fitness-for-use of EHR-derived oncology data curated using NLP with ML, as compared to the reference standard of expert abstraction. Using a sample of 186,313 patients with lung cancer from a nationwide EHR-derived de-identified database, we performed a series of replication analyses representative of those commonly conducted in retrospective observational research to generate evidence from complex EHR-derived data. Eligible patients were selected into biomarker- and treatment-defined cohorts, first with expert-abstracted data and then with ML-extracted data. We used the biomarker-defined cohorts to analyze biomarker-associated survival and the treatment-defined cohorts to analyze comparative treatment effectiveness. Across all analyses, the results differed by less than 8% between the two data curation methods, and similar conclusions were reached. These results indicate that variables extracted by high-performance ML models trained on expert-abstracted data can yield results similar to those obtained with abstracted data, unlocking the ability to perform oncology research at scale.