Adamson Blythe, Waskom Michael, Blarre Auriane, Kelly Jonathan, Krismer Konstantin, Nemeth Sheila, Gippetti James, Ritten John, Harrison Katherine, Ho George, Linzmayer Robin, Bansal Tarun, Wilkinson Samuel, Amster Guy, Estola Evan, Benedum Corey M, Fidyk Erin, Estévez Melissa, Shapiro Will, Cohen Aaron B
Flatiron Health, Inc., New York, NY, United States.
The Comparative Health Outcomes, Policy, and Economics (CHOICE) Institute, Department of Pharmacy, University of Washington, Seattle, WA, United States.
Front Pharmacol. 2023 Sep 15;14:1180962. doi: 10.3389/fphar.2023.1180962. eCollection 2023.
As artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI's ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHR) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability. We applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (e.g., clinician notes, radiology reports, and lab reports) to output a set of structured variables required for RWD analysis. This research used a nationwide EHR-derived database. Models were selected based on performance. Variables curated with ML extraction are those whose value is determined solely by an ML model (i.e., not confirmed by abstraction) that identifies key information from visit notes and documents. These models do not predict future events or infer missing information. We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of the output variables compared with variables curated by manual abstraction. These extraction methods resulted in research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates. NLP and ML enable the extraction of retrospective clinical data from EHRs with speed and scalability to help researchers learn from the experience of every person with cancer.
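The abstract reports that ML-extracted variables performed well against manually abstracted ones but does not name the metrics used. As a minimal sketch only, the comparison for a single binary variable might look like the following, assuming sensitivity, positive predictive value (PPV), and F1 as the measures of agreement. The function name, the class name, and the toy values are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    """Agreement between ML-extracted and abstracted values (assumed metrics)."""
    sensitivity: float
    ppv: float
    f1: float


def compare_to_abstraction(
    ml_values: Sequence[bool],
    abstracted_values: Sequence[bool],
) -> BinaryMetrics:
    """Score ML-extracted values against manual abstraction as the reference.

    Hypothetical helper: treats the abstracted value as ground truth for
    one binary variable (e.g., "has advanced/metastatic diagnosis").
    """
    if len(ml_values) != len(abstracted_values):
        raise ValueError("inputs must contain one value per patient")

    tp = fp = fn = 0
    for ml, ref in zip(ml_values, abstracted_values):
        if ml and ref:
            tp += 1          # model and abstractor both positive
        elif ml and not ref:
            fp += 1          # model positive, abstractor negative
        elif ref and not ml:
            fn += 1          # model missed an abstracted positive

    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    ppv = tp / (tp + fp) if (tp + fp) else nan
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else nan
    return BinaryMetrics(sensitivity=sensitivity, ppv=ppv, f1=f1)


# Toy illustration only; these six values are made up, not study data.
toy_ml = [True, True, False, True, False, False]
toy_abstracted = [True, False, False, True, True, False]
toy_metrics = compare_to_abstraction(toy_ml, toy_abstracted)
```

Treating abstraction as the reference mirrors the abstract's framing, in which ML-extracted values are compared with manually abstracted ones; a real evaluation would also need held-out test data and per-variable handling of dates and multi-class fields.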