Robert Y. Lee, Kevin S. Li, James Sibley, Trevor Cohen, William B. Lober, Janaki O'Brien, Nicole LeDuc, Kasey Mallon Andrews, Anna Ungar, Jessica Walsh, Elizabeth L. Nielsen, Danae G. Dotolo, Erin K. Kross
Division of Pulmonary, Critical Care, and Sleep Medicine, University of Washington, Seattle, USA.
Cambia Palliative Care Center of Excellence at UW Medicine, University of Washington, Seattle, USA.
medRxiv. 2025 Jun 24:2025.06.23.25330134. doi: 10.1101/2025.06.23.25330134.
Natural language processing (NLP) allows efficient extraction of clinical variables and outcomes from electronic health records (EHR). However, measuring pragmatic clinical trial outcomes may demand accuracy that exceeds NLP performance. Combining NLP with human adjudication can address this gap, yet few software solutions support such workflows. We developed a modular, scalable system for NLP-screened human abstraction to measure the primary outcomes of two clinical trials.
In two clinical trials of hospitalized patients with serious illness, a deep-learning NLP model screened EHR passages for documented goals-of-care discussions. Screen-positive passages were referred for human adjudication using a REDCap-based system to measure the trial outcomes. Dynamic pooling of passages using structured query language (SQL) within the REDCap database reduced unnecessary abstraction while ensuring data completeness.
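The dynamic-pooling step described above can be sketched in miniature. The snippet below is a hypothetical illustration, not the authors' REDCap/SQL implementation: the table name `passages`, its columns, and the `THRESHOLD` value are all assumed for the example. It shows the two key ideas: screening passages at a sensitivity-calibrated NLP threshold, and pooling so that each patient's earliest unreviewed screen-positive passage is queued for adjudication while patients with a human-confirmed goals-of-care discussion (whose time-to-first outcome is already known) are skipped.

```python
import sqlite3

# Hypothetical screening threshold, calibrated for high sensitivity.
THRESHOLD = 0.15

def screen(passages, model_score, threshold=THRESHOLD):
    """Keep only passages whose NLP score meets the screening threshold.

    `model_score` stands in for the deep-learning NLP model: any callable
    mapping passage text to a probability-like score.
    """
    return [p for p in passages if model_score(p["text"]) >= threshold]

def next_passages_for_review(conn):
    """Dynamic pooling: for each patient, return the earliest screen-positive
    passage not yet adjudicated, skipping patients who already have a
    confirmed goals-of-care discussion. Later passages for those patients
    never reach the abstractors, which is what saves abstractor-hours.

    Relies on SQLite's documented behavior that a bare column alongside
    MIN() is taken from the row holding the minimum.
    """
    return conn.execute("""
        SELECT patient_id, MIN(note_datetime) AS note_datetime, passage_id
        FROM passages
        WHERE screen_positive = 1
          AND adjudicated = 0
          AND patient_id NOT IN (
              SELECT patient_id FROM passages
              WHERE adjudicated = 1 AND confirmed_goc = 1)
        GROUP BY patient_id
    """).fetchall()
```

In this design the abstraction queue shrinks as adjudications come in: confirming one early passage retires every later screen-positive passage for that patient, while a negative adjudication simply advances the queue to the patient's next passage.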
In the first trial (N=2,512), NLP identified 22,187 screen-positive passages (0.8%) from 2.6 million EHR passages. Human reviewers adjudicated 7,494 passages over 34.3 abstractor-hours to measure the cumulative incidence of, and time to, the first documented goals-of-care discussion for all patients, with 92.6% patient-level sensitivity. In the second trial (N=617), NLP identified 8,952 screen-positive passages (1.6%) from 559,596 passages at a threshold chosen for near-100% sensitivity. Human reviewers adjudicated 3,509 passages over 27.9 abstractor-hours to measure the same outcome for all patients.
We present the design and source code for a scalable and efficient pipeline for measuring complex EHR-derived outcomes using NLP-screened human abstraction. This implementation is adaptable to diverse research needs, and its modular pipeline represents a practical middle ground between custom software and commercial platforms.