Postprediction Inference for Clinical Characteristics Extracted With Machine Learning on Electronic Health Records.

Affiliations

Flatiron Health, Inc, New York City, NY.

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD.

Publication information

JCO Clin Cancer Inform. 2023 May;7:e2200174. doi: 10.1200/CCI.22.00174.

Abstract

PURPOSE

Real-world data (RWD) derived from electronic health records (EHRs) are often used to understand population-level relationships between patient characteristics and cancer outcomes. Machine learning (ML) methods enable researchers to extract characteristics from unstructured clinical notes, and represent a more cost-effective and scalable approach than manual expert abstraction. These extracted data are then used in epidemiologic or statistical models as if they were abstracted observations. Analytical results derived from extracted data in this way may differ from those given by abstracted data, and the magnitude of this difference is not directly informed by standard ML performance metrics.

METHODS

In this paper, we define the task of postprediction inference: recovering, from an ML-extracted variable, estimation and inference similar to what would be obtained by abstracting the variable. We consider fitting a Cox proportional hazards model that uses a binary ML-extracted variable as a covariate and evaluate four approaches for postprediction inference in this setting. The first two approaches require only the ML-predicted probability, while the latter two additionally require a labeled (human-abstracted) validation data set.
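The setting described above can be illustrated with a small simulation. The sketch below is not one of the four approaches evaluated in the paper (the abstract does not name them); it only shows the naive baseline, in which a Cox model is fit with a misclassified binary covariate as if it were the abstracted value. All numbers (sample size, true log hazard ratio, prevalence, sensitivity, specificity, censoring rate) are assumptions chosen for illustration, and `cox_beta` is a hypothetical helper written for this example with NumPy only.

```python
import numpy as np


def cox_beta(time, event, x, iters=25):
    """Newton-Raphson fit of a one-covariate Cox model.

    Hypothetical helper for illustration; assumes no tied event times.
    """
    # Sort by descending time so that a running sum up to position i
    # covers everyone still at risk at the i-th subject's time.
    order = np.argsort(-time)
    d = event[order].astype(float)
    z = x[order].astype(float)
    beta = 0.0
    for _ in range(iters):
        w = np.exp(beta * z)
        s0 = np.cumsum(w)           # sum of weights over the risk set
        s1 = np.cumsum(w * z)       # weighted sum of the covariate
        s2 = np.cumsum(w * z * z)   # weighted sum of the squared covariate
        mean = s1 / s0
        score = np.sum(d * (z - mean))               # partial-likelihood score
        info = np.sum(d * (s2 / s0 - mean ** 2))     # observed information
        beta += score / info
    return beta


rng = np.random.default_rng(0)
n = 20_000
true_beta = 0.7                 # assumed true log hazard ratio
sens, spec = 0.85, 0.90         # assumed ML sensitivity and specificity

# "Abstracted" (true) binary characteristic with assumed prevalence 0.4.
x = rng.binomial(1, 0.4, size=n)

# Exponential event times with hazard exp(true_beta * x), plus
# independent exponential censoring.
t_event = rng.exponential(1.0 / np.exp(true_beta * x))
t_cens = rng.exponential(2.0, size=n)
time = np.minimum(t_event, t_cens)
event = t_event <= t_cens

# "ML-extracted" variable: the true value corrupted by misclassification.
u = rng.random(n)
x_hat = np.where(x == 1, u < sens, u < 1.0 - spec).astype(int)

beta_abstracted = cox_beta(time, event, x)
beta_extracted = cox_beta(time, event, x_hat)
print(f"abstracted covariate:   beta = {beta_abstracted:.3f}")
print(f"ML-extracted covariate: beta = {beta_extracted:.3f}")
```

Under these assumptions, the estimate from the extracted covariate is pulled toward zero relative to the estimate from the abstracted covariate. The size of that gap depends on prevalence and on the outcome model as well as on sensitivity and specificity, which is one way to see why standard ML performance metrics alone do not determine it.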

RESULTS

Our results for both simulated data and EHR-derived RWD from a national cohort demonstrate that we can improve inference from ML-extracted variables by leveraging a limited amount of labeled data.

CONCLUSION

We describe and evaluate methods for fitting statistical models using ML-extracted variables subject to model error. We show that estimation and inference are generally valid when using extracted data from high-performing ML models. More complex methods that incorporate auxiliary labeled data provide further improvements.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3360/10281422/8cdcc4cad21f/cci-7-e2200174-g001.jpg
