Increasing the Density of Laboratory Measures for Machine Learning Applications.

Authors

Abedi Vida, Li Jiang, Shivakumar Manu K, Avula Venkatesh, Chaudhary Durgesh P, Shellenberger Matthew J, Khara Harshit S, Zhang Yanfei, Lee Ming Ta Michael, Wolk Donna M, Yeasin Mohammed, Hontecillas Raquel, Bassaganya-Riera Josep, Zand Ramin

Affiliations

Department of Molecular and Functional Genomics, Geisinger Health System, Danville, PA 17822, USA.

NIMML Institute, Blacksburg, VA 24060, USA.

Publication details

J Clin Med. 2020 Dec 30;10(1):103. doi: 10.3390/jcm10010103.

DOI:10.3390/jcm10010103
PMID:33396741
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7795258/
Abstract

BACKGROUND

The imputation of missingness is a key step in Electronic Health Records (EHR) mining, as it can significantly affect the conclusions derived from the downstream analysis in translational medicine. The missingness of laboratory values in EHR is not at random, yet imputation techniques tend to disregard this key distinction. Consequently, the development of an adaptive imputation strategy designed specifically for EHR is an important step in improving the data imbalance and enhancing the predictive power of modeling tools for healthcare applications.

METHOD

We analyzed the laboratory measures derived from Geisinger's EHR on patients in three distinct cohorts: patients tested for Clostridioides difficile (Cdiff) infection, patients with a diagnosis of inflammatory bowel disease (IBD), and patients with a diagnosis of hip or knee osteoarthritis (OA). We extracted Logical Observation Identifiers Names and Codes (LOINC) from which we excluded those with 75% or more missingness. The comorbidities, primary or secondary diagnosis, as well as active problem lists, were also extracted. The adaptive imputation strategy was designed based on a hybrid approach. The comorbidity patterns of patients were transformed into latent patterns and then clustered. Imputation was performed on a cluster of patients for each cohort independently to show the generalizability of the method. The results were compared with imputation applied to the complete dataset without incorporating the information from comorbidity patterns.
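
The steps described above (drop sparse laboratory codes, group patients by comorbidity pattern, impute within each group) can be illustrated with a minimal sketch. This is not the authors' implementation: the latent-pattern transformation is omitted, a tiny k-means on binary comorbidity indicators stands in for the clustering, and cluster-mean filling stands in for the imputation method. All function names, lab names, and data values are hypothetical.

```python
from statistics import mean

def drop_sparse_labs(labs, max_missing=0.75):
    """Keep lab columns whose missing (None) fraction is below max_missing."""
    n = len(labs)
    keep = [c for c in labs[0]
            if sum(r[c] is None for r in labs) / n < max_missing]
    return [{c: r[c] for c in keep} for r in labs]

def kmeans(points, k, iters=20):
    """Tiny deterministic k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[j])))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [mean(col) for col in zip(*members)]
    return labels

def impute_by_cluster(labs, labels):
    """Fill None with the cluster mean, falling back to the overall mean."""
    filled = [dict(r) for r in labs]
    for c in labs[0]:
        observed = [r[c] for r in labs if r[c] is not None]
        overall = mean(observed) if observed else None
        for j in set(labels):
            in_cluster = [r[c] for r, l in zip(labs, labels)
                          if l == j and r[c] is not None]
            fill = mean(in_cluster) if in_cluster else overall
            for r, l in zip(filled, labels):
                if l == j and r[c] is None:
                    r[c] = fill
    return filled

# Hypothetical toy data: 6 patients, 4 binary comorbidity flags, 3 lab codes.
comorbidities = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0],
                 [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1]]
labs = [
    {"crp": 10.0, "alb": 4.0, "rare": None},
    {"crp": 2.0, "alb": None, "rare": None},
    {"crp": None, "alb": 4.2, "rare": None},
    {"crp": None, "alb": 3.0, "rare": None},
    {"crp": 12.0, "alb": None, "rare": 1.0},
    {"crp": 4.0, "alb": 3.2, "rare": None},
]

filtered = drop_sparse_labs(labs)        # "rare" is 5/6 missing, so it is dropped
labels = kmeans(comorbidities, k=2)      # patients grouped by comorbidity profile
filled = impute_by_cluster(filtered, labels)
```

In this toy example, a missing value is filled from patients with a similar comorbidity profile rather than from the whole cohort, which is the intuition behind using clinical context during imputation.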

RESULTS

We analyzed a total of 67,445 patients (11,230 IBD patients, 10,000 OA patients, and 46,215 patients tested for Cdiff infection). We extracted 495 LOINC and 11,230 diagnosis codes for the IBD cohort, 8160 diagnosis codes for the Cdiff cohort, and 2042 diagnosis codes for the OA cohort based on the primary/secondary diagnosis and active problem list in the EHR. Overall, the most improvement from this strategy was observed when the laboratory measures had a higher level of missingness. The best root mean square error (RMSE) difference for each dataset was recorded as -35.5 for the Cdiff, -8.3 for the IBD, and -11.3 for the OA dataset.
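
As an illustration of how an RMSE difference of this kind could be computed, the sketch below masks known values, imputes them two ways, and subtracts the errors. The sign convention (cluster-informed minus baseline, so a negative value means lower error for the cluster-informed approach) is an assumption, as the abstract does not state it, and all numbers are hypothetical.

```python
from math import sqrt

def rmse(truth, predicted):
    """Root mean square error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth))

# Hypothetical lab values that were observed, then masked before imputation.
truth = [10.0, 12.0, 2.0, 4.0]

# Hypothetical imputations of the masked values.
baseline_imputed = [7.0, 7.0, 7.0, 7.0]    # e.g. one overall mean for all patients
cluster_imputed = [11.0, 11.0, 3.0, 3.0]   # e.g. one mean per comorbidity cluster

rmse_baseline = rmse(truth, baseline_imputed)
rmse_cluster = rmse(truth, cluster_imputed)

# Assumed convention: negative means the cluster-informed imputation has lower error.
rmse_difference = rmse_cluster - rmse_baseline
print(f"baseline={rmse_baseline:.3f} cluster={rmse_cluster:.3f} "
      f"difference={rmse_difference:.3f}")
```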

CONCLUSIONS

An adaptive imputation strategy designed specifically for EHR that uses complementary information from the clinical profile of the patient can be used to improve the imputation of missing laboratory values, especially when laboratory codes with high levels of missingness are included in the analysis.

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4402/7795258/958ad16666a7/jcm-10-00103-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4402/7795258/09b754f46120/jcm-10-00103-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4402/7795258/c0ab324ad860/jcm-10-00103-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4402/7795258/d823576e2b8c/jcm-10-00103-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4402/7795258/c4a4e4ad0d13/jcm-10-00103-g0A1.jpg


