Weakly Semi-supervised phenotyping using Electronic Health records.

Affiliations

Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Publication information

J Biomed Inform. 2022 Oct;134:104175. doi: 10.1016/j.jbi.2022.104175. Epub 2022 Sep 5.

DOI:10.1016/j.jbi.2022.104175

PMID:36064111

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10112494/

Abstract

OBJECTIVE

Electronic Health Record (EHR) based phenotyping is a crucial yet challenging problem in the biomedical field. Though clinicians typically determine patient-level diagnoses via manual chart review, the sheer volume and heterogeneity of EHR data render such tasks challenging, time-consuming, and prohibitively expensive, leading to a scarcity of clinical annotations in EHRs. Weakly supervised learning algorithms have been successfully applied to various EHR phenotyping problems, owing to their ability to leverage information from large quantities of unlabeled samples to better inform predictions based on a far smaller number of patients. However, most weakly supervised methods face the challenge of choosing the right cutoff value to generate an optimal classifier. Furthermore, since they only utilize the most informative features (i.e., main ICD and NLP counts), they may fail for episodic phenotypes that cannot be consistently detected via ICD and NLP data. In this paper, we propose a label-efficient, weakly semi-supervised deep learning algorithm for EHR phenotyping (WSS-DL), which overcomes the limitations above.

MATERIALS AND METHODS

WSS-DL classifies patient-level disease status through a series of learning stages: 1) generating silver standard labels, 2) deriving enhanced-silver-standard labels by fitting a weakly supervised deep learning model to data with silver standard labels as outcomes and high dimensional EHR features as input, and 3) obtaining the final prediction score and classifier by fitting a supervised learning model to data with a minimal number of gold standard labels as the outcome, and the enhanced-silver-standard labels and a minimal set of most informative EHR features as input. To assess the generalizability of WSS-DL across different phenotypes and medical institutions, we apply WSS-DL to classify a total of 17 diseases, including both acute and chronic conditions, using EHR data from three healthcare systems. Additionally, we determine the minimum quantity of training labels required by WSS-DL to outperform existing supervised and semi-supervised phenotyping methods.
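The three learning stages described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the synthetic cohort, the silver-label cutoff of 0.75, the function names, and all parameter values are assumptions made for illustration, and a plain logistic regression stands in for the weakly supervised deep learning model of stage 2 so that the example runs with the Python standard library alone.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, epochs=200, lr=0.5):
    # Batch gradient descent on the logistic loss; the bias is stored last.
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(w[j] * xi[j] for j in range(d)) + w[d]) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict_proba(w, X):
    d = len(w) - 1
    return [sigmoid(sum(w[j] * x[j] for j in range(d)) + w[d]) for x in X]


def auc(scores, labels):
    # Probability that a random case outranks a random control (ties count 0.5).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def run_wss_sketch(n=400, p=10, n_gold=40, seed=0):
    rng = random.Random(seed)
    # Synthetic cohort (hypothetical): true status, one "main" surrogate
    # (think of a log ICD count), and p further EHR features.
    truth = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    main = [rng.gauss(1.5 * y, 1.0) for y in truth]
    feats = [[rng.gauss(0.8 * y, 1.0) for _ in range(p)] for y in truth]

    # Stage 1: silver-standard labels from the main surrogate (assumed cutoff).
    silver = [1 if m > 0.75 else 0 for m in main]

    # Stage 2: fit on ALL patients with silver labels as the outcome; the
    # predicted probabilities play the role of enhanced-silver-standard labels.
    # (Stand-in model: logistic regression instead of a deep network.)
    w_weak = fit_logistic(feats, silver)
    enhanced = predict_proba(w_weak, feats)

    # Stage 3: supervised fit on a small gold-labeled subset, using the
    # enhanced label plus the main surrogate as the only inputs.
    gold_idx = rng.sample(range(n), n_gold)
    gold = set(gold_idx)
    X3 = [[e, m] for e, m in zip(enhanced, main)]
    w_final = fit_logistic([X3[i] for i in gold_idx],
                           [truth[i] for i in gold_idx])
    scores = predict_proba(w_final, X3)

    # Evaluate only on patients whose gold labels were not used for fitting.
    rest = [i for i in range(n) if i not in gold]
    held_out_auc = auc([scores[i] for i in rest], [truth[i] for i in rest])
    return scores, held_out_auc


if __name__ == "__main__":
    scores, held_out_auc = run_wss_sketch()
    print(f"held-out AUC with 40 gold labels: {held_out_auc:.3f}")
```

The point of the design is that stage 2 sees every patient but never a gold label, while stage 3 sees gold labels but only two inputs, which is what lets a few dozen labels suffice. In this sketch the stage 2 inputs exclude the main surrogate so that the model cannot simply relearn the stage 1 cutoff; whether the original method does the same is not stated in the abstract.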

RESULTS

The proposed method, by combining the strengths of deep learning and weakly semi-supervised learning, successfully leverages the crucial phenotyping information contained in EHR features from unlabeled samples. Indeed, the deep learning model's ability to handle high-dimensional EHR features allows it to generate strong phenotype status predictions from silver standard labels. These predictions, in turn, provide highly effective features in the final logistic regression stage, leading to high phenotyping accuracy in notably small subsets of labeled data (e.g., n = 40 labeled samples).

CONCLUSION

Our method's high performance in EHR datasets with very small numbers of labels indicates its potential value in helping doctors diagnose rare diseases as well as conditions susceptible to misdiagnosis.
