sureLDA: A multidisease automated phenotyping method for the electronic health record.

Affiliations

Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA.

Harvard Medical School, Boston, Massachusetts, USA.

Publication information

J Am Med Inform Assoc. 2020 Aug 1;27(8):1235-1243. doi: 10.1093/jamia/ocaa079.

DOI:10.1093/jamia/ocaa079

PMID:32548637

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7481024/

Abstract

OBJECTIVE

A major bottleneck hindering utilization of electronic health record data for translational research is the lack of precise phenotype labels. Chart review as well as rule-based and supervised phenotyping approaches require laborious expert input, hampering applicability to studies that require many phenotypes to be defined and labeled de novo. Though International Classification of Diseases codes are often used as surrogates for true labels in this setting, these sometimes suffer from poor specificity. We propose a fully automated topic modeling algorithm to simultaneously annotate multiple phenotypes.

MATERIALS AND METHODS

Surrogate-guided ensemble latent Dirichlet allocation (sureLDA) is a label-free multidimensional phenotyping method. It first uses the PheNorm algorithm to initialize probabilities based on 2 surrogate features for each target phenotype, and then leverages these probabilities to constrain the LDA topic model to generate phenotype-specific topics. Finally, it combines phenotype-feature counts with surrogates via clustering ensemble to yield final phenotype probabilities.
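As a rough illustration of the first two steps only, the Python sketch below turns a surrogate count into a prior probability and then uses that prior as a patient-specific weight in a small collapsed Gibbs sampler for LDA. It is a sketch under stated assumptions, not the authors' implementation: the function names `surrogate_prior` and `guided_lda`, the simulated data, the single background topic, and the hyperparameter `beta` are all hypothetical, the Gaussian mixture is a simplified stand-in for PheNorm, and the final clustering ensemble step is omitted.

```python
import numpy as np


def surrogate_prior(surrogate, utilization, n_iter=50):
    """Simplified stand-in for the PheNorm step (an assumption, not PheNorm itself).

    Fits a two-component Gaussian mixture by EM to a utilization-normalized
    log surrogate count and returns each patient's posterior probability of
    belonging to the higher-mean component.
    """
    z = np.log1p(surrogate) - np.log1p(utilization)
    mu = np.array([z.min(), z.max()], dtype=float)  # crude initialization
    sigma = np.full(2, z.std() + 1e-6)
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each patient
        dens = weight * np.exp(-0.5 * ((z[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: update mixture parameters
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * z[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (z[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        weight = nk / nk.sum()
    return resp[:, np.argmax(mu)]


def guided_lda(counts, topic_prior, beta=0.1, n_sweeps=20, seed=0):
    """Collapsed Gibbs sampler for LDA with a patient-specific topic prior.

    counts:      (patients x features) integer count matrix
    topic_prior: (patients x topics) nonnegative weights; a small weight makes
                 the corresponding topic unlikely for that patient
    Returns patient-topic and topic-feature count matrices.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_feats = counts.shape
    n_topics = topic_prior.shape[1]
    # Expand the count matrix into one (patient, feature) pair per token
    docs, feats = np.nonzero(counts)
    reps = counts[docs, feats]
    d_tok, w_tok = np.repeat(docs, reps), np.repeat(feats, reps)
    z = rng.integers(n_topics, size=len(d_tok))
    ndk = np.zeros((n_docs, n_topics))
    nkw = np.zeros((n_topics, n_feats))
    np.add.at(ndk, (d_tok, z), 1)
    np.add.at(nkw, (z, w_tok), 1)
    nk = nkw.sum(axis=1)
    for _ in range(n_sweeps):
        for i in range(len(d_tok)):
            d, w, k = d_tok[i], w_tok[i], z[i]
            # Remove the token, resample its topic, then add it back
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            p = (ndk[d] + topic_prior[d]) * (nkw[:, w] + beta) / (nk + n_feats * beta)
            k = rng.choice(n_topics, p=p / p.sum())
            z[i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw


# Simulated toy data (hypothetical): one phenotype, six features
rng = np.random.default_rng(1)
n_patients = 100
has_pheno = rng.random(n_patients) < 0.3
utilization = rng.poisson(20, n_patients) + 1
surrogate = rng.poisson(utilization * np.where(has_pheno, 0.5, 0.02))
# Features 0-2 are phenotype-related, features 3-5 are background noise
rate = np.where(has_pheno[:, None], [3, 3, 3, 1, 1, 1], [0.2, 0.2, 0.2, 1, 1, 1])
counts = rng.poisson(rate)

prior = surrogate_prior(surrogate, utilization)
# Topic 0 is a background topic open to everyone; topic 1 is the phenotype topic
topic_prior = np.column_stack([np.ones(n_patients), prior + 1e-3])
ndk, nkw = guided_lda(counts, topic_prior)
```

In this toy setup, `ndk[:, 1]` holds the number of feature counts assigned to the phenotype topic for each patient, which is the kind of phenotype-feature count that the final step would combine with the surrogates.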

RESULTS

sureLDA achieves reliably high accuracy and precision across a range of simulated and real-world phenotypes. Its performance is robust to phenotype prevalence and relative informativeness of surrogate vs nonsurrogate features. It also exhibits powerful feature selection properties.

DISCUSSION

sureLDA combines attractive properties of PheNorm and LDA to achieve high accuracy and precision robust to diverse phenotype characteristics. It offers particular improvement for phenotypes insufficiently captured by a few surrogate features. Moreover, sureLDA's feature selection ability enables it to handle high feature dimensions and produce interpretable computational phenotypes.

CONCLUSIONS

sureLDA is well suited to large-scale electronic health record phenotyping for highly multiphenotype applications such as phenome-wide association studies.
