POPDx: an automated framework for patient phenotyping across 392 246 individuals in the UK Biobank study.

Affiliations

Department of Bioengineering, Stanford University, Stanford, CA, USA.

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA.

Publication information

J Am Med Inform Assoc. 2023 Jan 18;30(2):245-255. doi: 10.1093/jamia/ocac226.

DOI:10.1093/jamia/ocac226

PMID:36469791

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9846671/

Abstract

OBJECTIVE

For the UK Biobank, standardized phenotype codes are associated with patients who have been hospitalized but are missing for many patients who have been treated exclusively in an outpatient setting. We describe a method for phenotype recognition that imputes phenotype codes for all UK Biobank participants.

MATERIALS AND METHODS

POPDx (Population-based Objective Phenotyping by Deep Extrapolation) is a bilinear machine learning framework for simultaneously estimating the probabilities of 1538 phenotype codes. We extracted phenotypic and health-related information of 392 246 individuals from the UK Biobank for POPDx development and evaluation. A total of 12 803 ICD-10 diagnosis codes of the patients were converted to 1538 phecodes as gold standard labels. The POPDx framework was evaluated and compared to other available methods on automated multiphenotype recognition.
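As a rough illustration of what "bilinear" can mean for multi-label phenotype scoring, the sketch below computes a probability for every patient-phenotype pair as sigmoid(x^T W e), where x is a patient feature vector and e is a phenotype embedding. This is a generic NumPy sketch under stated assumptions, not the POPDx implementation: the names `bilinear_probabilities`, `X`, `W`, and `E`, the toy dimensions, and the random values are all hypothetical, and the abstract does not specify how features or phenotype embeddings are built.

```python
import numpy as np

def bilinear_probabilities(X, W, E):
    """Score every (patient, phenotype) pair with a bilinear form.

    X: (n_patients, n_features) patient feature matrix (hypothetical)
    W: (n_features, embed_dim) weight matrix that would be learned
    E: (n_phenotypes, embed_dim) phenotype embeddings (hypothetical)
    Returns an (n_patients, n_phenotypes) matrix of sigmoid(x^T W e).
    """
    logits = X @ W @ E.T
    return 1.0 / (1.0 + np.exp(-logits))

# Toy sizes for illustration only; the paper works with 1538 phecodes.
rng = np.random.default_rng(0)
n_patients, n_features, embed_dim, n_phenotypes = 4, 6, 3, 5
X = rng.normal(size=(n_patients, n_features))
W = rng.normal(size=(n_features, embed_dim))
E = rng.normal(size=(n_phenotypes, embed_dim))

probs = bilinear_probabilities(X, W, E)

# A phenotype with no training labels can still be scored as long as it
# has an embedding, because W is shared across all phenotypes.
e_new = rng.normal(size=(1, embed_dim))
probs_new = bilinear_probabilities(X, W, e_new)
```

Because the weights are shared across phenotypes in a form like this, scoring a phenotype depends only on having an embedding for it, which is one plausible way a model could produce estimates for codes that are rare or absent in training.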

RESULTS

POPDx can predict phenotypes that are rare or even unobserved in training. We demonstrate substantial improvement of automated multiphenotype recognition across 22 disease categories, and its application in identifying key epidemiological features associated with each phenotype.

CONCLUSIONS

POPDx helps provide well-defined cohorts for downstream studies. It is a general-purpose method that can be applied to other biobanks with diverse but incomplete data.

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a884/9846671/394f67dd1fa4/ocac226f1.jpg
