Missingness adapted group informed clustered (MAGIC)-LASSO: a novel paradigm for phenotype prediction to improve power for genetic loci discovery.

Authors

Gentry Amanda Elswick, Kirkpatrick Robert M, Peterson Roseann E, Webb Bradley T

Affiliations

Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, United States.

Department of Psychiatry and Behavioral Sciences, Institute for Genomics in Health, SUNY Downstate Health Sciences University, Brooklyn, NY, United States.

Publication information

Front Genet. 2023 Jul 20;14:1162690. doi: 10.3389/fgene.2023.1162690. eCollection 2023.

DOI: 10.3389/fgene.2023.1162690

PMID: 37547462

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10399453/

Abstract

The availability of large-scale biobanks linking genetic data, rich phenotypes, and biological measures is a powerful opportunity for scientific discovery. However, real-world collections frequently have extensive missingness. While missing data prediction is possible, performance is significantly impaired by the block-wise missingness inherent to many biobanks. To address this, we developed Missingness Adapted Group-wise Informed Clustered (MAGIC)-LASSO, which performs hierarchical clustering of variables based on missingness, followed by sequential Group LASSO within clusters. Variables are pre-filtered for missingness and for balance between training and target sets, with final models built using stepwise inclusion of features ranked by completeness. This research was conducted using the UK Biobank (>500k participants) to predict unmeasured Alcohol Use Disorders Identification Test (AUDIT) scores. The phenotypic correlation between measured and predicted total score was 0.67, while genetic correlations between independent subjects were high (>0.86). Phenotypic and genetic correlations in the real-data application, as well as in simulations, demonstrate that the method has significant accuracy and utility for increasing power for genetic loci discovery.
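
The two core steps named in the abstract, clustering variables by missingness and then fitting a Group LASSO within each cluster, can be illustrated with a minimal sketch. This is not the authors' implementation. The Hamming distance, the average linkage, the proximal-gradient solver, the penalty weight `lam`, the function names, and the `groups` vector are all assumptions chosen for illustration, and the pre-filtering and stepwise-inclusion steps are omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_by_missingness(X, n_clusters):
    """Label each column of X by hierarchical clustering of its NaN pattern."""
    miss = np.isnan(X).T.astype(float)    # one 0/1 row per variable
    dist = pdist(miss, metric="hamming")  # share of rows where patterns differ
    tree = linkage(dist, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal-gradient group LASSO for squared-error loss.

    Minimises (1/2n)||y - Xb||^2 + lam * sum_g sqrt(p_g) * ||b_g||_2.
    """
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y) / n)
        for g in np.unique(groups):
            idx = groups == g
            norm = np.linalg.norm(z[idx])
            thresh = step * lam * np.sqrt(idx.sum())
            # Group soft-thresholding: shrink the whole group or zero it out.
            z[idx] = 0.0 if norm <= thresh else (1 - thresh / norm) * z[idx]
        beta = z
    return beta


def select_within_clusters(X, y, clusters, groups, lam):
    """Fit one group LASSO per missingness cluster on its complete cases."""
    selected = []
    for c in np.unique(clusters):
        cols = np.flatnonzero(clusters == c)
        rows = ~np.isnan(X[:, cols]).any(axis=1) & ~np.isnan(y)
        if rows.sum() < 2:
            continue
        Xc = X[rows][:, cols]
        Xc = Xc - Xc.mean(axis=0)         # centre so no intercept is needed
        yc = y[rows] - y[rows].mean()
        beta = group_lasso(Xc, yc, groups[cols], lam)
        selected.extend(cols[beta != 0].tolist())
    return sorted(selected)


# Toy data with two missingness blocks; only columns 0 and 1 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = 2 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=400)
X[:200, :3] = np.nan                      # block A unmeasured in the first half
X[200:, 3:] = np.nan                      # block B unmeasured in the second half
groups = np.array([0, 0, 1, 2, 2, 3])     # hypothetical variable groups
clusters = cluster_by_missingness(X, n_clusters=2)
selected = select_within_clusters(X, y, clusters, groups, lam=1.0)
print(clusters, selected)
```

Fitting within a missingness cluster means each model uses the rows that are complete for that cluster's variables, rather than the much smaller set of rows complete across every variable. That is one plausible reading of why clustering first helps under block-wise missingness.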

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7e5a/10399453/217b600d8090/fgene-14-1162690-g001.jpg
