Gentry Amanda Elswick, Kirkpatrick Robert M, Peterson Roseann E, Webb Bradley T
Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, United States.
Department of Psychiatry and Behavioral Sciences, Institute for Genomics in Health, SUNY Downstate Health Sciences University, Brooklyn, NY, United States.
Front Genet. 2023 Jul 20;14:1162690. doi: 10.3389/fgene.2023.1162690. eCollection 2023.
The availability of large-scale biobanks linking genetic data, rich phenotypes, and biological measures is a powerful opportunity for scientific discovery. However, real-world collections frequently have extensive missingness. While missing data prediction is possible, performance is significantly impaired by the block-wise missingness inherent to many biobanks. To address this, we developed Missingness Adapted Group-wise Informed Clustered (MAGIC)-LASSO, which performs hierarchical clustering of variables based on missingness, followed by sequential Group LASSO within clusters. Variables are pre-filtered for missingness and for balance between training and target sets, and final models are built by stepwise inclusion of features ranked by completeness. This research was conducted using the UK Biobank (>500k) to predict unmeasured Alcohol Use Disorders Identification Test (AUDIT) scores. The phenotypic correlation between measured and predicted total scores was 0.67, while genetic correlations between independent subjects were high (>0.86). Phenotypic and genetic correlations in the real-data application, as well as in simulations, demonstrate that the method has significant accuracy and utility for increasing power for genetic loci discovery.
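As a rough illustration of the first step described in the abstract (grouping variables by their missingness patterns) and of ranking features by completeness, a minimal sketch follows. It is not the authors' implementation: the function names, the Hamming distance on missingness indicators, the average-linkage criterion, and the toy data are all assumptions chosen for illustration. The Group LASSO step is not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_by_missingness(X, n_clusters):
    """Group the columns of X whose missingness patterns are similar.

    Assumption: similarity is measured with the Hamming distance between
    per-variable missingness indicators, using average linkage.
    """
    miss = np.isnan(X).T.astype(float)       # shape: variables x subjects
    dist = pdist(miss, metric="hamming")     # fraction of subjects that differ
    tree = linkage(dist, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def rank_by_completeness(X):
    """Return column indices ordered from most to least complete."""
    completeness = 1.0 - np.isnan(X).mean(axis=0)
    return np.argsort(-completeness, kind="stable")


# Toy data with two blocks of missingness (hypothetical, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:100, 0:3] = np.nan    # variables 0-2 missing for the first 100 subjects
X[150:, 3:5] = np.nan    # variables 3-4 missing for the last 50 subjects

labels = cluster_by_missingness(X, n_clusters=3)
order = rank_by_completeness(X)
```

In this toy example, variables that share a missingness block receive the same cluster label, and the fully observed variable is ranked first. In a real biobank the number of clusters and the distance measure would need to be tuned to the data.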