Sharma Jyoti, Jangale Vaishnavi, Shekhawat Rajveer Singh, Yadav Pankaj
Department of Bioscience & Bioengineering, Indian Institute of Technology, Jodhpur, 342030, Rajasthan, India.
School of Artificial Intelligence and Data Science, Indian Institute of Technology, Jodhpur, 342030, Rajasthan, India.
BMC Genomics. 2025 Mar 12;26(1):237. doi: 10.1186/s12864-025-11443-x.
Genome-wide association studies (GWAS) are rapidly advancing due to the improved resolution and completeness provided by Telomere-to-Telomere (T2T) and pangenome assemblies. While recent advancements in GWAS methods have primarily focused on identifying genetic variants associated with discrete phenotypes, approaches for quantitative traits (QTs) remain underdeveloped. This has often led to significant variants being overlooked due to biases from genotype multicollinearity and strict p-value thresholds.
We propose an enhanced ensemble learning approach for QT analysis that integrates regularized variant selection with machine learning-based association methods, validated through comprehensive biological enrichment analysis. We benchmarked four widely recognized single nucleotide polymorphism (SNP) feature selection methods (least absolute shrinkage and selection operator, ridge regression, elastic-net, and mutual information) alongside four association methods: linear regression, random forest, support vector regression (SVR), and XGBoost. Our approach was evaluated on simulated datasets and validated using a subset of the PennCATH real dataset, including imputed versions, focusing on low-density lipoprotein (LDL)-cholesterol levels as a QT. The combination of elastic-net with SVR outperformed other methods across all datasets. Functional annotation of the top 100 SNPs identified through this best-performing ensemble method revealed their expression in tissues involved in LDL cholesterol regulation. We also confirmed the involvement of six known genes (APOB, TRAPPC9, RAB2A, CCL24, FCHO2, and EEPD1) in cholesterol-related pathways and identified potential drug targets, including APOB, PTK2B, and PTPN12.
In conclusion, our ensemble learning approach effectively identifies variants associated with QTs, and we expect its performance to improve further with the integration of T2T and pangenome references in future GWAS.
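To make the elastic-net plus SVR combination named in the abstract more concrete, the following is a minimal sketch, not the authors' pipeline. It assumes scikit-learn and NumPy, simulated additive 0/1/2 genotypes, and arbitrary illustrative hyperparameters (`alpha`, `l1_ratio`, an RBF kernel); the function names `simulate_genotypes` and `elastic_net_then_svr` are hypothetical. Elastic-net keeps SNPs with non-zero coefficients, and SVR is then fitted on that subset only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def simulate_genotypes(n_samples=300, n_snps=200, n_causal=5, seed=0):
    """Simulate 0/1/2 genotypes and a quantitative trait (illustrative only)."""
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.2, 0.5, size=n_snps)  # assumed minor allele frequencies
    X = rng.binomial(2, maf, size=(n_samples, n_snps)).astype(float)
    causal = rng.choice(n_snps, size=n_causal, replace=False)
    beta = np.zeros(n_snps)
    beta[causal] = 1.0  # assumed fixed additive effect per causal SNP
    y = X @ beta + rng.normal(0.0, 1.0, size=n_samples)
    return X, y, causal


def elastic_net_then_svr(X, y, alpha=0.2, l1_ratio=0.5, seed=0):
    """Select SNPs with elastic-net, then fit SVR on the selected SNPs."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    # Fit the scaler on training data only to avoid leaking test information.
    scaler = StandardScaler().fit(X_tr)
    Z_tr, Z_te = scaler.transform(X_tr), scaler.transform(X_te)

    # Step 1: regularized variant selection (non-zero coefficients are kept).
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
    enet.fit(Z_tr, y_tr)
    selected = np.flatnonzero(enet.coef_)
    if selected.size == 0:
        raise ValueError("elastic-net selected no SNPs; try a smaller alpha")

    # Step 2: association model on the reduced SNP set.
    svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
    svr.fit(Z_tr[:, selected], y_tr)
    r2 = svr.score(Z_te[:, selected], y_te)  # held-out R^2
    return selected, r2


if __name__ == "__main__":
    X, y, causal = simulate_genotypes()
    selected, r2 = elastic_net_then_svr(X, y)
    print(f"selected {selected.size} of {X.shape[1]} SNPs")
    print(f"causal SNPs recovered: {len(set(causal) & set(selected))}")
    print(f"held-out R^2: {r2:.3f}")
```

In practice the penalty strength and mixing ratio would be tuned by cross-validation, and real genotype data would need quality control and covariate adjustment before this step; none of those choices are specified in the abstract.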