Convolutional Embedded Networks for Population Scale Clustering and Bio-Ancestry Inferencing.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2022 Jan-Feb;19(1):369-382. doi: 10.1109/TCBB.2020.2994649. Epub 2022 Feb 3.

Abstract

The study of genetic variants (GVs) can help find correlated population groups, identify cohorts that are predisposed to common diseases, and explain differences in disease susceptibility and in how patients react to drugs. Machine learning techniques are increasingly being applied to identify interacting GVs and to understand their complex phenotypic traits. The performance of a learning algorithm depends not only on the size and nature of the data but also on the quality of the underlying representation. Deep neural networks (DNNs) can learn non-linear mappings that transform GV data into representations that are friendlier to clustering and classification than those obtained by manual feature selection. In this paper, we propose convolutional embedded networks (CEN), in which we combine two DNN architectures, convolutional embedded clustering (CEC) and a convolutional autoencoder (CAE) classifier, for clustering individuals and predicting geographic ethnicity based on GVs, respectively. We applied CAE-based representation learning to 95 million GVs from the '1000 Genomes' (covering 2,504 individuals from 26 ethnic origins) and 'Simons Genome Diversity' (covering 279 individuals from 130 ethnic origins) projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index (ARI) of 0.915, a normalized mutual information (NMI) of 0.92, and a clustering accuracy (ACC) of 89 percent. The CAE classifier, in turn, can predict the geographic ethnicity of unknown samples with F1 and Matthews correlation coefficient (MCC) scores of 0.9004 and 0.8245, respectively. Further, to provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHapley Additive exPlanations (SHAP). Overall, our approach is transparent and faster than the baseline methods, and scalable for 5 to 100 percent of the full human genome.
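
For readers unfamiliar with the clustering scores quoted in the abstract, the following is a minimal sketch of how an adjusted Rand index and a clustering accuracy can be computed from two label vectors. It is not the authors' evaluation code. The function names, the toy labels, and the brute-force search over cluster-to-class assignments are illustrative assumptions; the brute force is only practical for a handful of clusters, and larger problems typically use the Hungarian algorithm instead.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(truth)
    if n < 2:
        return 1.0  # no sample pairs to compare
    cells = Counter(zip(truth, pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate labelings, e.g. all singletons
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def clustering_accuracy(truth, pred):
    """Best fraction of matches over one-to-one maps from cluster ids to classes.

    Brute force over permutations: an assumption that keeps the sketch
    dependency-free, suitable only for a small number of clusters.
    """
    classes = list(set(truth))
    clusters = list(set(pred))
    cells = Counter(zip(truth, pred))
    # Pad with unique dummy classes when there are more clusters than classes.
    padding = [object() for _ in range(max(0, len(clusters) - len(classes)))]
    best = 0
    for assigned in permutations(classes + padding, len(clusters)):
        hits = sum(cells[(cls, clu)] for cls, clu in zip(assigned, clusters))
        best = max(best, hits)
    return best / len(truth)


# Toy labels for illustration only (not data from the paper).
truth = ["EUR", "EUR", "EUR", "AFR", "AFR", "EAS", "EAS", "EAS"]
pred = [0, 0, 1, 1, 1, 2, 2, 2]

print(round(adjusted_rand_index(truth, pred), 3))  # 0.619
print(clustering_accuracy(truth, pred))            # 0.875
```

In this toy case one sample is placed in the wrong cluster, so the accuracy is 7/8, while the ARI is lower because it counts agreement over pairs of samples and corrects for chance.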
