Convolutional Embedded Networks for Population Scale Clustering and Bio-Ancestry Inferencing.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2022 Jan-Feb;19(1):369-382. doi: 10.1109/TCBB.2020.2994649. Epub 2022 Feb 3.

DOI:10.1109/TCBB.2020.2994649
PMID:32750845
Abstract

The study of genetic variants (GVs) can help find correlated population groups, identify cohorts that are predisposed to common diseases, and explain differences in disease susceptibility and in how patients react to drugs. Machine learning techniques are increasingly being applied to identify interacting GVs and to understand their complex phenotypic traits. Since the performance of a learning algorithm depends not only on the size and nature of the data but also on the quality of the underlying representation, deep neural networks (DNNs), which can learn non-linear mappings, can transform GV data into representations that are friendlier to clustering and classification than those obtained by manual feature selection. In this paper, we propose convolutional embedded networks (CEN), which combine two DNN architectures, convolutional embedded clustering (CEC) and a convolutional autoencoder (CAE) classifier, for clustering individuals and predicting geographic ethnicity based on GVs, respectively. We applied CAE-based representation learning to 95 million GVs from the '1000 Genomes' (covering 2,504 individuals from 26 ethnic origins) and 'Simons Genome Diversity' (covering 279 individuals from 130 ethnic origins) projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index (ARI) of 0.915, a normalized mutual information (NMI) of 0.92, and a clustering accuracy (ACC) of 89 percent. The CAE classifier, in turn, can predict the geographic ethnicity of unknown samples with F1 and Matthews correlation coefficient (MCC) scores of 0.9004 and 0.8245, respectively. Further, to provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHapley Additive exPlanations (SHAP). Overall, our approach is transparent and faster than the baseline methods, and scalable for 5 to 100 percent of the full human genome.
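
The clustering scores quoted in the abstract (ARI and ACC) can be illustrated with a small sketch. This is not the authors' evaluation code: the function names `adjusted_rand_index` and `clustering_accuracy` are hypothetical, the labels in the usage note are toy data, and the brute-force cluster-to-class matching is an assumption chosen to keep the example dependency-free (it is only practical for a handful of clusters).

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))  # contingency table entries
    rows = Counter(true_labels)                      # true class sizes
    cols = Counter(pred_labels)                      # predicted cluster sizes
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)      # index expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                        # degenerate case, e.g. one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def clustering_accuracy(true_labels, pred_labels):
    """Best accuracy over one-to-one assignments of cluster ids to class labels."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    cells = Counter(zip(true_labels, pred_labels))
    # Pad with None so that surplus clusters can stay unmatched.
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for assignment in permutations(padded, len(clusters)):
        hits = sum(cells[(cls, clu)] for clu, cls in zip(clusters, assignment))
        best = max(best, hits)
    return best / len(true_labels)
```

For example, with toy labels `true = [0, 0, 0, 1, 1, 1]` and `pred = [1, 1, 0, 0, 0, 0]`, one sample is misplaced: the accuracy is 5/6 and the ARI is 1.2/3.7, roughly 0.32. A clustering that matches the true classes up to a renaming of cluster ids scores 1.0 on both measures.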

Similar articles

1
Convolutional Embedded Networks for Population Scale Clustering and Bio-Ancestry Inferencing.
IEEE/ACM Trans Comput Biol Bioinform. 2022 Jan-Feb;19(1):369-382. doi: 10.1109/TCBB.2020.2994649. Epub 2022 Feb 3.
2
Verifying explainability of a deep learning tissue classifier trained on RNA-seq data.
Sci Rep. 2021 Jan 29;11(1):2641. doi: 10.1038/s41598-021-81773-9.
3
Classification and Explanation for Intrusion Detection System Based on Ensemble Trees and SHAP Method.
Sensors (Basel). 2022 Feb 3;22(3):1154. doi: 10.3390/s22031154.
4
Orthogonal convolutional neural networks for automatic sleep stage classification based on single-channel EEG.
Comput Methods Programs Biomed. 2020 Jan;183:105089. doi: 10.1016/j.cmpb.2019.105089. Epub 2019 Sep 27.
5
Compressing Deep Networks by Neuron Agglomerative Clustering.
Sensors (Basel). 2020 Oct 23;20(21):6033. doi: 10.3390/s20216033.
6
Deep Learning Feature Extraction Approach for Hematopoietic Cancer Subtype Classification.
Int J Environ Res Public Health. 2021 Feb 23;18(4):2197. doi: 10.3390/ijerph18042197.
7
Interpretation of machine learning models using shapley values: application to compound potency and multi-target activity predictions.
J Comput Aided Mol Des. 2020 Oct;34(10):1013-1026. doi: 10.1007/s10822-020-00314-0. Epub 2020 May 2.
8
CNNDLP: A Method Based on Convolutional Autoencoder and Convolutional Neural Network with Adjacent Edge Attention for Predicting lncRNA-Disease Associations.
Int J Mol Sci. 2019 Aug 30;20(17):4260. doi: 10.3390/ijms20174260.
9
Semi Supervised Learning with Deep Embedded Clustering for Image Classification and Segmentation.
IEEE Access. 2019;7:11093-11104. doi: 10.1109/ACCESS.2019.2891970. Epub 2019 Jan 9.
10
Representation learning for mammography mass lesion classification with convolutional neural networks.
Comput Methods Programs Biomed. 2016 Apr;127:248-57. doi: 10.1016/j.cmpb.2015.12.014. Epub 2016 Jan 7.

Cited by

1
A systematic review of deep learning methods for community detection in social networks.
Front Artif Intell. 2025 Aug 22;8:1572645. doi: 10.3389/frai.2025.1572645. eCollection 2025.
2
SNVstory: inferring genetic ancestry from genome sequencing data.
BMC Bioinformatics. 2024 Feb 20;25(1):76. doi: 10.1186/s12859-024-05703-y.
3
A Study and Analysis of Disease Identification using Genomic Sequence Processing Models: An Empirical Review.
Curr Genomics. 2023 Dec 12;24(4):207-235. doi: 10.2174/0113892029269523231101051455.
4
Reference flow: reducing reference bias using multiple population genomes.
Genome Biol. 2021 Jan 4;22(1):8. doi: 10.1186/s13059-020-02229-3.