Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices.

Affiliations

Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh.

Department of Computer Science and Engineering, Eastern University, Dhaka, Bangladesh.

Publication information

BMC Genomics. 2020 Jul 20;21(1):497. doi: 10.1186/s12864-020-06892-5.

DOI: 10.1186/s12864-020-06892-5
PMID: 32689946
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7370488/
Abstract

BACKGROUND

With the rapid growth in the number of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large-scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean), require complete distance matrices without any missing data.

RESULTS

We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and on autoencoder-based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternative distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data.
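As a rough illustration of the matrix factorization idea mentioned above, the following minimal sketch fills missing entries of a small distance matrix using a low-rank fit computed by regularized alternating least squares. It is not the authors' implementation (that is in the repository linked in the Conclusions); the function name `impute_distances`, its parameters, and the toy 5-taxon matrix are all assumptions made for this example.

```python
import numpy as np

def impute_distances(D, rank=2, iters=50, reg=0.1, seed=0):
    """Hypothetical helper: fill NaN entries of a symmetric distance
    matrix with values from a low-rank approximation U @ V.T."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    observed = ~np.isnan(D)              # True where a distance is known
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, rank))
    V = rng.normal(size=(n, rank))
    ridge = reg * np.eye(rank)           # keeps each small system well-posed
    for _ in range(iters):
        # Update each row factor using only the observed entries of that row.
        for i in range(n):
            A = V[observed[i]]
            U[i] = np.linalg.solve(A.T @ A + ridge, A.T @ D[i, observed[i]])
        # Update each column factor using only the observed entries of that column.
        for j in range(n):
            B = U[observed[:, j]]
            V[j] = np.linalg.solve(B.T @ B + ridge, B.T @ D[observed[:, j], j])
    approx = U @ V.T
    approx = (approx + approx.T) / 2     # enforce symmetry of the estimate
    filled = np.where(observed, D, approx)  # keep known distances untouched
    np.fill_diagonal(filled, 0.0)        # self-distance is zero
    return np.maximum(filled, 0.0)       # distances cannot be negative

# Toy 5-taxon matrix (made up for illustration) with two missing pairs.
D = np.array([
    [0, 2, 4, 6, 6],
    [2, 0, 4, 6, 6],
    [4, 4, 0, 6, 6],
    [6, 6, 6, 0, 4],
    [6, 6, 6, 4, 0],
], dtype=float)
D[0, 3] = D[3, 0] = np.nan
D[1, 4] = D[4, 1] = np.nan

filled = impute_distances(D)
print(np.round(filled, 2))
```

The completed matrix can then be passed to any distance-based tree builder such as NJ or UPGMA. The rank, regularization strength, and iteration count here are arbitrary choices for the sketch and would need tuning on real data.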

CONCLUSIONS

This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9681/7370488/4e81eefc1b5b/12864_2020_6892_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9681/7370488/0551c42ecbe4/12864_2020_6892_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9681/7370488/193c0f449ecb/12864_2020_6892_Fig2_HTML.jpg
