Department of Electrical and Computer Engineering, UC San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA.
Bioinformatics and Systems Biology Graduate Program, UC San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA.
Syst Biol. 2023 May 19;72(1):17-34. doi: 10.1093/sysbio/syac031.
Placing new sequences onto reference phylogenies is increasingly used for analyzing environmental samples, especially microbiomes. Existing placement methods assume that query sequences have evolved under specific models directly on the reference phylogeny. For example, they assume single-gene data (e.g., 16S rRNA amplicons) have evolved under the GTR model on a gene tree. Placement, however, often has a more ambitious goal: extending a (genome-wide) species tree given data from individual genes without knowing the evolutionary model. Addressing this challenging problem requires new directions. Here, we introduce Deep-learning Enabled Phylogenetic Placement (DEPP), an algorithm that learns to extend species trees using single genes without prespecified models. In simulations and on real data, we show that DEPP can match the accuracy of model-based methods without any prior knowledge of the model. We also show that DEPP can use single genes to update the multilocus microbial tree-of-life with high accuracy. We further demonstrate that DEPP can combine 16S and metagenomic data onto a single tree, enabling community structure analyses that take advantage of both sources of data. [Deep learning; gene tree discordance; metagenomics; microbiome analyses; neural networks; phylogenetic placement.]