Scaling DEPP phylogenetic placement to ultra-large reference trees: a tree-aware ensemble approach.

Author information

Electrical and Computer Engineering Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States.

Pediatrics Department, University of California San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, United States.

Publication information

Bioinformatics. 2024 Jun 3;40(6). doi: 10.1093/bioinformatics/btae361.

Abstract

MOTIVATION

Phylogenetic placement of a query sequence on a backbone tree is increasingly used across biomedical sciences to identify the contents of a sample from its DNA. The accuracy of such analyses depends on the density of the backbone tree, making it crucial that placement methods scale to very large trees. Moreover, a new paradigm has recently been proposed to place sequences on the species tree using single-gene data. The goal is to better characterize samples and to enable combined analyses of marker-gene (e.g., 16S rRNA gene amplicon) and genome-wide data. The recent method DEPP enables such analyses using metric learning. However, metric learning is hampered by the need to compute and store, during training, a matrix of pairwise distances that grows quadratically with the number of backbone species. Thus, the training phase of DEPP does not scale beyond roughly 10 000 backbone species, a problem that we faced when trying to use our recently released Greengenes2 (GG2) reference tree containing 331 270 species.
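
To give a rough sense of the quadratic growth described above, the following sketch estimates the memory needed to hold a dense pairwise distance matrix at the two backbone sizes mentioned. The function name `pairwise_matrix_gib` and the 4-byte (float32) entry size are assumptions made for illustration; the abstract does not state how DEPP stores the matrix.

```python
def pairwise_matrix_gib(n_species: int, bytes_per_entry: int = 4) -> float:
    """Memory in GiB for a dense n x n matrix of pairwise distances.

    bytes_per_entry=4 assumes float32 storage (an illustrative assumption,
    not a documented property of DEPP).
    """
    return n_species * n_species * bytes_per_entry / 2**30


# Backbone sizes taken from the abstract.
small = pairwise_matrix_gib(10_000)    # approximate limit of DEPP training
large = pairwise_matrix_gib(331_270)   # Greengenes2 (GG2) reference tree

print(f"10 000 species:  {small:.2f} GiB")
print(f"331 270 species: {large:.1f} GiB")
print(f"growth factor:   {large / small:.0f}x")
```

Under these assumptions, a roughly 33-fold increase in species count translates into a matrix about 1100 times larger, which illustrates why a quadratic-memory training step stops being practical at the GG2 scale.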

RESULTS

This paper explores divide-and-conquer for training ensembles of DEPP models, culminating in a method called C-DEPP. While divide-and-conquer has been extensively used in phylogenetics, applying divide-and-conquer to data-hungry machine-learning methods needs nuance. C-DEPP uses carefully crafted techniques to enable quasi-linear scaling while maintaining accuracy. C-DEPP enables placing 20 million 16S fragments on the GG2 reference tree in 41 h of computation.

AVAILABILITY AND IMPLEMENTATION

The dataset and C-DEPP software are freely available at https://github.com/yueyujiang/dataset_cdepp/.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f76b/11193062/97cec91236c4/btae361f1.jpg
