Disambiguating a Soft Metagenomic Clustering.

Author information

Nihalani Rahul, Zola Jaroslaw, Aluru Srinivas

Affiliations

Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.

Department of Computer Science and Engineering, University at Buffalo, Buffalo, New York, USA.

Publication information

J Comput Biol. 2025 May;32(5):473-485. doi: 10.1089/cmb.2024.0825. Epub 2025 Mar 7.

Abstract

Clustering is a popular technique used for analyzing amplicon sequencing data in metagenomics. Specifically, it is used to assign sequences (reads) to clusters, each cluster representing a species or a higher-level taxonomic unit. Reads from multiple species often share subsequences, which, combined with the lack of a perfect similarity measure, makes it difficult to correctly assign reads to clusters. Thus, metagenomic clustering methods must either resort to ambiguity or make the best available choice at each read assignment stage, which could lead to incorrect clusters and potentially cascading errors. In this article, we argue for first generating an ambiguous clustering and then resolving the ambiguities collectively by analyzing the ambiguous clusters. We propose a rigorous formulation of this problem and show that it is NP-Hard. We then propose an efficient heuristic to solve it in practice. We validate our approach on several synthetically generated datasets and two datasets consisting of 16S rDNA sequences from the microbiome of rat guts.
