Unsupervised two-way clustering of metagenomic sequences.

Authors

Prabhakara Shruthi, Acharya Raj

Affiliation

Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA.

Publication information

J Biomed Biotechnol. 2012;2012:153647. doi: 10.1155/2012/153647. Epub 2012 Apr 5.

Abstract

A major challenge facing metagenomics is the development of tools for characterizing the functional and taxonomic content of vast amounts of short metagenome reads. The efficacy of clustering methods depends on the number of reads in the dataset, the read length, and the relative abundances of the source genomes in the microbial community. In this paper, we formulate an unsupervised naive Bayes multispecies, multidimensional mixture model for reads from a metagenome. We use the proposed model to cluster metagenomic reads by their species of origin and to characterize the abundance of each species. We model the distribution of word counts along a genome as a Gaussian for shorter, frequent words and as a Poisson for longer, rare words. We employ either a mixture of Gaussians or a mixture of Poissons to model reads within each bin. Further, we handle the high dimensionality and sparsity of the data by grouping the set of words comprising the reads, resulting in a two-way mixture model. Finally, we demonstrate the accuracy and applicability of this method on simulated and real metagenomes. Our method can accurately cluster reads as short as 100 bp and is robust to varying abundances, divergences, and read lengths.
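The mixture-of-Poissons idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each read is reduced to a vector of overlapping word (k-mer) counts, treats the counts as independent Poissons within a species (the naive Bayes assumption), and fits the mixture with a plain EM loop, which the abstract does not specify. The grouping of words that yields the two-way model is omitted. All names (`word_counts`, `farthest_point_init`, `poisson_mixture_em`, `simulate_reads`) and the small smoothing constants are hypothetical choices made for illustration.

```python
import math
import random
from itertools import product

ALPHABET = "ACGT"

def word_counts(read, k=2):
    """Count overlapping words (k-mers) of length k in one read."""
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # ignore words containing ambiguous bases
            counts[j] += 1
    return counts

def farthest_point_init(X, n_clusters):
    """Deterministic seeding (an assumption of this sketch): start at read 0,
    then repeatedly add the read farthest from all seeds chosen so far."""
    chosen = [0]
    while len(chosen) < n_clusters:
        def gap(i):
            return min(sum((a - b) ** 2 for a, b in zip(X[i], X[c]))
                       for c in chosen)
        chosen.append(max(range(len(X)), key=gap))
    return chosen

def poisson_mixture_em(X, n_clusters, n_iter=30):
    """EM for a mixture of independent (naive Bayes) Poissons.

    Returns (weights, rates, resp): weights[s] estimates the relative
    abundance of cluster s in reads, rates[s][j] is the Poisson mean of
    word j in cluster s, and resp[i][s] is the posterior probability
    that read i belongs to cluster s.
    """
    n, d = len(X), len(X[0])
    seeds = farthest_point_init(X, n_clusters)
    rates = [[X[c][j] + 0.5 for j in range(d)] for c in seeds]
    weights = [1.0 / n_clusters] * n_clusters
    resp = []
    for _ in range(n_iter):
        # E-step: posterior over clusters; the x! term cancels across clusters.
        resp = []
        for x in X:
            logp = [math.log(weights[s])
                    + sum(x[j] * math.log(rates[s][j]) - rates[s][j]
                          for j in range(d))
                    for s in range(n_clusters)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate weights and rates (lightly smoothed).
        for s in range(n_clusters):
            ns = sum(r[s] for r in resp)
            weights[s] = max(ns / n, 1e-12)
            for j in range(d):
                num = sum(resp[i][s] * X[i][j] for i in range(n))
                rates[s][j] = (num + 1e-3) / (ns + 1e-3)
    return weights, rates, resp

def simulate_reads(base_probs, n_reads, read_len, rng):
    """Toy simulator: i.i.d. bases with the given A, C, G, T composition."""
    return ["".join(rng.choices(ALPHABET, weights=base_probs, k=read_len))
            for _ in range(n_reads)]
```

A possible usage is to simulate reads from two toy "genomes" with different base composition, convert each read with `word_counts`, call `poisson_mixture_em(X, n_clusters=2)`, and assign each read to the cluster with the largest entry in its row of `resp`. Real genomes differ far less than such toy data, which is why the abstract's choice of word length and the grouping of words matter in practice.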

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/103e/3336163/fb82d38612ab/JBB2012-153647.001.jpg
