A comprehensive whole genome bacterial phylogeny using correlated peptide motifs defined in a high dimensional vector space.

Author information

Stuart Gary W, Berry Michael W

Affiliation

Department of Life Sciences, Indiana State University, Terre Haute, IN 47809, USA.

Publication information

J Bioinform Comput Biol. 2003 Oct;1(3):475-93. doi: 10.1142/s0219720003000265.

Abstract

As whole genome sequences continue to expand in number and complexity, effective methods for comparing and categorizing both genes and species represented within extremely large datasets are required. Methods introduced to date have generally utilized incomplete and likely insufficient subsets of the available data. We have developed an accurate and efficient method for producing robust gene and species phylogenies using very large whole genome protein datasets. This method relies on multidimensional protein vector definitions supplied by the singular value decomposition (SVD) of a large sparse data matrix in which each protein is uniquely represented as a vector of overlapping tetrapeptide frequencies. Quantitative pairwise estimates of species similarity were obtained by summing the protein vectors to form species vectors, then determining the cosines of the angles between species vectors. Evolutionary trees produced using this method confirmed many accepted prokaryotic relationships. However, several unconventional relationships were also noted. In addition, we demonstrate that many of the SVD-derived right basis vectors represent particular conserved protein families, while many of the corresponding left basis vectors describe conserved motifs within these families as sets of correlated peptides (copeps). This analysis represents the most detailed simultaneous comparison of prokaryotic genes and species available to date.
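The pipeline described in the abstract (tetrapeptide frequency vectors, SVD, summed species vectors, cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the dense NumPy SVD, the raw count weighting, the scaling of protein vectors by singular values, and the `rank` parameter are all assumptions made for illustration; the abstract does not specify them. The matrix is oriented peptides by proteins, so that left singular vectors live in peptide space and right singular vectors in protein space, as the abstract describes.

```python
import numpy as np


def tetrapeptide_counts(seq, index):
    """Count overlapping 4-residue windows of seq into a dense vector."""
    v = np.zeros(len(index))
    for i in range(len(seq) - 3):
        j = index.get(seq[i:i + 4])
        if j is not None:
            v[j] += 1
    return v


def species_cosines(proteomes, rank=2):
    """Return (species names, cosine similarity matrix).

    proteomes: dict mapping a species name to a list of protein strings.
    rank: number of singular dimensions kept (an assumed parameter).
    """
    names = list(proteomes)
    all_seqs = [s for name in names for s in proteomes[name]]

    # Vocabulary of observed tetrapeptides only, to keep the toy matrix small.
    vocab = sorted({s[i:i + 4] for s in all_seqs for i in range(len(s) - 3)})
    index = {pep: k for k, pep in enumerate(vocab)}

    # Peptide-by-protein count matrix: one column per protein.
    A = np.column_stack([tetrapeptide_counts(s, index) for s in all_seqs])

    # Dense SVD for illustration; a sparse truncated solver such as
    # scipy.sparse.linalg.svds would be the usual choice at genome scale.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(rank, len(S))

    # Reduced protein vectors: rows of V scaled by singular values (assumption).
    protein_vecs = Vt[:k].T * S[:k]

    # Sum each species' protein vectors to obtain its species vector.
    species_vecs = []
    start = 0
    for name in names:
        n = len(proteomes[name])
        species_vecs.append(protein_vecs[start:start + n].sum(axis=0))
        start += n
    M = np.array(species_vecs)

    # Cosine of the angle between every pair of species vectors.
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty vectors
    unit = M / norms
    return names, unit @ unit.T
```

The resulting cosine matrix could be converted to distances (for example, one minus the cosine) and passed to any standard tree-building method; the abstract does not say which the authors used.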
