Evolutionary insights from suffix array-based genome sequence analysis.

Author information

Poddar Anindya, Chandra Nagasuma, Ganapathiraju Madhavi, Sekar K, Klein-Seetharaman Judith, Reddy Raj, Balakrishnan N

Affiliation

Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560 012, India.

Publication information

J Biosci. 2007 Aug;32(5):871-81. doi: 10.1007/s12038-007-0087-z.

DOI:10.1007/s12038-007-0087-z

PMID:17914229

Abstract

Gene and protein sequence analyses, central components of studies in modern biology, are readily amenable to string matching and pattern recognition algorithms. The growing need to analyse whole genome sequences more efficiently and thoroughly has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures that are well known in many other areas and are also highly suited to sequence analysis. Here we report an improvement to the design of suffix array construction. The enhancement in versatility and scalability enabled by this approach is demonstrated through real-life examples. The scalability of the algorithm to whole genomes renders it suitable for addressing many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bigrams and higher n-grams, which indicates that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for their coverage of the possible peptide space; the analysis indicates that as much as a quarter of the total space at the tetrapeptide level is left unsampled in prokaryotic organisms, although almost all tripeptides can be seen in one protein or another in a proteome. In addition, distinct patterns begin to emerge in the counts of particular tetrapeptides and higher peptides, indicative of a 'meaning' for tetra- and higher n-grams. The toolkit has also been used to demonstrate the usefulness of efficiently identifying repeats in whole proteomes. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv, have been found to contain a repeating sequence of 300 amino acids.
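
To illustrate the kind of n-gram analysis the abstract describes, the following is a minimal sketch of counting n-grams with a suffix array and estimating how much of the possible peptide space a set of proteins covers. It is not the authors' improved construction, which the abstract does not detail; the suffix array here is built by naive sorting, which is adequate only for small inputs. The function names, the `$` separator and the toy sequences are all assumptions made for illustration.

```python
from itertools import groupby
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SEP = "$"  # assumed separator; must not occur in any sequence


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: sort suffix start positions lexicographically.

    Simple but slow for long inputs; genome-scale work needs a dedicated
    construction algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_ngrams(text: str, sa: List[int], n: int) -> Dict[str, int]:
    """Count each n-gram in one pass over the suffix array.

    Suffixes that share their first n characters are adjacent in sorted
    order, so equal n-grams form contiguous runs that groupby can tally.
    """
    grams = (text[i:i + n] for i in sa if i + n <= len(text))
    return {g: sum(1 for _ in run) for g, run in groupby(grams) if SEP not in g}


def peptide_space_coverage(proteins: List[str], n: int) -> float:
    """Fraction of the 20**n possible n-peptides seen at least once."""
    text = SEP.join(proteins)  # separator stops n-grams spanning two proteins
    counts = count_ngrams(text, build_suffix_array(text), n)
    standard = set(AMINO_ACIDS)
    observed = sum(1 for g in counts if set(g) <= standard)
    return observed / len(AMINO_ACIDS) ** n


if __name__ == "__main__":
    toy_proteome = ["MKVLA", "KVLAM"]  # hypothetical sequences, not real data
    text = SEP.join(toy_proteome)
    bigrams = count_ngrams(text, build_suffix_array(text), 2)
    print(bigrams)
    print(peptide_space_coverage(toy_proteome, 2))
```

Joining the proteins with a separator keeps the index to a single suffix array while ensuring that no counted n-gram straddles a protein boundary. On a real proteome, the same coverage calculation with n = 3 and n = 4 would correspond to the tripeptide and tetrapeptide comparisons mentioned in the abstract.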
