Wren Jonathan D, Johnson David, Gruenwald Le
Advanced Center for Genome Technology, Department of Botany and Microbiology, 101 David L. Boren Blvd, Rm 2025.
BMC Bioinformatics. 2005 Jul 15;6 Suppl 2(Suppl 2):S2. doi: 10.1186/1471-2105-6-S2-S2.
There is an enormous amount of information encoded in each genome, enough to create living, responsive and adaptive organisms. Raw sequence data alone is not enough to understand function, mechanisms or interactions. Changes in a single base pair can lead to disease, such as sickle-cell anemia, while some large megabase deletions have no apparent phenotypic effect. Genomic features vary in their data types, and annotation of these features is spread across multiple databases. Herein, we develop a method to automate exploration of genomes by iteratively searching sequence data for correlations and building upon them. First, to integrate and compare different annotation sources, a sequence matrix (SM) is developed to contain position-dependent information. Second, a classification tree is developed for matrix row types, specifying how each data type is to be treated with respect to other data types for analysis purposes. Third, correlative analyses are developed to analyze the features of each matrix row in terms of the other rows, guided by the classification tree as to which analyses are appropriate. A prototype was developed and was successful in detecting coinciding genomic features among genes, exons, repetitive elements and CpG islands.
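As a rough illustration of the sequence matrix idea, the sketch below stores one row per annotation type and one column per base position, then scores how strongly each pair of rows coincides. It is a minimal sketch under stated assumptions, not the authors' prototype: the function names, the toy coordinates, the half-open interval convention, and the use of the Jaccard index as the coincidence measure are all hypothetical choices made for illustration. It handles only binary (present/absent) rows, whereas the classification tree described above would decide which analysis applies to each row type.

```python
from itertools import combinations


def build_sequence_matrix(length, annotations):
    """Build a binary matrix: one row per feature type, one column per base.

    `annotations` maps a feature name to a list of half-open (start, end)
    intervals. This layout is an assumption made for illustration.
    """
    matrix = {}
    for name, intervals in annotations.items():
        row = [0] * length
        for start, end in intervals:
            # Clip each interval to the sequence bounds before marking it.
            for pos in range(max(0, start), min(length, end)):
                row[pos] = 1
        matrix[name] = row
    return matrix


def jaccard(row_a, row_b):
    """Fraction of annotated positions shared by both rows (0.0 if none)."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return both / either if either else 0.0


def coinciding_rows(matrix):
    """Score every pair of rows; keys are name pairs in sorted order."""
    return {
        (a, b): jaccard(matrix[a], matrix[b])
        for a, b in combinations(sorted(matrix), 2)
    }


# Hypothetical toy annotations on a 15-base sequence.
annotations = {
    "gene": [(2, 12)],
    "exon": [(2, 5), (9, 12)],
    "cpg_island": [(0, 4)],
}
matrix = build_sequence_matrix(15, annotations)
scores = coinciding_rows(matrix)

for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this toy data the exon and gene rows score highest, because every exon position also lies inside the gene. A dense list per row is only practical for short sequences; a genome-scale version would need a sparse or interval-based representation.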