Reneker Jeff, Shyu Chi-Ren
Department of Computer Science, University of Missouri, Columbia, USA.
BMC Bioinformatics. 2005 May 3;6:111. doi: 10.1186/1471-2105-6-111.
Searching for small tandem or dispersed repetitive DNA sequences streamlines many biomedical research processes. For instance, whole-genome array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1), a hereditary disease caused by an expanded CAG repeat in the coding sequence of the disease gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation, which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 identified 46 putative phase-variable genes across the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches, most of which are irrelevant to the researcher.
We present our hash function and our search algorithm for locating small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to uncover cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for eight bases of sequence within 3.224 Gbases on 49 different chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. A cross-species search is also presented to illustrate how potential hidden correlations in genomic data can be quickly discerned: the findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes, without the main memory constraints present in other systems.
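The abstract does not specify the hash function, so the following is only a minimal sketch of one common way to hash and index short DNA words, not the authors' method. It assumes a 2-bit-per-base encoding and an in-memory position index; the names `kmer_hash`, `build_index`, and `search` are hypothetical.

```python
from collections import defaultdict

# Assumed 2-bit code per base; the paper's actual hash function may differ.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer):
    """Pack a DNA word into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def build_index(sequence, k):
    """Map the hash of every length-k word to its 0-based start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        # Skip windows containing ambiguous bases such as N.
        if all(base in BASE_CODE for base in window):
            index[kmer_hash(window)].append(start)
    return index


def search(index, k, query):
    """Return start positions of query; the query length must equal k."""
    if len(query) != k:
        raise ValueError("query length must match the index word length")
    return index.get(kmer_hash(query), [])


# Toy sequence (made up for illustration) containing both Pho4p core sites.
toy = "TTCACGTGAACACGTTCACGTG"
toy_index = build_index(toy, 6)
print(search(toy_index, 6, "CACGTG"))
print(search(toy_index, 6, "CACGTT"))
```

With a fixed word length, distinct words always receive distinct hash values under this encoding, so a lookup needs no collision handling. A system that avoids main memory constraints would presumably keep such an index on disk rather than in a Python dictionary.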
We present a time-efficient algorithm that locates small segments of DNA and concurrently searches the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.
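The refinement step can be illustrated with a small sketch, assuming each hit is reported as a (gene, position) pair and each gene carries a set of annotation terms. The function name `refine_hits`, the gene names, and the terms below are placeholders for illustration, not data or interfaces from the paper.

```python
def refine_hits(hits, annotations, required_terms):
    """Keep hits whose gene annotation contains every required term.

    hits: list of (gene, position) pairs from a sequence search.
    annotations: dict mapping gene name to an iterable of annotation terms.
    required_terms: terms the user wants; matching is case-insensitive.
    """
    wanted = {term.lower() for term in required_terms}
    kept = []
    for gene, position in hits:
        terms = {term.lower() for term in annotations.get(gene, ())}
        if wanted <= terms:  # every wanted term is present for this gene
            kept.append((gene, position))
    return kept


# Placeholder hits and annotations, made up purely for illustration.
hits = [("geneA", 120), ("geneB", 45), ("geneC", 300)]
annotations = {
    "geneA": ["phosphate metabolism", "transport"],
    "geneB": ["cell cycle"],
}
print(refine_hits(hits, annotations, ["Phosphate metabolism"]))
```

Genes without any annotation entry are dropped whenever at least one term is required, which is one possible policy; a real system might instead report them separately.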