Collins Jack R, Stephens Robert M, Gold Bert, Long Bill, Dean Michael, Burt Stanley K
Advanced Biomedical Computing Center, NCI-Frederick, Frederick, MD, USA.
Genomics. 2003 Jul;82(1):10-9. doi: 10.1016/s0888-7543(03)00076-4.
The current pace of sequence data generation requires the development of software tools that can rapidly provide full annotation of the data. We have developed a new method for rapid sequence comparison using the exact match algorithm without repeat masking. As a demonstration, we have identified all perfect simple tandem repeats (STRs) within the draft sequence of the human genome. The STR elements (chromosome, position, length and repeat subunit) have been placed into a relational database. Repeat-flanking sequence is also publicly accessible at http://grid.abcc.ncifcrf.gov. To illustrate the utility of this complete set of STR elements, we documented the increased density of potentially polymorphic markers throughout the genome. The new STR markers may be useful in disease association studies because so many STR elements manifest multiallelic polymorphism. Also, because triplet repeat expansions are important for human disease etiology, we identified trinucleotide repeats that exist within exons of known genes. This resulted in a list that includes all 14 genes known to undergo polynucleotide expansion, and 48 additional candidates. Several of these are non-polyglutamine triplet repeats. Other examinations of the STR database demonstrated repeats spanning splice junctions and identified SNPs within repeat elements.
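As an illustration of what a "perfect" STR record might look like, the following is a minimal Python sketch that scans a short DNA string for uninterrupted tandem repeats and reports position, length and repeat subunit, mirroring the fields named in the abstract. It is not the authors' exact match method, which the abstract does not describe in detail; it uses regular-expression backreferences instead. The function name find_perfect_strs, the Repeat record, and the thresholds max_unit and min_copies are assumptions made for this example.

```python
import re
from typing import Iterator, NamedTuple


class Repeat(NamedTuple):
    start: int   # 0-based offset of the repeat tract in the sequence
    length: int  # total tract length in bases
    unit: str    # repeat subunit, e.g. "CAG"


def find_perfect_strs(seq: str, max_unit: int = 6,
                      min_copies: int = 3) -> Iterator[Repeat]:
    """Yield perfect tandem repeats with subunits of 1..max_unit bases.

    Hypothetical helper: thresholds are illustrative, not from the paper.
    """
    seq = seq.upper()
    for unit_len in range(1, max_unit + 1):
        # A subunit followed by at least (min_copies - 1) exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip subunits that are themselves repeats of a shorter subunit
            # (e.g. "AA" or "CACA"); those tracts are reported at the shorter length.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            yield Repeat(m.start(), len(m.group(0)), unit)


if __name__ == "__main__":
    example = "TT" + "CAG" * 5 + "TT" + "AC" * 4 + "G"
    for rep in sorted(find_perfect_strs(example)):
        print(rep)
```

Because the regular expression reports the leftmost match, the subunit phase (for example "CAG" versus "GCA") depends on where the scan first succeeds; a production tool would normalize the subunit to a canonical rotation before storing it in a database.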