Benson G
Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York, NY 10029-6574, USA.
Nucleic Acids Res. 1999 Jan 15;27(2):573-80. doi: 10.1093/nar/27.2.573.
A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm's speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human beta T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.
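To make the two model quantities concrete, the following is a minimal sketch of how percent identity and indel frequency between two adjacent pattern copies might be measured. It is not the algorithm presented in the paper: it assumes the copies have already been located, and it substitutes a plain unit-cost global alignment for the paper's statistical recognition criteria. The function name `align_stats` and the example copies are hypothetical.

```python
def align_stats(a: str, b: str) -> tuple[float, float]:
    """Globally align a and b with unit-cost edit distance and return
    (percent identity, percent indels) over the alignment columns.

    Illustrative assumption only: a simple stand-in for comparing two
    adjacent copies of a tandem repeat, not the paper's method.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        raise ValueError("at least one sequence must be non-empty")

    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back one optimal alignment, counting matches and indels.
    i, j = n, m
    matches = indels = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])):
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            indels += 1  # base present in a, gap in b
            i -= 1
        else:
            indels += 1  # base present in b, gap in a
            j -= 1
    return 100.0 * matches / columns, 100.0 * indels / columns


# Hypothetical adjacent copies of a 4-base pattern; the third has a deletion.
copies = ["ACGT", "ACCT", "ACT"]
for left, right in zip(copies, copies[1:]):
    identity, indel = align_stats(left, right)
    print(f"{left} vs {right}: identity={identity:.1f}% indels={indel:.1f}%")
```

Percentages here are taken over alignment columns, which is one of several reasonable conventions; a different denominator (for example, pattern length) would give slightly different values.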