Rao Suyog, Rodriguez Alfredo, Benson Gary
Department of Electrical and Computer Engineering, Boston University, Boston, MA, USA.
Genome Inform. 2005;16(1):3-12.
Tandem repeats are an important class of DNA repeats, and much research has focused on their efficient identification, their use in DNA typing and fingerprinting, and their causative role in trinucleotide repeat diseases such as Huntington's disease, myotonic dystrophy, and Fragile-X mental retardation. We are interested in clustering tandem repeats into groups, or families, based on sequence similarity so that their biological importance may be further explored. Clustering tandem repeats requires a notion of pairwise distance, which we obtain by alignment. In this paper we evaluate five distance functions used to produce those alignments: Consensus, Euclidean, Jensen-Shannon Divergence, Entropy-Surface, and Entropy-weighted. Analyzing and comparing these functions is important because the choice of distance metric forms the core of any clustering algorithm. We employ a novel method to compare alignments and thereby compare the distance functions themselves. We rank the distance functions using two cluster validation techniques: Average Cluster Density and Average Silhouette Width. Finally, we propose a multi-phase clustering method that produces good-quality clusters. In this study, we analyze clusters of tandem repeats from five sequences: human chromosomes 3, 5, 10, and X, and C. elegans chromosome III.
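As a rough illustration of two of the distance functions named above, the following Python sketch computes the Euclidean distance and the Jensen-Shannon divergence between two nucleotide frequency profiles. It relies on assumptions not stated in the abstract: that the distances are applied to per-column frequency profiles over A, C, G, and T, and that the standard textbook definitions apply. The function names (`column_profile`, `euclidean`, `jensen_shannon`) are hypothetical and are not taken from the paper.

```python
import math

NUCLEOTIDES = "ACGT"


def column_profile(column):
    """Relative frequencies of A, C, G, T in one alignment column.

    Assumption: characters other than A, C, G, T (e.g. gaps) are ignored.
    """
    counts = [column.count(n) for n in NUCLEOTIDES]
    total = sum(counts)
    if total == 0:
        raise ValueError("column contains no nucleotides")
    return [c / total for c in counts]


def euclidean(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


if __name__ == "__main__":
    p = column_profile("AACA")  # mostly A
    q = column_profile("CCCA")  # mostly C
    print("Euclidean:", euclidean(p, q))
    print("Jensen-Shannon:", jensen_shannon(p, q))
```

With base-2 logarithms the Jensen-Shannon divergence is 0 for identical profiles and 1 for profiles with no nucleotide in common, which makes it convenient to compare across columns. The Euclidean distance between two such profiles ranges from 0 to √2.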