Landès C, Hénaut A, Risler J L
Centre de Génétique Moléculaire du CNRS, Laboratoire Associé à l'Université Gif-sur-Yvette, France.
Comput Appl Biosci. 1993 Apr;9(2):191-6. doi: 10.1093/bioinformatics/9.2.191.
A method aimed at classifying protein sequences without resorting to pairwise alignment is presented. Called DOCMA (DOt-plot Comparisons by Multivariate Analysis), it is based on a multivariate analysis of the pairwise dot-plots between all the sequences in the set. The dot-plots are first simplified by considering only the projections of the 'diagonal' segments of similarity onto the axes. From these projections a data matrix is built, in which each column is representative of the comparisons of one given sequence with all the other ones. This data matrix is then transformed into a distance matrix by a chi-squared analysis, from which the coordinates of the sequences in an orthonormal Euclidean space are obtained. The sequences are finally classified by a dynamic clustering procedure followed by a search for strong clusters. Application of this method to protein families such as the globins, the cytochromes c and the aminoacyl-tRNA synthetases shows that it is quite effective in delineating subgroups that contain even distantly related sequences.
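The last two steps described in the abstract (chi-squared distances turned into Euclidean coordinates, then clustering with a search for strong clusters) can be illustrated with a minimal sketch. This is not the authors' implementation, and several points are assumptions: the "chi-squared analysis" is read here as a correspondence analysis of a non-negative data matrix whose columns are sequences; plain k-means stands in for the "dynamic clustering procedure"; and "strong clusters" are taken to mean sets of sequences that fall in the same cluster in every run. The function names `chi2_coordinates`, `kmeans` and `strong_clusters` and the toy matrix are hypothetical, and building the data matrix from dot-plot projections is not shown.

```python
import numpy as np


def chi2_coordinates(data):
    """Coordinates of the columns such that Euclidean distances between them
    equal chi-squared distances between column profiles.
    Assumes a non-negative matrix with no all-zero row or column."""
    data = np.asarray(data, dtype=float)
    P = data / data.sum()            # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses
    # Standardized residuals from the independence model r * c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, s, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the columns: D_c^{-1/2} V diag(s).
    return (Vt.T * s) / np.sqrt(c)[:, None]


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd k-means with random initial centres (an assumed stand-in
    for the dynamic clustering step)."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centre as is
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def strong_clusters(label_runs):
    """Group items that share a cluster in every run.
    label_runs: sequence of label vectors, one per clustering run."""
    runs = np.asarray(label_runs).T  # shape (n_items, n_runs)
    groups = {}
    for i, signature in enumerate(map(tuple, runs)):
        groups.setdefault(signature, []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Toy matrix (made up): 5 rows of projection features, 4 sequences.
    toy = np.array([[9, 8, 1, 1],
                    [8, 9, 1, 2],
                    [1, 1, 9, 8],
                    [2, 1, 8, 9],
                    [5, 5, 5, 5]], dtype=float)
    coords = chi2_coordinates(toy)
    rng = np.random.default_rng(0)
    runs = [kmeans(coords, k=2, rng=rng) for _ in range(10)]
    print(strong_clusters(runs))
```

Because label numbering is arbitrary from one run to the next, `strong_clusters` compares items by their full tuple of labels across runs rather than by any single label, so no relabelling between runs is needed.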