Park J, Teichmann S A
MRC Laboratory of Molecular Biology, Cambridge, UK.
Bioinformatics. 1998;14(2):144-50. doi: 10.1093/bioinformatics/14.2.144.
Large-scale determination of relationships between the proteins encoded by genome sequences is now common. All protein sequences are matched against each other, and those with high match scores are clustered into families. Where proteins are built of several domains or duplication modules, this can lead to misleading results. Consider the very simple example of three proteins: 1, formed by duplication modules A and B; 2, formed by duplication modules B' and C; and 3, formed by duplication modules C' and D. Duplication modules B and B' are homologous, as are C and C'. Matching the sequences of 1, 2 and 3, followed by simple single-linkage clustering, would put all three in the same family, even though proteins 1 and 3 are not related. This is because different parts of 2 match 1 and 3. This paper describes a procedure, DIVCLUS, that divides such complex clusters of partially related sequences into simple clusters that contain only related duplication modules. In the example just given, it would produce two groups of sequences: the first with domain B of sequence 1 and domain B' of sequence 2, and the second with domain C of sequence 2 and domain C' of sequence 3. DIVCLUS is part of a package called GEANFAMMER, for GEnome ANalysis and protein FAMily MakER. The package automates the detection of families of duplication modules from a protein sequence database.
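The three-protein example can be reproduced in a few lines of Python. The sketch below is illustrative only and is not the DIVCLUS algorithm, whose details are not given here: it assumes each pairwise match is reported as a pair of matched regions `(protein, start, end)`, and all coordinates, names (`P1`, `P2`, `P3`) and helper functions are hypothetical. It contrasts single-linkage clustering of whole sequences with single-linkage clustering of matched regions, where two regions are also linked if they overlap on the same protein.

```python
from collections import defaultdict

# Hypothetical matches for the example: each protein is 200 residues long,
# with one duplication module per half. Coordinates are invented.
matches = [
    (("P1", 101, 200), ("P2", 1, 100)),    # module B of 1 matches B' of 2
    (("P2", 101, 200), ("P3", 1, 100)),    # module C of 2 matches C' of 3
]

class UnionFind:
    """Minimal disjoint-set structure used for single-linkage clustering."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def groups(uf):
    """Return the clusters held in a UnionFind as sorted lists."""
    out = defaultdict(set)
    for x in list(uf.parent):
        out[uf.find(x)].add(x)
    return sorted(sorted(g) for g in out.values())

def whole_sequence_clusters(matches):
    """Single linkage on whole proteins: any match links two proteins."""
    uf = UnionFind()
    for (p, _, _), (q, _, _) in matches:
        uf.union(p, q)
    return groups(uf)

def overlaps(a, b):
    """True if two regions lie on the same protein and share residues."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def region_clusters(matches):
    """Single linkage on matched regions (a simplification, not DIVCLUS)."""
    uf = UnionFind()
    regions = []
    for a, b in matches:
        uf.union(a, b)          # matched regions belong together
        regions += [a, b]
    for i, a in enumerate(regions):
        for b in regions[i + 1:]:
            if overlaps(a, b):  # same stretch of the same protein
                uf.union(a, b)
    return groups(uf)

print(whole_sequence_clusters(matches))
print(region_clusters(matches))
```

Under these assumptions, whole-sequence clustering places all three proteins in one family, while region-level clustering yields two groups, one for the B/B' modules and one for the C/C' modules. Real data would need a tolerance for partially overlapping match boundaries, which this sketch does not attempt.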
DIVCLUS has been applied to the division of single-linkage clusters generated from the protein sequences of six completely sequenced bacterial genomes. Out of 12 013 genes in these six genomes, 4563 single- and multi-domain sequences formed 1071 complex clusters. Application of the DIVCLUS program resolved these clusters into 2113 clusters corresponding to single duplication modules.
The perl5 program and its documentation are available at the following address: http://www.mrc-lmb.cam.ac.uk/genomes/ and by anonymous ftp at ftp.mrc-lmb.cam.ac.uk in the directory /pub/genomes/Software/.
sat@mrc-lmb.cam.ac.uk; jong@mrc-lmb.cam.ac.uk