Petryszak Robert, Kretschmann Ernst, Wieser Daniela, Apweiler Rolf
EMBL Outstation Hinxton, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK.
Bioinformatics. 2005 Sep 15;21(18):3604-9. doi: 10.1093/bioinformatics/bti542. Epub 2005 Jun 16.
The CluSTr database employs a fully automatic single-linkage hierarchical clustering method based on a similarity matrix. To build the matrix, all-against-all pairwise comparisons between protein sequences are first computed using the Smith-Waterman algorithm. The statistical significance of the resulting similarity scores is then assessed by Monte Carlo analysis, yielding Z-values, which populate the matrix. This paper describes automated annotation experiments that quantify the predictive power, and hence the biological relevance, of the CluSTr data. The experiments used the UniProt data-mining framework to derive annotation predictions from combinations of InterPro and CluSTr data. We show that combining these data sources greatly increases the precision of the predictions made by the data-mining framework compared with using InterPro data alone. We conclude that the CluSTr approach to clustering proteins makes a valuable contribution to traditional protein classifications.
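The pipeline summarized above (Smith-Waterman scores, Monte Carlo Z-values, single-linkage clustering) can be illustrated with a minimal Python sketch. It is not the CluSTr implementation. The function names, the match/mismatch/gap scores, the shuffle-based null model, the number of shuffles, and the Z-value threshold are all assumptions chosen for illustration; a production system would more plausibly use a substitution matrix and affine gap penalties.

```python
import random
from statistics import mean, pstdev


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def z_value(a, b, n_shuffles=100, rng=None):
    """Monte Carlo Z-value: compare the observed score with scores
    obtained after shuffling b (assumed null model)."""
    rng = rng or random.Random(0)
    observed = sw_score(a, b)
    chars = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(chars)
        null_scores.append(sw_score(a, "".join(chars)))
    sd = pstdev(null_scores)
    return (observed - mean(null_scores)) / sd if sd > 0 else 0.0


def single_linkage_clusters(z, threshold):
    """Cut the single-linkage hierarchy at one level: clusters are the
    connected components of the graph with an edge wherever
    z[i][j] >= threshold (implemented with union-find)."""
    n = len(z)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if z[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSPLWWNDC"]  # made-up sequences
    n = len(seqs)
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            z[i][j] = z[j][i] = z_value(seqs[i], seqs[j])
    print(single_linkage_clusters(z, threshold=8.0))
```

Because single linkage merges two clusters as soon as any one pair of members passes the threshold, cutting the hierarchy at a given Z-value is the same as taking connected components. Raising the threshold yields progressively tighter, nested clusters.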