Nair Rajesh, Rost Burkhard
Columbia University Bioinformatics Center (CUBIC), Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA.
Protein Sci. 2002 Dec;11(12):2836-47. doi: 10.1110/ps.0207402.
The more proteins diverge in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in subcellular localization. Three results stood out: (1) the subcellular compartment is generally more conserved than might have been expected, given that short sequence motifs such as nuclear localization signals can alter the native compartment; (2) the sequence conservation of localization is similar between different compartments; and (3) the conservation of localization is similar to that of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and nonconserved localization to be very sharp, although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure of sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP distance only for alignments in the subtwilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine-tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes.
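The abstract describes the HSSP distance only as a measure combining pairwise sequence identity and alignment length; it does not state the formula. The sketch below assumes the commonly cited form of the HSSP curve from Rost (1999), in which the distance is the percentage identity minus a length-dependent threshold curve. The function names, the constants as written here, and the default cutoff of 0 are illustrative assumptions, not the thresholds reported in this study.

```python
import math


def hssp_curve(length: int) -> float:
    """Length-dependent identity threshold (percent) for an alignment.

    Assumed form (Rost 1999), not given in the abstract:
      100                                   for L <= 11
      480 * L ** (-0.32 * (1 + e^(-L/1000))) for 11 < L <= 450
      19.5                                  for L > 450
    """
    if length <= 0:
        raise ValueError("alignment length must be positive")
    if length <= 11:
        return 100.0
    if length > 450:
        return 19.5
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return 480.0 * length ** exponent


def hssp_distance(percent_identity: float, length: int) -> float:
    """Distance of an alignment from the curve, in percentage points.

    Positive values lie above the curve (more similar than the threshold),
    negative values lie below it.
    """
    return percent_identity - hssp_curve(length)


def infer_same_localization(percent_identity: float, length: int,
                            cutoff: float = 0.0) -> bool:
    """Hypothetical decision rule: transfer localization by homology when
    the HSSP distance reaches the cutoff. The cutoff value is a placeholder."""
    return hssp_distance(percent_identity, length) >= cutoff


if __name__ == "__main__":
    # Same percentage identity, different alignment lengths.
    for length in (30, 100, 300):
        print(length, round(hssp_distance(35.0, length), 1))
```

Because the curve falls with alignment length, the same percentage identity yields a larger distance for a long alignment than for a short one, which is how the measure accounts for both quantities at once.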