Doğan Tunca, MacDougall Alistair, Saidi Rabie, Poggioli Diego, Bateman Alex, O'Donovan Claire, Martin Maria J
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK.
Bioinformatics. 2016 Aug 1;32(15):2264-71. doi: 10.1093/bioinformatics/btw114. Epub 2016 Mar 7.
Similarity-based methods have been widely used to infer the properties of genes and gene products that have little or no experimental annotation. New approaches that overcome the limitations of methods relying solely on sequence similarity are attracting increased attention. One of these novel approaches is to use the organization of the structural domains in proteins.
We propose a method for the automatic annotation of protein sequences in the UniProt Knowledgebase (UniProtKB) that compares their domain architectures, classifies proteins based on the similarities and propagates functional annotation. The performance of this method was measured through a cross-validation analysis using the Gene Ontology (GO) annotation of a subset of UniProtKB/Swiss-Prot. The results demonstrate the effectiveness of this approach in detecting functional similarity, with an average F-score of 0.85. Applying the method to nearly 55.3 million uncharacterized proteins in UniProtKB/TrEMBL resulted in 44 818 178 GO term predictions for 12 172 114 proteins. Of these predictions, 22% were for 2 812 016 previously non-annotated protein entries, indicating the significance of the value added by this approach.
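The general idea of comparing domain architectures, propagating GO terms and scoring the predictions can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the alignment or classification procedure, so the sketch substitutes the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The function names, the `threshold` value and the example Pfam/GO identifiers are all assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher


def architecture_similarity(a, b):
    """Similarity in [0, 1] between two ordered lists of domain identifiers.

    Stand-in measure (assumption): difflib's ratio, not the paper's alignment.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def propagate_go_terms(query, annotated, threshold=0.8):
    """Collect GO terms from annotated proteins with a similar architecture.

    `annotated` is a list of (architecture, go_terms) pairs; `threshold`
    is a hypothetical cut-off, not a value taken from the paper.
    """
    predicted = set()
    for architecture, go_terms in annotated:
        if architecture_similarity(query, architecture) >= threshold:
            predicted |= go_terms
    return predicted


def f_score(predicted, actual):
    """Harmonic mean of precision and recall for two sets of GO terms."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: architectures as ordered Pfam accessions.
annotated = [
    (["PF00069", "PF00433"], {"GO:0004672", "GO:0005524"}),
    (["PF00096"], {"GO:0003677"}),
]
query = ["PF00069", "PF00433"]

predicted = propagate_go_terms(query, annotated)
print(sorted(predicted))
print(f_score(predicted, {"GO:0004672"}))
```

In a cross-validation setting such as the one described above, the known GO terms of a held-out protein would play the role of `actual`, and the F-score would be averaged over all held-out proteins.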
The results of the method are available at: ftp://ftp.ebi.ac.uk/pub/contrib/martin/DAAC/
Contact: tdogan@ebi.ac.uk
Supplementary data are available at Bioinformatics online.