Alborzi Seyed Ziaeddin, Devignes Marie-Dominique, Ritchie David W
Université de Lorraine, LORIA, UMR 7503, Vandœuvre-lès-Nancy, 54506, France.
Inria Nancy Grand-Est, Villers-lès-Nancy, 54600, France.
BMC Bioinformatics. 2017 Feb 13;18(1):107. doi: 10.1186/s12859-017-1519-x.
Many entries in the Protein Data Bank (PDB) are annotated to show their component protein domains according to the Pfam classification, as well as their biological function through the Enzyme Commission (EC) numbering scheme. However, although the biological activity of many proteins arises from specific domain-domain and domain-ligand interactions, current online resources rarely provide a direct mapping from structure to function at the domain level. Since the PDB now contains many tens of thousands of protein chains, and since protein sequence databases can dwarf such numbers by orders of magnitude, there is a pressing need to develop automatic structure-function annotation tools that can operate at the domain level.
This article presents ECDomainMiner, a novel content-based filtering approach to automatically infer associations between EC numbers and Pfam domains. ECDomainMiner finds a total of 20,728 non-redundant EC-Pfam associations with an F-measure of 0.95 with respect to a "Gold Standard" test set extracted from InterPro. This represents a 13-fold increase over the 1,515 manually curated EC-Pfam associations in InterPro.
These EC-Pfam associations could be used to annotate some 58,722 protein chains in the PDB which currently lack any EC annotation. The ECDomainMiner database is publicly available at http://ecdm.loria.fr/ .
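As an illustration of the F-measure quoted above, the following minimal sketch scores a set of predicted EC-Pfam pairs against a gold-standard set. It assumes the standard definition (harmonic mean of precision and recall over pairs); the abstract does not describe how ECDomainMiner's evaluation was actually set up, so the function name `f_measure` and the toy pairs are hypothetical and are not taken from the ECDomainMiner results.

```python
from typing import Set, Tuple

# An association is an (EC number, Pfam accession) pair.
Association = Tuple[str, str]


def f_measure(predicted: Set[Association], gold: Set[Association]) -> float:
    """Harmonic mean of precision and recall of predicted pairs vs. gold pairs."""
    true_pos = len(predicted & gold)
    if true_pos == 0:
        # Also covers empty inputs, avoiding division by zero.
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only (assumed, not from the paper).
gold = {
    ("1.1.1.1", "PF00107"),
    ("1.1.1.1", "PF08240"),
    ("3.4.21.4", "PF00089"),
}
predicted = {
    ("1.1.1.1", "PF00107"),
    ("3.4.21.4", "PF00089"),
    ("2.7.11.1", "PF00069"),
}

score = f_measure(predicted, gold)
print(f"F-measure: {score:.3f}")
```

In this toy case two of the three predicted pairs are in the gold set and two of the three gold pairs are recovered, so precision and recall are both 2/3.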