Amidi Shervine, Amidi Afshine, Vlachakis Dimitrios, Paragios Nikos, Zacharaki Evangelia I
Department of Applied Mathematics, Center for Visual Computing, Ecole Centrale de Paris (CentraleSupélec), Châtenay-Malabry, France.
MDAKM Group, Department of Computer Engineering and Informatics, University of Patras, Patras, Greece.
PeerJ. 2017 Mar 29;5:e3095. doi: 10.7717/peerj.3095. eCollection 2017.
The number of protein structures in the PDB database has increased more than 15-fold since 1999. Computational models that predict enzymatic function are of major importance because they provide the means to better understand how newly discovered enzymes behave when catalyzing chemical reactions. Until now, enzymatic function has mostly been predicted with single-label classification, which limits the application to enzymes performing a single reaction and introduces errors when multi-functional enzymes are examined. Indeed, some enzymes catalyze several different reactions and can hence be directly associated with multiple enzymatic functions. In the present work, we propose a multi-label enzymatic function classification scheme that combines structural and amino acid sequence information. We investigate two fusion approaches (at the feature level and at the decision level) and assess the methodology for general enzymatic function prediction, indicated by the first digit of the enzyme commission (EC) code (six main classes), on 40,034 enzymes from the PDB database. The proposed single-label and multi-label models correctly predict the actual functional activities in 97.8% and 95.5% (based on Hamming loss) of the cases, respectively. In addition, the multi-label model predicts all possible enzymatic reactions in 85.4% of the multi-labeled enzymes when the number of reactions is unknown. Code and datasets are available at https://figshare.com/s/a63e0bafa9b71fc7cbd7.
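The abstract reports one multi-label score based on Hamming loss and another that counts enzymes for which every reaction is predicted. The sketch below shows how two such measures are commonly computed over six-dimensional binary label vectors, one slot per main EC class. It uses the standard textbook definitions, which are assumed here and may differ in detail from the paper's evaluation protocol. The function names and the toy labels are hypothetical and are not taken from the published code or data.

```python
from typing import List

LabelMatrix = List[List[int]]  # one row per enzyme, one 0/1 entry per EC class


def hamming_accuracy(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Return 1 - Hamming loss: the share of (enzyme, class) slots predicted correctly."""
    total = 0
    correct = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            correct += int(t == p)
    return correct / total


def exact_match_ratio(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Return the share of enzymes whose full label set is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


# Toy labels (made up for illustration): 4 enzymes, 6 main EC classes.
y_true = [
    [1, 0, 0, 0, 0, 0],  # single-function enzyme, EC 1
    [0, 1, 1, 0, 0, 0],  # multi-function enzyme, EC 2 and EC 3
    [0, 0, 0, 1, 0, 0],  # single-function enzyme, EC 4
    [1, 0, 0, 0, 0, 1],  # multi-function enzyme, EC 1 and EC 6
]
y_pred = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],  # misses EC 3
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1],
]

print(f"Hamming-based accuracy: {hamming_accuracy(y_true, y_pred):.4f}")
print(f"Exact-match ratio:      {exact_match_ratio(y_true, y_pred):.4f}")
```

In this toy case a single missed label lowers the Hamming-based score only slightly (23 of 24 slots are correct) but costs a full enzyme in the exact-match ratio (3 of 4), which illustrates why a score that requires all reactions to be predicted can sit well below a Hamming-based one.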