College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People's Republic of China.
BMC Bioinformatics. 2024 Jan 30;25(1):50. doi: 10.1186/s12859-024-05665-1.
Enzymes play an irreplaceable role in sustaining living organisms. The Enzyme Commission (EC) number of an enzyme indicates its essential functions. Correctly identifying the first digit (family class) of the EC number for a given enzyme has been a hot topic over the past twenty years. Several previous methods adopted functional domain composition to represent enzymes. However, this representation leads to the curse of dimensionality, which reduces the efficiency of those methods. In addition, most previous methods can only handle enzymes belonging to a single family class, although several enzymes in fact belong to two or more family classes.
In this study, a fast and efficient multi-label classifier, named PredictEFC, was designed. To construct this classifier, a novel feature extraction scheme was designed for processing the functional domain information of enzymes; it counts the distribution of each functional domain entry across the seven family classes in the training dataset. Based on this scheme, each training or test enzyme was encoded into a 7-dimensional vector by fusing its functional domain information with the above statistical results. Random k-labelsets (RAKEL) was adopted to build the classifier, with random forest selected as the base classification algorithm. Two tenfold cross-validation runs on the training dataset showed that the accuracy of PredictEFC can reach 0.8493 and 0.8370. Independent tests on two datasets yielded accuracy values of 0.9118 and 0.8777.
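The feature extraction idea can be illustrated with a minimal sketch. The abstract does not specify how the domain information and the class statistics are fused, so the sketch below assumes a simple choice: each domain's counts are normalized into a distribution over the seven classes, and an enzyme is encoded as the average of the distributions of its known domains. The function names, the domain identifiers ("D1", "D2", "D3"), and the averaging rule are hypothetical and are not taken from PredictEFC itself.

```python
from collections import defaultdict

NUM_CLASSES = 7  # the seven EC family classes, indexed 0..6 here


def domain_class_distribution(training_set):
    """Count how often each domain entry occurs in each family class.

    training_set: iterable of (domains, classes) pairs, where `domains` is a
    set of domain identifiers and `classes` is a set of class indices 0..6
    (an enzyme may carry more than one class label).
    Returns a dict mapping each domain to a 7-element list that sums to 1.
    """
    counts = defaultdict(lambda: [0] * NUM_CLASSES)
    for domains, classes in training_set:
        for domain in domains:
            for cls in classes:
                counts[domain][cls] += 1
    # Normalize raw counts into a per-domain distribution over classes.
    return {
        domain: [c / sum(row) for c in row]
        for domain, row in counts.items()
    }


def encode(domains, dist):
    """Encode an enzyme as a 7-dimensional vector.

    Assumption (not from the source): average the class distributions of the
    enzyme's domains that were seen in training; unseen domains are ignored,
    and an enzyme with no known domain maps to the zero vector.
    """
    known = [dist[d] for d in domains if d in dist]
    if not known:
        return [0.0] * NUM_CLASSES
    return [sum(column) / len(known) for column in zip(*known)]


# Toy usage with made-up domain identifiers.
training = [({"D1", "D2"}, {0}), ({"D2"}, {1, 2}), ({"D3"}, {2})]
dist = domain_class_distribution(training)
vector = encode({"D1", "D2"}, dist)
print(len(vector))
```

Under these assumptions the vector length stays at seven no matter how many distinct domain entries exist, which is the property that avoids the high-dimensional domain composition representation. The resulting vectors would then be passed to a multi-label learner such as RAKEL with a random forest base classifier.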
The performance of PredictEFC was slightly lower than that of the classifier directly using functional domain composition. However, its efficiency was sharply improved: the running time was less than one-tenth of that of the classifier directly using functional domain composition. In addition, PredictEFC was superior to classifiers using traditional dimensionality reduction methods and to some previous methods, and it can be adapted to predict enzyme family classes for other species. Finally, a web server was set up at http://124.221.158.221/ for easy use.