Cao Jiuwen, Xiong Lianglin
Institute of Information and Control, Hangzhou Dianzi University, Zhejiang 310018, China.
School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming 650500, China; School of Mathematics and Statistics, Yunnan University, Kunming 650091, China.
Biomed Res Int. 2014;2014:103054. doi: 10.1155/2014/103054. Epub 2014 Mar 30.
Precisely classifying a protein sequence against a large biological protein sequence database plays an important role in developing competitive pharmacological products. Conventional methods compare the unseen sequence with every identified protein sequence and return the category index of the protein with the highest similarity score, which is usually time-consuming. It is therefore necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single hidden layer feedforward networks (SLFNs). The recent, efficient extreme learning machine (ELM) and its variants are used as the training algorithms. The optimal pruned ELM (OP-ELM) is employed for protein sequence classification for the first time in this paper. To further enhance performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function serve as ensemble members. The same training algorithm is adopted for every member. The final category index is derived by majority voting. Two approaches, namely the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms.
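As an illustration of the ensemble idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It covers only the basic ELM (OP-ELM pruning is not shown). It assumes sequences have already been converted to fixed-length numeric feature vectors, that members use a sigmoid activation, and that members differ only in their randomly drawn hidden-layer weights. The names `BasicELM`, `ensemble_predict`, and `make_toy_data`, and the synthetic Gaussian data, are hypothetical and stand in for real protein features.

```python
import numpy as np


class BasicELM:
    """SLFN trained with the basic ELM rule (sigmoid activation assumed)."""

    def __init__(self, n_hidden, rng):
        self.n_hidden = n_hidden
        self.rng = rng

    def _hidden(self, X):
        # Output of the random hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y, n_classes):
        # Hidden weights and biases are drawn at random and never tuned.
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(n_classes)[y]  # one-hot class targets
        # Output weights: least-squares solution via the pseudoinverse of H.
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)


def ensemble_predict(models, X, n_classes):
    """Majority vote over member predictions (ties go to the lowest index)."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return np.argmax(counts, axis=0)


def make_toy_data(rng, n_per_class=60):
    """Hypothetical stand-in for protein features: three Gaussian blobs."""
    centers = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])
    X = np.vstack([c + 0.5 * rng.standard_normal((n_per_class, 2))
                   for c in centers])
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return X, y


rng = np.random.default_rng(0)
X, y = make_toy_data(rng)

# Every member shares the hidden-node count, activation, and training rule.
models = [BasicELM(n_hidden=20, rng=rng).fit(X, y, n_classes=3)
          for _ in range(5)]
pred = ensemble_predict(models, X, n_classes=3)
print("training accuracy:", (pred == y).mean())
```

An odd number of members is used here so that two-way ties are less likely; the tie-breaking rule in `ensemble_predict` is a choice made for this sketch, not something stated in the abstract.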