Ru Xiaoqing, Li Lihong, Wang Chunyu
School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China.
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
Front Microbiol. 2019 Mar 26;10:507. doi: 10.3389/fmicb.2019.00507. eCollection 2019.
The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of bacteriophage virion proteins is the main area of interest, so it is very important to distinguish bacteriophage virion proteins from non-phage virion proteins accurately. Extracting comprehensive and effective sequence features from proteins plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by a feature representation algorithm based on sequence information (CCPA) with those extracted by a feature representation algorithm based on sequence and structure information. After feature extraction, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, that is, the features with the strongest correlation with the class labels and low redundancy among themselves. Because the random forest algorithm randomizes both the samples it selects and the features considered when producing each node variable, a random forest classifier is employed for bacteriophage protein classification and evaluated with 10-fold cross-validation. The model reaches an accuracy of 93.5% in classifying phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
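The feature selection step can be illustrated with a minimal sketch. It is not the authors' MRMD implementation: it assumes a simplified score in which relevance is the absolute Pearson correlation between a feature and the class label, and distance is the mean Euclidean distance between the z-scored feature and every other feature, scaled by the largest such mean. The function names (`pearson`, `zscore`, `mrmd_style_rank`) and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def zscore(x):
    """Standardise a column so distances do not depend on feature scale."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / std for v in x] if std > 0 else [0.0] * n


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mrmd_style_rank(columns, labels):
    """Rank feature indices by relevance + scaled mean distance (assumed scoring).

    Expects at least two feature columns.
    """
    z = [zscore(c) for c in columns]
    relevance = [abs(pearson(c, labels)) for c in columns]
    mean_dist = []
    for i, zi in enumerate(z):
        others = [euclidean(zi, zj) for j, zj in enumerate(z) if j != i]
        mean_dist.append(sum(others) / len(others))
    top = max(mean_dist) or 1.0  # avoid division by zero if all columns coincide
    scores = [r + d / top for r, d in zip(relevance, mean_dist)]
    return sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)


# Toy data (hypothetical): 20 samples, 10 per class.
labels = [0] * 10 + [1] * 10
informative = [labels[i] + 0.1 * ((i % 3) - 1) for i in range(20)]  # tracks the label
near_copy = [informative[i] + 0.05 * (i % 2) for i in range(20)]    # redundant with feature 0
alternating = [float(i % 2) for i in range(20)]                     # uncorrelated with the label

ranking = mrmd_style_rank([informative, near_copy, alternating], labels)
print(ranking)  # feature 2 (no label correlation) ranks last
```

In a pipeline like the one described, the top-ranked features would then be passed to a random forest classifier and assessed with 10-fold cross-validation. The equal weighting of the two terms here is an illustrative choice, not a value taken from the paper.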