Liu Xinyi, Shen Yueyue, Zhang Youhua, Liu Fei, Ma Zhiyu, Yue Zhenyu, Yue Yi
School of Information and Computer, Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, Anhui, China.
PeerJ. 2021 Aug 6;9:e11900. doi: 10.7717/peerj.11900. eCollection 2021.
A moonlighting protein is a protein that can perform two or more functions. Current moonlighting protein prediction tools focus mainly on proteins from animals and microorganisms, and the cells and proteins of animals and plants differ, so existing tools may predict plant moonlighting proteins inaccurately. Hence, a benchmark data set and a prediction tool specific to plant moonlighting proteins are needed.
This study used protein feature classes from a data set constructed in house to develop a web-based prediction tool. First, we built a plant protein data set and reduced sequence redundancy. We then performed feature selection, feature normalization and feature dimensionality reduction on the training data. Next, we used machine learning methods for preliminary modeling to select the feature classes that performed best in plant moonlighting protein prediction. The selected feature classes were incorporated into the final prediction tool. Finally, we compared five machine learning methods, optimized their parameters by grid search, and chose the most suitable method as the final model.
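The normalization and grid-search steps described above can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the toy two-feature data, the k-nearest-neighbour classifier (a stand-in, not one of the five methods compared in the study), the parameter grid, and the use of cross-validated accuracy as the selection score.

```python
from itertools import product

def fit_min_max(rows):
    """Learn per-feature minimum and range from the training rows only."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    return lows, spans

def apply_min_max(rows, lows, spans):
    """Scale rows with previously learned minimum and range values."""
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]

def knn_predict(train_x, train_y, query, k, p):
    """Majority vote of the k nearest neighbours (Minkowski distance of order p)."""
    nearest = sorted(
        (sum(abs(a - b) ** p for a, b in zip(x, query)) ** (1 / p), y)
        for x, y in zip(train_x, train_y)
    )[:k]
    votes = sum(y for _, y in nearest)
    return int(2 * votes > k)

def cv_accuracy(x, y, k, p, folds=4):
    """Mean accuracy over interleaved folds; scaling is fitted on each training fold."""
    fold_scores = []
    for f in range(folds):
        train = [i for i in range(len(x)) if i % folds != f]
        test = [i for i in range(len(x)) if i % folds == f]
        lows, spans = fit_min_max([x[i] for i in train])
        train_x = apply_min_max([x[i] for i in train], lows, spans)
        test_x = apply_min_max([x[i] for i in test], lows, spans)
        train_y = [y[i] for i in train]
        hits = sum(knn_predict(train_x, train_y, q, k, p) == y[i]
                   for q, i in zip(test_x, test))
        fold_scores.append(hits / len(test))
    return sum(fold_scores) / folds

def grid_search(x, y, grid):
    """Score every parameter combination and return the best one with its score."""
    names = sorted(grid)
    candidates = [dict(zip(names, combo))
                  for combo in product(*(grid[n] for n in names))]
    scored = [(cv_accuracy(x, y, **params), params) for params in candidates]
    best_score, best_params = max(scored, key=lambda item: item[0])
    return best_params, best_score

# Hypothetical toy data: two features on very different scales, binary labels.
X = [[0.90, 150], [0.20, 40], [0.85, 140], [0.25, 35], [0.80, 160], [0.30, 50],
     [0.95, 155], [0.15, 45], [0.75, 145], [0.35, 30], [0.88, 135], [0.22, 55]]
Y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

best_params, best_score = grid_search(X, Y, {"k": [1, 3, 5], "p": [1, 2]})
print(best_params, best_score)
```

Fitting the scaling parameters inside each training fold, rather than on the full data, keeps information from the held-out fold out of the preprocessing step. For an imbalanced task such as this one, a ranking-based score would likely be a better selection criterion than accuracy, which is used here only for brevity.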
The prediction results indicated that eXtreme Gradient Boosting (XGBoost) performed best, so it was used as the algorithm to construct the prediction tool, called IdentPMP (Identification of Plant Moonlighting Proteins). The results on the independent test set show that the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUC) of IdentPMP are 0.43 and 0.68, which are 19.44% (0.43 vs. 0.36) and 13.33% (0.68 vs. 0.60) higher than those of state-of-the-art non-plant-specific methods, respectively. This further demonstrates that a benchmark data set and a plant-specific prediction tool are required for plant moonlighting protein studies. Finally, we implemented the tool as a web application, which users can access freely at http://identpmp.aielab.net/.
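The reported percentages are relative improvements, (new - baseline) / baseline. The sketch below reproduces that arithmetic and shows one common way to compute the two metrics from labels and scores. The function names and the six-item example are illustrative assumptions, average precision is only one of several estimators of AUPRC, and the abstract does not state which estimator the authors used.

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve; assumes distinct scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recovered positive
    return total / hits

def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

# Figures quoted in the abstract.
auprc_gain = round(relative_gain(0.43, 0.36), 2)  # 19.44
auc_gain = round(relative_gain(0.68, 0.60), 2)    # 13.33

# Hypothetical labels and scores, used only to exercise the metric functions.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
print(auprc_gain, auc_gain, toy_auc, toy_ap)
```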