Meher Prabina Kumar, Dash Sagarika, Sahu Tanmaya Kumar, Satpathy Subhrajit, Pradhan Sukanta Kumar
ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India.
Division of Statistical Genetics, ICAR-IASRI, New Delhi-12, India.
Physiol Mol Biol Plants. 2022 Jan;28(1):1-16. doi: 10.1007/s12298-022-01130-6. Epub 2022 Jan 24.
In plants, the GIGANTEA (GI) protein performs diverse biological functions, including carbon and sucrose metabolism, cell wall deposition, transpiration and hypocotyl elongation, which makes GI an important class of proteins. So far, GI proteins have been identified mostly through resource-intensive experimental methods. In this study, we therefore attempted to develop a computational model for fast and accurate prediction of GI proteins. Ten supervised learning algorithms (SVM, RF, JRIP, J48, LMT, IBK, NB, PART, BAGG and LGB) were employed for prediction, with amino acid composition (AAC), FASGAI features and physico-chemical (PHYC) properties used as numerical inputs to the learning algorithms. Under five-fold cross validation, higher performance, with an AUC-ROC of 96.75% and an AUC-PR of 86.7%, was observed for SVM coupled with the AAC + PHYC feature combination. With leave-one-out cross validation, the AUC-ROC and AUC-PR were 97.29% and 87.89%, respectively. When the model was evaluated on an independent dataset of 18 GI sequences, 17 were correctly predicted. We also performed proteome-wide identification of GI proteins in wheat, followed by functional annotation using Gene Ontology terms. A prediction server, "GIpred", is freely accessible at http://cabgrid.res.in:8080/gipred/ for proteome-wide recognition of GI proteins.
The online version contains supplementary material available at 10.1007/s12298-022-01130-6.
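As an illustration of the amino acid composition (AAC) input mentioned in the abstract, the following is a minimal Python sketch. It assumes the conventional definition of AAC as the relative frequency of each of the 20 standard residues; the abstract does not spell out the exact encoding used by GIpred. The function name `amino_acid_composition` and the toy sequence are hypothetical and are not taken from the paper or its server.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every feature vector aligns.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list of residue fractions for a protein sequence.

    Assumption: non-standard symbols (e.g. X, B, Z) are ignored rather than
    counted, so the fractions always sum to 1.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, used only to show the shape of the output.
example = "MASSSSSERW"
features = amino_acid_composition(example)
```

A vector of this kind, one per sequence, could be concatenated with physico-chemical descriptors and passed to a classifier such as an SVM. The actual feature pipeline of the published model may differ.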