College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China.
College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China.
Anal Biochem. 2020 Nov 15;609:113903. doi: 10.1016/j.ab.2020.113903. Epub 2020 Aug 15.
Lysine crotonylation is an important protein post-translational modification that plays a key role in chromosome organization and nucleic acid metabolism. Identifying crotonylation sites is essential for understanding the function and mechanism of proteins. Traditional experimental methods are time-consuming and expensive, and cannot identify crotonylation sites quickly at large scale. Therefore, this paper proposes a novel crotonylation site prediction method called LightGBM-CroSite. First, binary encoding (BE), position weight amino acid composition (PWAA), encoding based on grouped weight (EBGW), k nearest neighbors (KNN), and pseudo-position specific scoring matrix (PsePSSM) are used to extract features from protein sequences and obtain the original feature space. Second, the elastic net is used to remove redundant information and select the optimal feature subset. Third, the synthetic minority oversampling technique (SMOTE) is used to balance the samples. Finally, the balanced feature vectors are input into LightGBM to predict the crotonylation sites. In the jackknife test, the accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) are 98.99%, 0.9798, and 0.9996, respectively. Compared with other state-of-the-art methods, LightGBM-CroSite shows better performance in crotonylation site prediction. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/LightGBM-CroSite/.
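As an illustration of two pieces named in the abstract, the sketch below shows one common form of binary encoding (BE) for a peptide window and how ACC and MCC are computed from binary predictions. It is not the authors' code. The 20-letter alphabet, the all-zero treatment of unknown or padding residues, and the function names `binary_encode` and `acc_mcc` are assumptions made for this example; the paper's exact encoding may differ.

```python
import math

# Assumption: the 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_encode(window):
    """One-hot encode a peptide window (hypothetical helper).

    Each residue becomes a 20-dimensional 0/1 vector. Residues outside
    the alphabet (e.g. a padding symbol such as 'X') become all zeros.
    """
    features = []
    for residue in window:
        row = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            row[index] = 1
        features.extend(row)
    return features


def acc_mcc(y_true, y_pred):
    """Return (accuracy, Matthews correlation coefficient) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


if __name__ == "__main__":
    # Toy window centred on a lysine (K); the sequence is made up.
    vector = binary_encode("AGKLX")
    print(len(vector), sum(vector))
    print(acc_mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

In the pipeline described above, vectors like these would be concatenated with the other feature types before elastic net selection, SMOTE balancing, and LightGBM training.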