Prediction of protein crotonylation sites through LightGBM classifier based on SMOTE and elastic net.

Affiliations

College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China.

College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China.

Publication information

Anal Biochem. 2020 Nov 15;609:113903. doi: 10.1016/j.ab.2020.113903. Epub 2020 Aug 15.

Abstract

Lysine crotonylation is an important protein post-translational modification that plays a key role in chromosome organization and nucleic acid metabolism. Identifying crotonylation sites is important for understanding the function and mechanism of proteins. Traditional experimental methods are time-consuming and expensive, and cannot identify crotonylation sites quickly and accurately. This paper therefore proposes a novel crotonylation site prediction method called LightGBM-CroSite. First, binary encoding (BE), position weight amino acid composition (PWAA), encoding based on grouped weight (EBGW), k-nearest neighbors (KNN), and pseudo-position specific scoring matrix (PsePSSM) are used to extract features from protein sequences and build the original feature space. Second, the elastic net is used to remove redundant information and select the optimal feature subset. Third, the synthetic minority oversampling technique (SMOTE) is used to balance the samples. Finally, the balanced feature vectors are input into LightGBM to predict the crotonylation sites. According to the jackknife test, the accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) are 98.99%, 0.9798, and 0.9996, respectively. Compared with other state-of-the-art methods, our method achieves better performance on crotonylation site prediction. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/LightGBM-CroSite/.
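Three of the building blocks named in the abstract (binary encoding, SMOTE, and MCC) can be illustrated with a minimal standard-library Python sketch. This is not the authors' implementation, which is in the linked repository. The function names, the 20-letter residue alphabet, the all-zero encoding for padding symbols, and the default `k=5` are assumptions made for illustration only.

```python
import math
import random

# Assumption: the 20 standard amino acids; any other symbol (e.g. padding "X")
# is encoded as an all-zero block. The paper's exact alphabet may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_encode(window):
    """One-hot encode each residue of a peptide window into a flat 0/1 list."""
    features = []
    for residue in window:
        one_hot = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:
            one_hot[idx] = 1
        features.extend(one_hot)
    return features


def smote_oversample(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples, SMOTE style.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical 7-residue window centred on a lysine (K).
    print(len(binary_encode("AGRKSTX")))
    print(smote_oversample([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], n_new=2, k=2))
```

In a full pipeline of the kind the abstract describes, the synthetic samples would be added to the selected feature vectors before training the classifier, and MCC would be computed from the jackknife predictions.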
