School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.
Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, VIC 3800, Australia.
Brief Bioinform. 2019 Nov 27;20(6):2185-2199. doi: 10.1093/bib/bby079.
As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of the different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed method, Light Gradient Boosting Machine (LightGBM), are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integrating the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. Compared with the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
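To make the workflow summarized above more concrete, the following is a minimal, illustrative sketch and not the kmal-sp implementation. It assumes a simple binary (one-hot) encoding of a fixed-width window centered on each lysine, an unweighted average of per-model scores as the ensemble rule, and a rank-based AUC estimate. The abstract does not name the 11 encodings or the ensemble rule, so the encoding choice, the averaging scheme, the padding symbol and all function names here are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed padding symbol for windows that run past the termini


def lysine_windows(sequence: str, flank: int = 3) -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every lysine (K) in the sequence.

    Each window holds `flank` residues on either side of the central lysine.
    """
    padded = PAD * flank + sequence + PAD * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # sequence[i] sits at padded[i + flank], so this slice is centered on it
            windows.append((i + 1, padded[i : i + 2 * flank + 1]))
    return windows


def one_hot(window: str) -> List[int]:
    """Binary-encode a window: 20 indicator values per residue.

    The padding symbol (and any other non-standard letter) maps to all zeros.
    """
    vector: List[int] = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def soft_vote(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Average the scores of several models, sample by sample (assumed rule)."""
    return [sum(scores) / len(scores) for scores in zip(*score_lists)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve as the fraction of correctly ranked
    positive/negative pairs, counting ties as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy sequence and made-up scores, purely to exercise the functions.
    for position, window in lysine_windows("MKTAYIAKQR", flank=3):
        print(position, window, sum(one_hot(window)))
    labels = [1, 0, 1, 0]
    model_a = [0.9, 0.2, 0.6, 0.4]
    model_b = [0.7, 0.4, 0.8, 0.1]
    print(auc(labels, soft_vote([model_a, model_b])))
```

In practice, the encoded windows would be passed to the trained classifiers, and each species-specific model would be evaluated separately. The toy scores above stand in for classifier outputs and carry no biological meaning.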