Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework.

Affiliations

School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.

Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, VIC 3800, Australia.

Publication information

Brief Bioinform. 2019 Nov 27;20(6):2185-2199. doi: 10.1093/bib/bby079.

DOI:10.1093/bib/bby079

PMID:30351377

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6954445/

Abstract

As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
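
The workflow summarized in the abstract (encode the residues around each lysine, split the data for randomized 10-fold cross-validation, combine single-method models into an ensemble, and score by AUC) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not name the 11 encodings, so a simple binary (one-hot) encoding stands in for them; score averaging ("soft voting") is assumed as one possible way to combine models; and every function name below is hypothetical. Plain Python replaces the actual learners (random forest, SVM, KNN, logistic regression, LightGBM), which would supply the per-model scores.

```python
import random
from statistics import mean

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def lysine_windows(sequence, flank=3, pad="X"):
    """Yield (1-based position, window) for every lysine, padding the sequence ends."""
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "K":
            yield i + 1, padded[i : i + 2 * flank + 1]


def one_hot(window):
    """Binary encoding (assumed scheme): 20 indicators per residue; padding is all zeros."""
    return [1 if residue == aa else 0 for residue in window for aa in AMINO_ACIDS]


def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k near-equal test folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def soft_vote(per_model_scores):
    """Average the scores that several models assign to the same samples."""
    return [mean(scores) for scores in zip(*per_model_scores)]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `dict(lysine_windows("MKTAYKLV", flank=2))` returns `{2: "XMKTA", 6: "AYKLV"}`, and each five-residue window encodes to a 100-element vector. In a real pipeline, each fold from `kfold_indices` would serve once as the held-out set, and the averaged scores from `soft_vote` would be passed to `auc` alongside the true labels.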
