SAMP: Identifying antimicrobial peptides by an ensemble learning model based on proportionalized split amino acid composition.

Authors

Feng Junxi, Sun Mengtao, Liu Cong, Zhang Weiwei, Xu Changmou, Wang Jieqiong, Wang Guangshun, Wan Shibiao

Affiliations

Department of Biostatistics, School of Public Health, Harvard University, Boston, MA 02115, United States.

Department of Genetics, Cell Biology and Anatomy, College of Medicine, University of Nebraska Medical Center, Omaha, NE 68198, United States.

Publication information

Brief Funct Genomics. 2024 Dec 6;23(6):879-890. doi: 10.1093/bfgp/elae046.

Abstract

It is projected that 10 million deaths could be attributed to drug-resistant bacterial infections in 2050. Identifying new-generation antibiotics is one effective way to address this concern. Antimicrobial peptides (AMPs), a class of innate immune effectors, have received significant attention for their capacity to eliminate drug-resistant pathogens, including viruses, bacteria, and fungi. Recent years have witnessed widespread application of computational methods, especially machine learning (ML) and deep learning (DL), to the discovery of AMPs. However, existing methods rely only on features such as the compositional, physicochemical, and structural properties of peptides, which cannot fully capture the sequence information in AMPs. Here, we present SAMP, an ensemble random projection (RP)-based computational model that leverages a new type of feature, proportionalized split amino acid composition (PSAAC), in addition to conventional sequence-based features for AMP prediction. With this new feature set, SAMP captures residue patterns, such as sorting signals, at both the N-terminus and the C-terminus, while also retaining the sequence order information from the middle peptide fragments. Benchmarking tests on different balanced and imbalanced datasets demonstrate that SAMP consistently outperforms existing state-of-the-art methods, such as iAMPpred and AMPScanner V2, in terms of accuracy, Matthews correlation coefficient (MCC), G-measure, and F1-score. In addition, by leveraging an ensemble RP architecture, SAMP scales to large-scale AMP identification and further improves performance compared with models without RP. To facilitate the use of SAMP, we have developed a Python package that is freely available at https://github.com/wan-mlab/SAMP.
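The abstract does not spell out how PSAAC is computed, so the following is only a minimal sketch of the general idea of a split amino acid composition. It cuts a peptide into N-terminal, middle, and C-terminal segments whose lengths are proportional to the peptide length, then concatenates the 20-dimensional amino acid composition of each segment. The function names `composition` and `psaac`, the 25%/50%/25% split, and the handling of non-standard residues are all assumptions made for illustration. None of this is taken from the SAMP package.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines feature positions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Return the fraction of each standard amino acid in one segment."""
    if not segment:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(segment)
    # Non-standard residues count toward the segment length but get no feature.
    return [counts[aa] / len(segment) for aa in AMINO_ACIDS]


def psaac(sequence, n_frac=0.25, c_frac=0.25):
    """Hypothetical split composition: N-terminal + middle + C-terminal.

    n_frac and c_frac are assumed proportions of the peptide length;
    the values actually used by SAMP are not given in the abstract.
    """
    if n_frac < 0 or c_frac < 0 or n_frac + c_frac > 1:
        raise ValueError("n_frac and c_frac must be >= 0 and sum to at most 1")
    seq = sequence.upper()
    n_len = int(len(seq) * n_frac)  # segment lengths scale with peptide length
    c_len = int(len(seq) * c_frac)
    n_term = seq[:n_len]
    middle = seq[n_len:len(seq) - c_len]
    c_term = seq[len(seq) - c_len:]
    return composition(n_term) + composition(middle) + composition(c_term)


features = psaac("GIGKFLHSAKKFGKAFVGEIMNS")  # an example peptide string
print(len(features))  # 60 values: 3 segments x 20 amino acids
```

Because the segment lengths scale with the peptide, short and long peptides both yield a fixed-length vector that keeps terminal residue patterns separate from the middle fragment. A vector of this kind could then be combined with other sequence-based features and passed to a downstream classifier.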
