Bizzotto Edoardo, Zampieri Guido, Treu Laura, Filannino Pasquale, Di Cagno Raffaella, Campanaro Stefano
Department of Biology, University of Padua, Via U. Bassi 58/b, Padova 35131, Italy.
Department of Soil, Plant and Food Science, University of Bari Aldo Moro, Via G. Amendola 165/a, Bari 70126, Italy.
Comput Struct Biotechnol J. 2024 May 23;23:2442-2452. doi: 10.1016/j.csbj.2024.05.040. eCollection 2024 Dec.
Bioactive peptides are short amino acid chains that possess biological activity and exert physiological effects relevant to human health. Despite their therapeutic value, their identification remains a major problem because it relies mainly on time-consuming in vitro tests. Bioinformatic tools for identifying bioactive peptides are available, but they focus on specific functional classes and have not been systematically tested in realistic settings. To tackle this problem, bioactive peptide sequences and functions were gathered from a variety of databases to generate a unified collection of bioactive peptides from microbial fermentation. This collection was organized into nine functional classes, some previously studied and some unexplored, such as immunomodulatory, opioid and cardiovascular peptides. After assessing their sequence properties, four alternative encoding methods were tested in combination with a range of machine learning algorithms, from basic classifiers such as logistic regression to advanced algorithms such as BERT. Tests on a total of 171 models showed that, while some functions are intrinsically easier to detect, no single combination of classifier and encoder worked well for all classes. For this reason, we combined the best individual model for each class into CICERON (Classification of bIoaCtive pEptides fRom micrObial fermeNtation), a tool for the functional classification of peptides. State-of-the-art classifiers were found to underperform the models included in CICERON on our realistic benchmark dataset. Altogether, our work provides a tool for real-world peptide classification and can serve as a benchmark for future model development.
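The idea of pairing sequence encoders with classifiers and keeping the best pair per functional class can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not name the four encoders, so amino acid composition and one-hot encoding are stand-ins, the function names are hypothetical, and the scores are placeholder numbers, not results from the paper. It is not the CICERON implementation.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_encoding(peptide):
    """Encode a peptide as the fraction of each standard amino acid (20 values)."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide.upper())
    return [counts[aa] / len(peptide) for aa in AMINO_ACIDS]


def one_hot_encoding(peptide, max_len=10):
    """Encode a peptide as a flattened position-wise one-hot vector.

    Sequences are truncated or zero-padded to max_len positions;
    non-standard residues are left as all-zero rows.
    """
    vector = []
    for position in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if position < len(peptide):
            index = AMINO_ACIDS.find(peptide[position].upper())
            if index >= 0:
                row[index] = 1
        vector.extend(row)
    return vector


def best_model_per_class(scores):
    """Pick the highest-scoring (encoder, classifier) pair for each class.

    scores maps class name -> {(encoder, classifier): metric value}.
    """
    return {cls: max(combos, key=combos.get) for cls, combos in scores.items()}


# Placeholder scores for illustration only (NOT values from the paper).
toy_scores = {
    "immunomodulatory": {
        ("composition", "logistic_regression"): 0.80,
        ("one_hot", "random_forest"): 0.75,
    },
    "opioid": {
        ("composition", "logistic_regression"): 0.60,
        ("one_hot", "random_forest"): 0.70,
    },
}

selected = best_model_per_class(toy_scores)
for functional_class, (encoder, classifier) in selected.items():
    print(f"{functional_class}: {encoder} + {classifier}")
```

In this sketch each class ends up with a different encoder and classifier pair, which mirrors the abstract's observation that no single combination works well for every class.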