Recent development of machine learning-based methods for the prediction of defensin family and subfamily.

Authors

Charoenkwan Phasit, Schaduangrat Nalini, Mahmud S M Hasan, Thinnukool Orawit, Shoombuatong Watshara

Affiliations

Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200.

Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700.

Publication information

EXCLI J. 2022 May 5;21:757-771. doi: 10.17179/excli2022-4913. eCollection 2022.

DOI:10.17179/excli2022-4913

PMID:35949489

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9360473/

Abstract

Nearly all living species possess host defense peptides called defensins, which are crucial for innate immunity. These peptides work by activating the immune system, which kills microbes directly or indirectly and thereby protects the host. Numerous peptide-based drugs are currently being evaluated in preclinical and clinical trials. Although experimental methods can precisely identify the defensin peptide family and subfamily, they are often time-consuming and costly. Machine learning (ML) methods, on the other hand, can effectively exploit protein sequence information without knowledge of a protein's three-dimensional structure, which makes them well suited to large-scale identification. To date, several ML methods have been developed for identifying the defensin peptide family and subfamily. A summary of the advantages and disadvantages of the existing methods is therefore urgently needed to provide useful suggestions for developing and improving computational models for this task. With this goal in mind, we first provide a comprehensive survey of six state-of-the-art computational approaches for predicting the defensin peptide family and subfamily. We cover several important aspects, including dataset quality, feature encoding methods, feature selection schemes, ML algorithms, cross-validation methods, and web server availability and usability. We then discuss the limitations of existing methods and future perspectives for improving prediction performance and model interpretability. The insights and suggestions from this review are anticipated to serve as valuable guidance for researchers developing more robust and useful predictors.
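
As an illustration of what a sequence-only feature encoding can look like, the sketch below computes amino acid composition, one of the simplest encodings used for peptide classification in general. It is a minimal example under stated assumptions, not the encoding of any specific method surveyed in the review: the function name `amino_acid_composition` and the toy peptide string are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in a peptide sequence.

    Hypothetical helper for illustration; non-standard residues are ignored.
    """
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy peptide string (an assumption for illustration, not a real defensin).
features = amino_acid_composition("ACCKCRRGC")
```

The resulting 20-dimensional vector needs no structural information and could be passed to any standard classifier. Which encodings and classifiers the surveyed predictors actually use is described in the full article.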
