University of Wrocław, Faculty of Biotechnology, Poland.
Warsaw University of Technology, Faculty of Mathematics and Information Science, Poland.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac343.
Antimicrobial peptides (AMPs) are a heterogeneous group of short polypeptides that target not only microorganisms but also viruses and cancer cells. Because they select for resistance less strongly than traditional antibiotics, AMPs have attracted ever-growing attention from researchers, including bioinformaticians. Machine learning represents the most cost-effective method for novel AMP discovery, and consequently many computational tools for AMP prediction have recently been developed. In this article, we investigate the impact of negative data sampling on model performance and benchmarking. We generated 660 predictive models using 12 machine learning architectures, a single positive data set and 11 negative data sampling methods; the architectures and methods were defined on the basis of published AMP prediction software. Our results clearly indicate that similar training and benchmark data sets, i.e. data sets produced by the same or a similar negative data sampling method, positively affect model performance. Consequently, all the benchmark analyses that have been performed for AMP prediction models are significantly biased and, moreover, it remains unknown which model is the most accurate. To provide researchers with reliable information about the performance of AMP predictors, we also created a web server, AMPBenchmark, for fair model benchmarking. AMPBenchmark is available at http://BioGenies.info/AMPBenchmark.
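The experimental design described in the abstract can be laid out as a small grid. The Python sketch below is only an illustration of the counts, not the authors' code: the architecture and sampler names are placeholders, and the abstract does not state how 660 models arise from 12 architectures and 11 sampling methods. The arithmetic implies five models per combination, which is assumed here to mean repeated training runs. The second half enumerates a hypothetical cross-evaluation in which every combination is tested on a benchmark set from every sampling method, flagging the cases where training and benchmark negatives come from the same method.

```python
from itertools import product

# Counts taken from the abstract.
N_ARCHITECTURES = 12
N_SAMPLING_METHODS = 11
N_MODELS = 660

# Placeholder names (hypothetical); the abstract does not list the real ones.
architectures = [f"arch_{i}" for i in range(1, N_ARCHITECTURES + 1)]
sampling_methods = [f"sampler_{j}" for j in range(1, N_SAMPLING_METHODS + 1)]

# Every (architecture, training sampler) combination.
combos = list(product(architectures, sampling_methods))

# Assumption: the 660 models are spread evenly over the combinations.
models_per_combo, remainder = divmod(N_MODELS, len(combos))

# Hypothetical cross-evaluation grid: each combination is benchmarked on
# negatives drawn by each sampling method. The last field marks whether
# the training and benchmark samplers are the same.
evaluations = [
    (arch, train_sampler, bench_sampler, train_sampler == bench_sampler)
    for arch, train_sampler in combos
    for bench_sampler in sampling_methods
]
matched = sum(1 for *_, same in evaluations if same)

print(f"{len(combos)} combinations, {models_per_combo} models each")
print(f"{len(evaluations)} train/benchmark pairings, {matched} with matching samplers")
```

Under these assumptions, only one pairing in eleven uses the same sampling method for training and benchmarking, which is the situation the abstract identifies as favouring a model.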