Wang Jen-Hung, Sung Ting-Yi
Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan.
ACS Omega. 2024 Jul 11;9(29):32116-32123. doi: 10.1021/acsomega.4c04246. eCollection 2024 Jul 23.
Examining the toxicity of peptides is essential for therapeutic peptide-based drug design. Machine learning approaches are frequently used to develop accurate peptide toxicity predictors. In this paper, we present ToxTeller, which provides four predictors built with logistic regression, support vector machines, random forests, and XGBoost, respectively. For prediction model development, we construct a data set of toxic and nontoxic peptides from the SwissProt and ConoServer databases, with existence evidence levels checked. We also make full use of the protein annotations in SwissProt to collect more toxic peptides than keyword search alone would find. From this data set, we construct an independent test data set whose sequences share at most 40% sequence similarity with one another and with the training data set. From a list of 28 feature combinations, we conduct 10-fold cross-validation on the training data set to determine the optimal feature combination for model development. ToxTeller's performance is evaluated and compared with existing predictors on the independent test data set. Because toxic peptides must be avoided in drug design, we analyze strategies for reducing false-negative predictions of toxic peptides. We suggest selecting models by top sensitivity instead of the widely used Matthews correlation coefficient, and we also suggest using a meta-predictor approach that combines multiple predictors.
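The two suggestions at the end of the abstract can be illustrated with a minimal sketch. It is not ToxTeller's implementation: the labels and predictions below are made-up toy values (1 = toxic, 0 = nontoxic), the names `model_a`, `model_b`, and `model_c` are hypothetical, and the "flag as toxic if any predictor says toxic" rule is only one assumed way of combining multiple predictors.

```python
from math import sqrt

def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = toxic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(y_true, y_pred):
    """Fraction of truly toxic peptides that are predicted toxic."""
    tp, _, _, fn = confusion(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when undefined."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def any_vote(predictions):
    """Assumed combination rule: toxic if at least one predictor says toxic."""
    return [int(any(votes)) for votes in zip(*predictions)]

# Toy data for illustration only; these are not results from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
preds = {
    "model_a": [1, 1, 1, 0, 0, 0, 0, 0],  # misses one toxic peptide
    "model_b": [1, 1, 1, 1, 1, 1, 0, 0],  # catches all, two false alarms
    "model_c": [0, 1, 1, 1, 0, 0, 0, 1],  # misses a different toxic peptide
}

# The two selection criteria can pick different models.
best_by_mcc = max(preds, key=lambda name: mcc(y_true, preds[name]))
best_by_sensitivity = max(preds, key=lambda name: sensitivity(y_true, preds[name]))

# Combining two imperfect predictors removes their false negatives here.
combined = any_vote([preds["model_a"], preds["model_c"]])
combined_fn = confusion(y_true, combined)[3]

print(best_by_mcc, best_by_sensitivity, combined_fn)
```

On this toy data the model with the highest Matthews correlation coefficient still misses a toxic peptide, while selecting by sensitivity, or combining predictors whose errors differ, leaves no false negatives at the cost of more false positives.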