Machine Learning Framework for Conotoxin Class and Molecular Target Prediction.

Affiliations

Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.

Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.

Publication information

Toxins (Basel). 2024 Nov 3;16(11):475. doi: 10.3390/toxins16110475.

DOI:10.3390/toxins16110475

PMID:39591230

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11598409/

Abstract

Conotoxins are small, highly potent neurotoxic peptides derived from the venom of marine cone snails, and they have captured the interest of the scientific community because of their pharmacological potential. These toxins display significant sequence and structural diversity, which results in a wide range of specificities for several different ion channels and receptors. Despite the recognized importance of these compounds, determining their binding targets and toxicities remains a significant challenge. Predicting the target receptors of conotoxins solely from their amino acid sequence is difficult because of the intricate relationships between structure, function, and target specificity, and because of the significant conformational heterogeneity observed in conotoxins with the same primary sequence. We have previously demonstrated that adding post-translational modifications, collisional cross section (CCS) values, and other structural features to the standard primary sequence features improves the accuracy of distinguishing conotoxins from non-toxic and other toxic peptides across varied datasets and several commonly used machine learning classifiers. Here, we present the effects of these features on conotoxin class and molecular target predictions, in particular on predicting conotoxins that bind to nicotinic acetylcholine receptors (nAChRs). We also demonstrate the use of the Synthetic Minority Oversampling Technique (SMOTE)-Tomek to balance the datasets while making the classes more distinct by removing ambiguous samples that nearly overlap between classes. In predicting the alpha, mu, and omega conotoxin classes, the SMOTE-Tomek PCA PLR model using the combination of the SS and P feature sets achieves the best performance, with an overall accuracy (OA) of 95.95%, an average accuracy (AA) of 93.04%, and an F1 score of 0.959. Using this model, we obtained sensitivities of 98.98%, 89.66%, and 90.48% for the alpha, mu, and omega conotoxin classes, respectively. Similarly, in predicting conotoxins that bind to nAChRs, the SMOTE-Tomek PCA SVM model, which used the CCS and P feature sets, demonstrated the highest performance, with an OA of 91.3%, an AA of 91.32%, and an F1 score of 0.9131. The sensitivity is 91.46% for conotoxins that bind to nAChRs and 91.18% for conotoxins that do not.
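
The abstract reports both an overall accuracy (OA) and an average accuracy (AA) without defining them. The sketch below is one reading, not the authors' code: it assumes OA is the fraction of all samples classified correctly and AA is the unweighted mean of the per-class sensitivities. The reported numbers are consistent with that reading, since the mean of 98.98%, 89.66%, and 90.48% is 93.04%, and the mean of 91.46% and 91.18% is 91.32%. The confusion matrix `hypothetical_cm` is an assumption made up for illustration; its counts were chosen only so that the row-wise sensitivities round to the reported values, and the actual test-set counts are not given in the abstract.

```python
def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal of a square confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def sensitivities(cm):
    """Per-class sensitivity (recall): diagonal entry divided by its row total.

    Rows are true classes, columns are predicted classes.
    """
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


def average_accuracy(cm):
    """Unweighted mean of the per-class sensitivities (assumed meaning of AA)."""
    sens = sensitivities(cm)
    return sum(sens) / len(sens)


# Sensitivities (in percent) exactly as reported in the abstract.
reported_class_sens = [98.98, 89.66, 90.48]   # alpha, mu, omega
reported_target_sens = [91.46, 91.18]         # nAChR binders, non-binders

aa_class = sum(reported_class_sens) / len(reported_class_sens)
aa_target = sum(reported_target_sens) / len(reported_target_sens)
print(f"AA from class sensitivities:  {aa_class:.2f}%")   # 93.04%
print(f"AA from target sensitivities: {aa_target:.2f}%")  # 91.32%

# HYPOTHETICAL confusion matrix (rows = true alpha, mu, omega).
# These counts are illustrative only and are not taken from the paper.
hypothetical_cm = [
    [97, 1, 0],
    [2, 26, 1],
    [1, 1, 19],
]
print(f"OA: {overall_accuracy(hypothetical_cm) * 100:.2f}%")  # 95.95%
print(f"AA: {average_accuracy(hypothetical_cm) * 100:.2f}%")  # 93.04%
```

OA weights every sample equally, so the largest class dominates it. AA weights every class equally, which is why it falls below OA here: the smaller mu and omega classes have lower sensitivity than alpha.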
