SPOTONE: Hot Spots on Protein Complexes with Extremely Randomized Trees via Sequence-Only Features.

Affiliations

CNC-Center for Neuroscience and Cell Biology, University of Coimbra, 3004-504 Coimbra, Portugal.

Department of Life Sciences, Center for Neuroscience and Cell Biology, Coimbra University, 3000-456 Coimbra, Portugal.

Publication information

Int J Mol Sci. 2020 Oct 1;21(19):7281. doi: 10.3390/ijms21197281.

DOI:10.3390/ijms21197281

PMID:33019775

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7582262/

Abstract

Protein Hot-Spots (HS) are experimentally determined amino acids that are key to small ligand binding and tend to be structural landmarks on protein-protein interactions. As such, they have been extensively approached by structure-based Machine Learning (ML) prediction methods. However, far more protein sequences are available than determined three-dimensional structures, which indicates that a sequence-based HS predictor has the potential to be more useful for the scientific community. Herein, we present SPOTONE, a new ML predictor able to accurately classify protein HS via sequence-only features. This algorithm shows accuracy, AUROC, precision, recall and F1-score of 0.82, 0.83, 0.91, 0.82 and 0.85, respectively, on an independent testing set. The algorithm is deployed within a free-to-use webserver at http://moreiralab.com/resources/spotone, which only requires the user to submit a FASTA file with one or more protein sequences.
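Since the webserver takes a FASTA file with one or more protein sequences, it can help to check such a file before submission. The following is a minimal sketch in Python, not part of SPOTONE itself: the function names `parse_fasta` and `nonstandard_residues`, the example headers and the example sequences are all hypothetical, and the check for non-standard residue letters is an illustrative precaution rather than a documented requirement of the server.

```python
from typing import Dict, Iterable, List

# The 20 standard amino acid one-letter codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequences may be wrapped over several lines; join them.
            records[header] += line.upper()
    return records


def nonstandard_residues(seq: str) -> List[str]:
    """Return the sorted letters in seq that are not standard amino acids."""
    return sorted(set(seq) - STANDARD_AA)


# Hypothetical two-sequence input, made up for illustration only.
example = """>protein_A hypothetical example
MKTAYIAKQR
QISFVKSHFS
>protein_B hypothetical example
GSHMLEDPVX
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), nonstandard_residues(seq))
```

Here the second sequence is flagged because it contains `X`, a placeholder for an unknown residue. How the server treats such letters is not stated in the abstract, so the check only highlights them for manual review.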
