Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark.
Department of Bio and Health Informatics, Technical University of Denmark, Kongens Lyngby, Denmark.
Proteins. 2019 Jun;87(6):520-527. doi: 10.1002/prot.25674. Epub 2019 Mar 9.
The ability to predict local structural features of a protein from the primary sequence is of paramount importance for unraveling its function in the absence of experimental structural information. Two main factors affect the utility of potential prediction tools: their accuracy must enable extraction of reliable structural information on the proteins of interest, and their runtime must be low to keep pace with sequencing data being generated at a constantly increasing speed. Here, we present NetSurfP-2.0, a novel tool that can predict the most important local structural features with unprecedented accuracy and speed. NetSurfP-2.0 is sequence-based and uses an architecture composed of convolutional and long short-term memory neural networks trained on solved protein structures. Using a single integrated model, NetSurfP-2.0 predicts solvent accessibility, secondary structure, structural disorder, and backbone dihedral angles for each residue of the input sequences. We assessed the accuracy of NetSurfP-2.0 on several independent test datasets and found it to consistently produce state-of-the-art predictions for each of its output features. We observe a correlation of 80% between predictions and experimental data for solvent accessibility, and a precision of 85% on secondary structure 3-class predictions. In addition to improved accuracy, the processing time has been optimized to allow prediction of more than 1000 proteins in less than 2 hours, and of complete proteomes in less than 1 day.
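The two headline figures in the abstract are per-residue agreement measures. The sketch below shows one plausible way to compute them, and it is not the authors' evaluation code. It assumes that the solvent accessibility figure is a Pearson correlation between predicted and experimentally derived values, and that the 3-class secondary structure figure is the fraction of residues assigned the correct class (often called Q3). The function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def q3_accuracy(predicted, observed):
    """Fraction of residues whose 3-class label (H, E or C) matches."""
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("need two non-empty label strings of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the paper.
    predicted_rsa = [0.10, 0.40, 0.80, 0.30]
    observed_rsa = [0.20, 0.50, 0.70, 0.10]
    print("solvent accessibility correlation:", pearson(predicted_rsa, observed_rsa))
    print("Q3:", q3_accuracy("HHHECC", "HHEECC"))
```

Both functions work on a single protein. Whether scores are pooled over all residues in a test set or averaged per protein is a separate choice that the abstract does not specify.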