Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction.

Affiliations

Fox Chase Cancer Center, Philadelphia, PA, United States of America.

Temple University, Philadelphia, PA, United States of America.

Publication information

PLoS One. 2020 May 6;15(5):e0232528. doi: 10.1371/journal.pone.0232528. eCollection 2020.

Abstract

Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.
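
Two quantities in the abstract lend themselves to a short illustration: the ±14 residue input window (29 positions per prediction) and the 3-label accuracy over helix, sheet, and coil. The sketch below is not SecNet code. The padding symbol, the function names, and the 8-to-3 label reduction (the commonly used DSSP mapping) are assumptions made for illustration; the abstract does not say how chain ends are padded or which reduction the authors apply.

```python
from typing import List

# Assumption: the widely used DSSP 8-to-3 reduction
# (H, G, I -> helix; E, B -> sheet; everything else -> coil).
# The paper may use a different reduction.
DSSP8_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

# Hypothetical padding symbol for window positions beyond the chain ends.
PAD = "-"


def reduce_to_3(dssp8: str) -> str:
    """Map an 8-state DSSP string to helix (H), sheet (E), coil (C)."""
    return "".join(DSSP8_TO_3.get(state, "C") for state in dssp8)


def windows(sequence: str, half_width: int = 14) -> List[str]:
    """Return one window of 2 * half_width + 1 residues centred on each position."""
    padded = PAD * half_width + sequence + PAD * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose 3-label prediction matches the observed label."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have equal length")
    if not observed:
        raise ValueError("label strings must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)
```

For a chain of n residues, `windows` yields n strings of 29 symbols each, with the residue being predicted at index 14. `q3_accuracy` is a per-residue percentage, so the figures quoted in the abstract (for example 84.5% versus 83.9%) would correspond to this kind of count pooled over all residues in a test set, assuming the authors pool rather than average per protein.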
