Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction.

Affiliations

Fox Chase Cancer Center, Philadelphia, PA, United States of America.

Temple University, Philadelphia, PA, United States of America.

Publication information

PLoS One. 2020 May 6;15(5):e0232528. doi: 10.1371/journal.pone.0232528. eCollection 2020.

DOI:10.1371/journal.pone.0232528

PMID:32374785

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7202669/

Abstract

Protein secondary structure prediction remains a vital topic with broad applications. Due to lack of a widely accepted standard in secondary structure predictor evaluation, a fair comparison of predictors is challenging. A detailed examination of factors that contribute to higher accuracy is also lacking. In this paper, we present: (1) new test sets, Test2018, Test2019, and Test2018-2019, consisting of proteins from structures released in 2018 and 2019 with less than 25% identity to any protein published before 2018; (2) a 4-layer convolutional neural network, SecNet, with an input window of ±14 amino acids which was trained on proteins ≤25% identical to proteins in Test2018 and the commonly used CB513 test set; (3) an additional test set that shares no homologous domains with the training set proteins, according to the Evolutionary Classification of Proteins (ECOD) database; (4) a detailed ablation study where we reverse one algorithmic choice at a time in SecNet and evaluate the effect on the prediction accuracy; (5) new 4- and 5-label prediction alphabets that may be more practical for tertiary structure prediction methods. The 3-label accuracy (helix, sheet, coil) of the leading predictors on both Test2018 and CB513 is 81-82%, while SecNet's accuracy is 84% for both sets. Accuracy on the non-homologous ECOD set is only 0.6 points (83.9%) lower than the results on the Test2018-2019 set (84.5%). The ablation study of features, neural network architecture, and training hyper-parameters suggests the best accuracy results are achieved with good choices for each of them while the neural network architecture is not as critical as long as it is not too simple. Protocols for generating and using unbiased test, validation, and training sets are provided. Our data sets, including input features and assigned labels, and SecNet software including third-party dependencies and databases, are downloadable from dunbrack.fccc.edu/ss and github.com/sh-maxim/ss.
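As a rough illustration of two quantities named in the abstract, the sketch below shows how a ±14-residue input window (29 positions) might be cut from a sequence and how a 3-label accuracy (helix, sheet, coil) could be computed. It is not SecNet's code. The 8-to-3 label reduction, the `X` padding character, and all function names are assumptions made for this example; the abstract does not specify them.

```python
from typing import List

# Assumed reduction of DSSP's 8 states to 3 labels: H/G/I -> helix (H),
# E/B -> sheet (E), everything else -> coil (C). Other reductions exist.
DSSP8_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_3_labels(dssp: str) -> str:
    """Map a DSSP string to the 3-label alphabet (H, E, C)."""
    return "".join(DSSP8_TO_3.get(state, "C") for state in dssp)


def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted 3-state label matches."""
    if len(predicted) != len(observed):
        raise ValueError("label strings must have equal length")
    if not observed:
        raise ValueError("label strings must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def residue_windows(sequence: str, half_width: int = 14, pad: str = "X") -> List[str]:
    """Return one window of 2 * half_width + 1 residues per position.

    Positions past either terminus are filled with `pad` (hypothetical
    choice for this sketch).
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


if __name__ == "__main__":
    observed = reduce_to_3_labels("HGIEBTS-")
    print(observed)
    print(q3_accuracy("HHHC", "HHEC"))
    print(residue_windows("ACDE", half_width=2))
```

With the default `half_width=14`, every window has 29 positions, matching the ±14 window described above. A real predictor would encode each window position as a feature vector rather than a single character.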

Similar articles

Multifaceted analysis of training and testing convolutional neural networks for protein secondary structure prediction.

PLoS One. 2020 May 6;15(5):e0232528. doi: 10.1371/journal.pone.0232528. eCollection 2020.

MUFOLD-SS: New deep inception-inside-inception networks for protein secondary structure prediction.

Proteins. 2018 May;86(5):592-598. doi: 10.1002/prot.25487. Epub 2018 Mar 12.

MABAL: a Novel Deep-Learning Architecture for Machine-Assisted Bone Age Labeling.

J Digit Imaging. 2018 Aug;31(4):513-519. doi: 10.1007/s10278-018-0053-3.

PCP-GC-LM: single-sequence-based protein contact prediction using dual graph convolutional neural network and convolutional neural network.

BMC Bioinformatics. 2024 Sep 2;25(1):287. doi: 10.1186/s12859-024-05914-3.

IGPRED-MultiTask: A Deep Learning Model to Predict Protein Secondary Structure, Torsion Angles and Solvent Accessibility.

IEEE/ACM Trans Comput Biol Bioinform. 2023 Mar-Apr;20(2):1104-1113. doi: 10.1109/TCBB.2022.3191395. Epub 2023 Apr 3.

DNSS2: Improved ab initio protein secondary structure prediction using advanced deep learning architectures.

Proteins. 2021 Feb;89(2):207-217. doi: 10.1002/prot.26007. Epub 2020 Sep 16.

Prediction of 8-state protein secondary structures by a novel deep learning architecture.

BMC Bioinformatics. 2018 Aug 3;19(1):293. doi: 10.1186/s12859-018-2280-5.

MFTrans: A multi-feature transformer network for protein secondary structure prediction.

Int J Biol Macromol. 2024 May;267(Pt 1):131311. doi: 10.1016/j.ijbiomac.2024.131311. Epub 2024 Apr 9.

SAINT: self-attention augmented inception-inside-inception network improves protein secondary structure prediction.

Bioinformatics. 2020 Nov 1;36(17):4599-4608. doi: 10.1093/bioinformatics/btaa531.

Predicting dihedral angle probability distributions for protein coil residues from primary sequence using neural networks.

BMC Bioinformatics. 2009 Oct 16;10:338. doi: 10.1186/1471-2105-10-338.

Cited by

Post-processing enhances protein secondary structure prediction with second order deep learning and embeddings.

Comput Struct Biotechnol J. 2025 Jan 2;27:243-251. doi: 10.1016/j.csbj.2024.12.022. eCollection 2025.

Deep learning for protein secondary structure prediction: Pre and post-AlphaFold.

Comput Struct Biotechnol J. 2022 Nov 11;20:6271-6286. doi: 10.1016/j.csbj.2022.11.012. eCollection 2022.

Multistage Combination Classifier Augmented Model for Protein Secondary Structure Prediction.

Front Genet. 2022 May 23;13:769828. doi: 10.3389/fgene.2022.769828. eCollection 2022.

Ensemble of Template-Free and Template-Based Classifiers for Protein Secondary Structure Prediction.

Int J Mol Sci. 2021 Oct 23;22(21):11449. doi: 10.3390/ijms222111449.

PYTHIA: Deep Learning Approach for Local Protein Conformation Prediction.

Int J Mol Sci. 2021 Aug 17;22(16):8831. doi: 10.3390/ijms22168831.

Deep geometric representations for modeling effects of mutations on protein-protein binding affinity.

PLoS Comput Biol. 2021 Aug 4;17(8):e1009284. doi: 10.1371/journal.pcbi.1009284. eCollection 2021 Aug.

The whole is greater than its parts: ensembling improves protein contact prediction.

Sci Rep. 2021 Apr 13;11(1):8039. doi: 10.1038/s41598-021-87524-0.
