ProteinNet: a standardized data set for machine learning of protein structure.

Author information

Laboratory of Systems Pharmacology, Department of Systems Biology, Harvard Medical School, 200 Longwood Avenue, Boston, MA, 02115, USA.

Publication information

BMC Bioinformatics. 2019 Jun 11;20(1):311. doi: 10.1186/s12859-019-2932-0.

DOI:10.1186/s12859-019-2932-0

PMID:31185886

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6560865/

Abstract

BACKGROUND

Rapid progress in deep learning has spurred its application to bioinformatics problems including protein structure prediction and design. In classic machine learning problems like computer vision, progress has been driven by standardized data sets that facilitate fair assessment of new methods and lower the barrier to entry for non-domain experts. While data sets of protein sequence and structure exist, they lack certain components critical for machine learning, including high-quality multiple sequence alignments and insulated training/validation splits that account for deep but only weakly detectable homology across protein space.

RESULTS

We created the ProteinNet series of data sets to provide a standardized mechanism for training and assessing data-driven models of protein sequence-structure relationships. ProteinNet integrates sequence, structure, and evolutionary information in programmatically accessible file formats tailored for machine learning frameworks. Multiple sequence alignments of all structurally characterized proteins were created using substantial high-performance computing resources. Standardized data splits were also generated to emulate the difficulty of past CASP (Critical Assessment of protein Structure Prediction) experiments by resetting protein sequence and structure space to the historical states that preceded six prior CASPs. Utilizing sensitive evolution-based distance metrics to segregate distantly related proteins, we have additionally created validation sets distinct from the official CASP sets that faithfully mimic their difficulty.
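The splitting idea described above (reset the available data to a historical cutoff, then keep validation proteins insulated from training proteins) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Entry` record, the `split` function, the cutoff date, the 0.3 threshold, and especially the toy position-wise `identity` function, which merely stands in for the sensitive evolution-based distance metrics the abstract mentions. It is not the ProteinNet pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    """Hypothetical record: an identifier, a release date, and a sequence."""
    id: str
    released: date
    sequence: str


def identity(a: str, b: str) -> float:
    """Toy similarity: matching positions divided by the longer length.

    A placeholder for a real alignment- or profile-based metric.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def split(entries, cutoff, holdout_ids, max_identity=0.3):
    """Return (training, validation) restricted to entries released before cutoff.

    Entries named in holdout_ids form the validation set; any remaining entry
    that is too similar to a validation entry is dropped from training.
    """
    available = [e for e in entries if e.released < cutoff]
    validation = [e for e in available if e.id in holdout_ids]
    training = [
        e for e in available
        if e.id not in holdout_ids
        and all(identity(e.sequence, v.sequence) <= max_identity
                for v in validation)
    ]
    return training, validation


# Made-up example data, not real proteins.
entries = [
    Entry("a", date(2010, 1, 1), "MKTAYIAK"),
    Entry("b", date(2011, 1, 1), "MKTAYIAR"),  # near-duplicate of "a"
    Entry("c", date(2012, 1, 1), "GGSLLPWQ"),  # unrelated to "a"
    Entry("d", date(2015, 1, 1), "MKTAYIAK"),  # released after the cutoff
]
train, val = split(entries, cutoff=date(2014, 5, 1), holdout_ids={"a"})
print([e.id for e in train], [e.id for e in val])
```

In this toy run, "d" is excluded because it postdates the cutoff, and "b" is dropped from training because it is nearly identical to the held-out entry "a". A simple identity threshold like this would miss the weakly detectable homology the abstract highlights, which is why a more sensitive metric is needed in practice.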

CONCLUSION

ProteinNet represents a comprehensive and accessible resource for training and assessing machine-learned models of protein structure.

Similar articles

ProteinNet: a standardized data set for machine learning of protein structure.

BMC Bioinformatics. 2019 Jun 11;20(1):311. doi: 10.1186/s12859-019-2932-0.

Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning.

Proteins. 2018 Mar;86 Suppl 1(Suppl 1):84-96. doi: 10.1002/prot.25405. Epub 2017 Oct 31.

SidechainNet: An all-atom protein structure dataset for machine learning.

Proteins. 2021 Nov;89(11):1489-1496. doi: 10.1002/prot.26169. Epub 2021 Jul 12.

Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model.

PLoS Comput Biol. 2017 Jan 5;13(1):e1005324. doi: 10.1371/journal.pcbi.1005324. eCollection 2017 Jan.

Sequence alignment using machine learning for accurate template-based protein structure prediction.

Bioinformatics. 2020 Jan 1;36(1):104-111. doi: 10.1093/bioinformatics/btz483.

Prediction Enhancement of Residue Real-Value Relative Accessible Surface Area in Transmembrane Helical Proteins by Solving the Output Preference Problem of Machine Learning-Based Predictors.

J Chem Inf Model. 2015 Nov 23;55(11):2464-74. doi: 10.1021/acs.jcim.5b00246. Epub 2015 Oct 20.

Machine Learning Approaches for Quality Assessment of Protein Structures.

Biomolecules. 2020 Apr 17;10(4):626. doi: 10.3390/biom10040626.

Structural and Sequence Similarity Makes a Significant Impact on Machine-Learning-Based Scoring Functions for Protein-Ligand Interactions.

J Chem Inf Model. 2017 Apr 24;57(4):1007-1012. doi: 10.1021/acs.jcim.7b00049. Epub 2017 Apr 5.

Accurate contact predictions using covariation techniques and machine learning.

Proteins. 2016 Sep;84 Suppl 1(Suppl Suppl 1):145-51. doi: 10.1002/prot.24863. Epub 2015 Aug 14.

rawMSA: End-to-end Deep Learning using raw Multiple Sequence Alignments.

PLoS One. 2019 Aug 15;14(8):e0220182. doi: 10.1371/journal.pone.0220182. eCollection 2019.

Cited by

Language Modelling Techniques for Analysing the Impact of Human Genetic Variation.

Bioinform Biol Insights. 2025 Sep 2;19:11779322251358314. doi: 10.1177/11779322251358314. eCollection 2025.

RC-GNN: A predictive model of enzyme-reaction pairs.

bioRxiv. 2025 Jun 27:2025.06.22.660952. doi: 10.1101/2025.06.22.660952.

Multi-stage attention-based extraction and fusion of protein sequence and structural features for protein function prediction.

Bioinformatics. 2025 Jun 26. doi: 10.1093/bioinformatics/btaf374.

Annotating the microbial dark matter with HiFi-NN.

iScience. 2025 Apr 18;28(6):112480. doi: 10.1016/j.isci.2025.112480. eCollection 2025 Jun 20.

Protein structure prediction via deep learning: an in-depth review.

Front Pharmacol. 2025 Apr 3;16:1498662. doi: 10.3389/fphar.2025.1498662. eCollection 2025.

Designing single-polymer-chain nanoparticles to mimic biomolecular hydration frustration.

Nat Chem. 2025 Mar 12. doi: 10.1038/s41557-025-01760-9.

How well do contextual protein encodings learn structure, function, and evolutionary context?

Cell Syst. 2025 Mar 19;16(3):101201. doi: 10.1016/j.cels.2025.101201. Epub 2025 Mar 4.

Holographic-(V)AE: An end-to-end SO(3)-equivariant (variational) autoencoder in Fourier space.

Phys Rev Res. 2024 Apr-Jun;6(2). doi: 10.1103/physrevresearch.6.023006. Epub 2024 Apr 1.

CryptoBench: cryptic protein-ligand binding sites dataset and benchmark.

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae745.

Linking Protein Stability to Pathogenicity: Predicting Clinical Significance of Single-Missense Mutations in Ocular Proteins Using Machine Learning.

Int J Mol Sci. 2024 Oct 30;25(21):11649. doi: 10.3390/ijms252111649.
