Folding the human proteome using BioNeMo: A fused dataset of structural models for machine learning purposes.

Affiliations

Innophore, San Francisco, CA, USA.

NVIDIA, Santa Clara, CA, USA.

Publication information

Sci Data. 2024 Jun 6;11(1):591. doi: 10.1038/s41597-024-03403-z.

DOI:10.1038/s41597-024-03403-z

PMID:38844754

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11156891/

Abstract

Human proteins are crucial players in both health and disease. Understanding their molecular landscape is a central topic in biological research. Here, we present an extensive dataset of predicted protein structures for 42,042 distinct human proteins, including splicing variants, derived from the UniProt reference proteome UP000005640. To ensure high quality and comparability, the dataset was generated by combining state-of-the-art modeling tools AlphaFold 2, OpenFold, and ESMFold, provided within NVIDIA's BioNeMo platform, as well as homology modeling using Innophore's CavitomiX platform. Our dataset is offered in both unedited and edited formats for diverse research requirements. The unedited version contains structures as generated by the different prediction methods, whereas the edited version contains refinements, including a dataset of structures without low prediction-confidence regions and structures in complex with predicted ligands based on homologs in the PDB. We are confident that this dataset represents the most comprehensive collection of human protein structures available today, facilitating diverse applications such as structure-based drug design and the prediction of protein function and interactions.
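The edited dataset described in the abstract drops low prediction-confidence regions. As a rough illustration of what such a filter can look like (not the authors' actual procedure), the sketch below removes atoms whose per-residue confidence falls below a cutoff. It assumes the confidence score (pLDDT) is stored in the B-factor column of the PDB file, as AlphaFold 2 output does, and that it is on a 0-100 scale; other tools or versions may use a different scale. The function name `strip_low_confidence` and the cutoff of 70 are assumptions made for this example and are not taken from the abstract.

```python
def strip_low_confidence(pdb_text: str, min_plddt: float = 70.0) -> str:
    """Return PDB text without ATOM/HETATM records scoring below min_plddt.

    Assumption: the per-residue confidence (pLDDT, 0-100) is stored in the
    B-factor field, i.e. columns 61-66 of the fixed-width PDB record.
    The default cutoff of 70 is illustrative only.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            try:
                # Columns 61-66 (1-based) are the B-factor field.
                plddt = float(line[60:66])
            except ValueError:
                # Malformed or truncated record: keep it rather than guess.
                kept.append(line)
                continue
            if plddt < min_plddt:
                continue  # drop low-confidence atom
        kept.append(line)
    return "\n".join(kept)
```

A typical use would be to read a predicted model from disk, pass its text through the function, and write the result to a new file. Records other than ATOM and HETATM are passed through unchanged, so a production filter would likely also need to handle chain breaks and renumbering, which this sketch ignores.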
