
Training data composition affects performance of protein structure analysis algorithms.

Affiliation

Biomedical Informatics Training Program, Stanford University, Stanford, CA 94305, USA.

Publication information

Pac Symp Biocomput. 2022;27:10-21.

PMID:34890132
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8669736/
Abstract

The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug development. The accuracy of such models in deployment is directly influenced by training data quality. The use of different experimental methods for protein structure determination may introduce bias into the training data. In this work, we evaluate the magnitude of this effect across three distinct tasks: estimation of model accuracy, protein sequence design, and catalytic residue prediction. Most protein structures are derived from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM); we trained each model on datasets consisting of either all three structure types or of only X-ray data. We find that across these tasks, models consistently perform worse on test sets derived from NMR and cryo-EM than they do on test sets of structures derived from X-ray crystallography, but that the difference can be mitigated when NMR and cryo-EM structures are included in the training set. Importantly, we show that including all three types of structures in the training set does not degrade test performance on X-ray structures, and in some cases even increases it. Finally, we examine the relationship between model performance and the biophysical properties of each method, and recommend that the biochemistry of the task of interest should be considered when composing training sets.
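
The comparison described in the abstract (training on X-ray structures only versus all three structure types, then scoring test sets separately by experimental method) can be illustrated with a minimal Python sketch. It is not the authors' code. The record layout, the method labels `xray`, `nmr`, and `cryoem`, the function names, and the toy scores are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical method labels; real structure metadata would need to be
# mapped onto these three values before use.
METHODS = ("xray", "nmr", "cryoem")


def build_training_sets(records):
    """Return the two training conditions named in the abstract."""
    all_methods = [r for r in records if r["method"] in METHODS]
    xray_only = [r for r in all_methods if r["method"] == "xray"]
    return {"xray_only": xray_only, "all_methods": all_methods}


def evaluate_by_method(score_fn, test_records):
    """Average a per-structure score separately for each experimental method."""
    scores = defaultdict(list)
    for r in test_records:
        scores[r["method"]].append(score_fn(r))
    return {method: mean(vals) for method, vals in scores.items()}


# Toy records, invented purely for illustration (not data from the paper).
train = [
    {"id": "a", "method": "xray"},
    {"id": "b", "method": "xray"},
    {"id": "c", "method": "nmr"},
    {"id": "d", "method": "cryoem"},
]
test = [
    {"id": "e", "method": "xray", "score": 0.9},
    {"id": "f", "method": "nmr", "score": 0.6},
    {"id": "g", "method": "nmr", "score": 0.8},
]

training_sets = build_training_sets(train)
# A real score_fn would run a trained model; here it just reads a stored value.
per_method = evaluate_by_method(lambda r: r["score"], test)
```

In practice one model would be trained per entry in `training_sets`, and `evaluate_by_method` would be called once for each model, so that the per-method scores of the two training conditions can be compared side by side.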


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a739/8669736/519d5da2c680/nihms-1760592-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a739/8669736/e25c980d4921/nihms-1760592-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a739/8669736/f2900e196ee7/nihms-1760592-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a739/8669736/c76fca107d0a/nihms-1760592-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a739/8669736/eca9c0acfad9/nihms-1760592-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a739/8669736/5bb94e37ae08/nihms-1760592-f0006.jpg


