Consorzio Interuniversitario di Risonanze Magnetiche di Metallo Proteine, Via Luigi Sacconi 6, 50019 Sesto Fiorentino, Italy.
Institute for Drug Discovery, Leipzig University, Brüderstr. 34, 04103 Leipzig, Germany.
J Chem Inf Model. 2022 Jun 27;62(12):2951-2960. doi: 10.1021/acs.jcim.2c00522. Epub 2022 Jun 9.
Thirty-eight percent of protein structures in the Protein Data Bank contain at least one metal ion. However, not all of these metal sites are biologically relevant. Cations present as impurities during sample preparation or in the crystallization buffer can cause the formation of protein-metal complexes that do not exist in vivo. We implemented a deep learning approach to build a classifier able to distinguish between physiological and adventitious zinc-binding sites in the 3D structures of metalloproteins. We trained the classifier using manually annotated sites extracted from the MetalPDB database. Using a 10-fold cross-validation procedure, the classifier achieved an accuracy of about 90%. The same neural classifier could predict the physiological relevance of non-heme mononuclear iron sites with an accuracy of nearly 80%, suggesting that the rules learned on zinc sites have general relevance. By quantifying the relative importance of the features describing the input zinc sites from the network perspective and by analyzing the characteristics of the MetalPDB datasets, we inferred some common principles. Physiological sites present a low solvent accessibility of the amino acids forming coordination bonds with the metal ion (the metal ligands), a relatively large number of residues in the metal environment (≥20), and a distinct pattern of conservation of Cys and His residues in the site. Adventitious sites, on the other hand, tend to have a low number of donor atoms from the polypeptide chain (often one or two). These observations support the evaluation of the physiological relevance of novel metal-binding sites in protein structures.
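The 10-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single "protein donor atom count" feature, the toy values, and the threshold model standing in for the neural network are all assumptions made for illustration only.

```python
import random
from typing import Callable, List, Sequence, Tuple

Site = Tuple[float, ...]  # hypothetical feature vector describing one metal site


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_midpoint_threshold(train_x: Sequence[Site], train_y: Sequence[int]):
    """Stand-in model (NOT the paper's neural network): threshold feature 0
    at the midpoint between the two class means."""
    ones = [x[0] for x, y in zip(train_x, train_y) if y == 1]
    zeros = [x[0] for x, y in zip(train_x, train_y) if y == 0]
    m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
    threshold = (m0 + m1) / 2
    # Predict class 1 on whichever side of the threshold the class-1 mean lies.
    return lambda site: int((site[0] > threshold) == (m1 > m0))


def cross_validated_accuracy(
    features: Sequence[Site],
    labels: Sequence[int],
    fit: Callable[[Sequence[Site], Sequence[int]], Callable[[Site], int]],
    k: int = 10,
    seed: int = 0,
) -> float:
    """Mean held-out accuracy over k folds; each site is tested exactly once."""
    fold_accuracies = []
    for test_idx in k_fold_indices(len(features), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(features)) if i not in held_out]
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(model(features[i]) == labels[i] for i in test_idx)
        fold_accuracies.append(correct / len(test_idx))
    return sum(fold_accuracies) / k


# Toy, made-up data (not from MetalPDB): feature 0 is a hypothetical count of
# protein donor atoms; label 1 = physiological, label 0 = adventitious.
toy_features: List[Site] = [(1.0,), (2.0,)] * 10 + [(3.0,), (4.0,)] * 10
toy_labels: List[int] = [0] * 20 + [1] * 20

accuracy = cross_validated_accuracy(toy_features, toy_labels, fit_midpoint_threshold)
print(f"10-fold cross-validated accuracy on toy data: {accuracy:.2f}")
```

In this scheme every site is held out exactly once, so the reported accuracy is always measured on sites the model did not see during fitting. A real evaluation would replace the stand-in model with the trained network and the toy tuples with the actual site descriptors.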