Mondal Dibyendu, Kumar Vipul, Satler Tadej, Ramachandran Rakesh, Saltzberg Daniel, Chemmama Ilan, Pilla Kala Bharath, Echeverria Ignacia, Webb Benjamin M, Gupta Meghna, Verba Kliment A, Sali Andrej
bioRxiv. 2025 May 2:2024.12.10.627859. doi: 10.1101/2024.12.10.627859.
Building an accurate atomic structure model of a protein into a cryo-electron microscopy (cryo-EM) map at a resolution worse than 3 Å is difficult. To facilitate this task, we devised EMSequenceFinder, a method for assigning the amino acid residue sequence to the backbone fragments traced in an input cryo-EM map. EMSequenceFinder relies on a Bayesian scoring function that ranks the 20 standard amino acid residue types at a given backbone position, based on the fit to the density map, the map resolution, and secondary structure propensity. The fit to the density is quantified by a convolutional neural network that was trained on ~5.56 million amino acid residue densities extracted from cryo-EM maps at 3-10 Å resolution and the corresponding atomic structure models deposited in the Electron Microscopy Data Bank (EMDB). We benchmarked EMSequenceFinder by predicting the sequences of 58,044 distinct α-helix and β-strand fragments, given the fragment backbone coordinates fitted in their density maps. EMSequenceFinder identifies the correct sequence as the best-scoring sequence in 77.8% of these cases. We also assessed EMSequenceFinder on separate datasets of cryo-EM maps at resolutions from 4 to 6 Å. The accuracy of EMSequenceFinder (63.5%) was better than that of the three state-of-the-art methods tested: findMysequence (45%), ModelAngelo (27%), and sequence_from_map in Phenix (12.9%). We further illustrate EMSequenceFinder by threading the SARS-CoV-2 NSP2 sequence into eight cryo-EM maps at resolutions from 3.7 to 7.0 Å. EMSequenceFinder is implemented in our open-source Integrative Modeling Platform (IMP) program. Thus, it is expected to be helpful for integrative structure modeling based on a cryo-EM map and other information, such as models of protein complex components and chemical crosslinks between them. EMSequenceFinder is available as part of our open-source IMP distribution at https://integrativemodeling.org/.
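To make the idea of Bayesian ranking concrete, the following minimal Python sketch shows one way per-residue evidence could be combined and used to place a fragment on a sequence. It is an illustration under stated assumptions, not the EMSequenceFinder implementation or the IMP API: the function names, the treatment of the network output as a likelihood and the secondary structure propensity as a prior, and the sum of log posteriors as a fragment score are all hypothetical choices, and the dependence on map resolution described in the abstract is omitted.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def residue_posteriors(density_probs, ss_prior):
    """Combine density-fit probabilities with a prior via Bayes' rule.

    density_probs: dict residue type -> probability from a density classifier
                   (assumed here to stand in for the CNN output).
    ss_prior:      dict residue type -> prior propensity for the local
                   secondary structure (assumed values, not from the paper).
    """
    unnormalized = {aa: density_probs[aa] * ss_prior[aa] for aa in AMINO_ACIDS}
    total = sum(unnormalized.values())
    return {aa: p / total for aa, p in unnormalized.items()}


def rank_residue_types(posterior):
    """Return residue types sorted from most to least probable."""
    return sorted(posterior, key=posterior.get, reverse=True)


def best_offset(sequence, fragment_posteriors, eps=1e-12):
    """Slide a fragment along the sequence and score every placement.

    The score of a placement is the sum of log posteriors of the sequence
    residues at each fragment position. Returns (offset, score) of the
    best placement, or None if the fragment is longer than the sequence.
    """
    n = len(fragment_posteriors)
    best = None
    for offset in range(len(sequence) - n + 1):
        score = sum(
            math.log(max(post.get(sequence[offset + i], 0.0), eps))
            for i, post in enumerate(fragment_posteriors)
        )
        if best is None or score > best[1]:
            best = (offset, score)
    return best
```

In this sketch, a four-residue fragment would be described by four posterior dictionaries, and `best_offset` would return the sequence position at which those posteriors jointly agree best with the sequence.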