Ayoub Ronald, Lee Yugyung
School of Computing and Engineering, University of Missouri at Kansas City, Kansas City, Missouri, USA.
Proteins. 2021 Jun;89(6):648-658. doi: 10.1002/prot.26048. Epub 2021 Feb 2.
Protein structure prediction is a long-standing unsolved problem in molecular biology that has seen renewed interest with the recent success of deep learning, notably AlphaFold at CASP13. While developing and evaluating protein structure prediction methods, researchers may want to identify the known structures most similar to their predicted structures. These predicted structures often have low sequence and structure similarity to known structures. We show how RUPEE, a purely geometric protein structure search, identifies the structures most similar to structure predictions regardless of how they vary from known structures, something existing protein structure searches struggle with. RUPEE accomplishes this through a novel linear encoding of protein structures as a sequence of residue descriptors. Using a fast Needleman-Wunsch algorithm, RUPEE aligns the query's sequence of residue descriptors against that of every available structure. This is followed by a series of increasingly accurate structure alignments, from TM-align alignments initialized with the Needleman-Wunsch residue descriptor alignments to standard TM-align alignments of the final results. By using alignment normalization effectively at each stage, RUPEE can also execute containment searches, in addition to full-length searches, to identify structural motifs within proteins. We compare the results of RUPEE to those of the protein structure searches mTM-align, SSM, CATHEDRAL, and VAST using a benchmark derived from the protein structure predictions submitted to CASP13. On average, RUPEE identifies better alignments with respect to TM-score as well as Q-score and SSAP-score, the scores specific to SSM and CATHEDRAL, respectively.
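To make the descriptor-alignment and normalization ideas concrete, here is a minimal sketch in Python. It is not RUPEE's implementation: the integer descriptor values, the scoring parameters (`match`, `mismatch`, `gap`), and the two normalization rules (`full_length` divides by the longer sequence, `containment` divides by the query length) are illustrative assumptions, since the abstract does not specify them.

```python
from typing import Sequence


def nw_score(a: Sequence[int], b: Sequence[int],
             match: int = 1, mismatch: int = 0, gap: int = 0) -> int:
    """Needleman-Wunsch global alignment score of two descriptor sequences.

    Uses two rolling rows, so memory is O(len(b)). The default scoring
    (hypothetical, chosen for illustration) simply counts identical
    aligned descriptors.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
        prev = curr
    return prev[len(b)]


def normalized_score(query: Sequence[int], target: Sequence[int],
                     mode: str = "full_length") -> float:
    """Normalize the alignment score (assumed rules, not taken from the paper).

    full_length: divide by the longer sequence, penalizing size differences.
    containment: divide by the query length, so a query found inside a
                 larger target can still score highly.
    """
    if not query or not target:
        raise ValueError("both sequences must be non-empty")
    score = nw_score(query, target)
    if mode == "full_length":
        return score / max(len(query), len(target))
    if mode == "containment":
        return score / len(query)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical descriptor sequences: the query occurs inside the target.
query = [1, 2, 3]
target = [9, 1, 2, 3, 9]
full = normalized_score(query, target, "full_length")
contained = normalized_score(query, target, "containment")
```

In this toy case the containment normalization scores higher than the full-length one, because the query matches completely but covers only part of the target. This mirrors, in simplified form, why a choice of normalization lets the same alignment machinery serve both search types.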