Petascale Homology Search for Structure Prediction

Authors

Lee Sewon, Kim Gyuri, Karin Eli Levy, Mirdita Milot, Park Sukhwan, Chikhi Rayan, Babaian Artem, Kryshtafovych Andriy, Steinegger Martin

Affiliations

School of Biological Sciences, Seoul National University, Seoul 08826, South Korea.

ELKMO, Copenhagen 2720, Denmark.

Publication information

bioRxiv. 2023 Jul 11:2023.07.10.548308. doi: 10.1101/2023.07.10.548308.

DOI: 10.1101/2023.07.10.548308

PMID: 37503235

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10369885/

Abstract

The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.
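The accuracy figures above rest on the GDT_TS score and a cutoff of 70. As a rough illustration only, the sketch below computes a simplified GDT_TS from per-residue C-alpha distances and then the share of targets above a threshold. It assumes the model is already superimposed on the reference structure, whereas the full metric searches over superpositions. The function names `gdt_ts` and `fraction_highly_accurate` are hypothetical and are not taken from the paper or from ColabFold.

```python
from typing import Sequence

# Distance cutoffs in Angstroms that GDT_TS averages over.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(distances: Sequence[float]) -> float:
    """Simplified GDT_TS (0-100) from per-residue C-alpha distances.

    Assumption: `distances` come from one fixed superposition of model
    onto reference. The full metric maximises each fraction over many
    superpositions, so this value is a lower bound on the real score.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # Fraction of residues within each cutoff.
    fractions = [sum(d <= c for d in distances) / n for c in GDT_TS_CUTOFFS]
    # Average the four fractions and scale to 0-100.
    return 100.0 * sum(fractions) / len(GDT_TS_CUTOFFS)


def fraction_highly_accurate(scores: Sequence[float], threshold: float = 70.0) -> float:
    """Share of targets whose score is strictly above `threshold`."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(s > threshold for s in scores) / len(scores)
```

For example, `fraction_highly_accurate([80, 65, 71, 70])` counts two of four targets, because a score of exactly 70 does not pass the strict `> 70` test.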
