Li Sizhen, Zhang He, Zhang Liang, Liu Kaibo, Liu Boxiang, Mathews David H, Huang Liang
School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR.
Baidu Research, Sunnyvale, CA.
bioRxiv. 2021 Nov 15:2020.11.23.393488. doi: 10.1101/2020.11.23.393488.
The constant emergence of COVID-19 variants reduces the effectiveness of existing vaccines and test kits. Therefore, it is critical to identify conserved structures in SARS-CoV-2 genomes as potential targets for variant-proof diagnostics and therapeutics. However, the algorithms to predict these conserved structures, which simultaneously fold and align multiple RNA homologs, scale at best cubically with sequence length, and are thus infeasible for coronaviruses, which possess the longest genomes (∼30,000 nt) among RNA viruses. As a result, existing efforts on modeling SARS-CoV-2 structures resort to single-sequence folding as well as local folding methods with short window sizes, which inevitably neglect long-range interactions that are crucial in RNA functions. Here we present LinearTurboFold, an efficient algorithm for folding RNA homologs that scales linearly with sequence length, enabling unprecedented structural analysis on SARS-CoV-2. Surprisingly, on a group of SARS-CoV-2 and SARS-related genomes, LinearTurboFold's purely in silico prediction not only is close to experimentally guided models for local structures, but also goes far beyond them by capturing the end-to-end pairs between 5' and 3' UTRs (∼29,800 nt apart) that match perfectly with a purely experimental work. Furthermore, LinearTurboFold identifies novel conserved structures and conserved accessible regions as potential targets for designing efficient and mutation-insensitive small-molecule drugs, antisense oligonucleotides, siRNAs, CRISPR-Cas13 guide RNAs and RT-PCR primers. LinearTurboFold is a general technique that can also be applied to other RNA viruses and full-length genome studies, and will be a useful tool in fighting the current and future pandemics.
Conserved RNA structures are critical for designing diagnostic and therapeutic tools for many diseases including COVID-19. However, existing algorithms are much too slow to model the global structures of full-length RNA viral genomes. We present LinearTurboFold, a linear-time algorithm that is orders of magnitude faster, making it the first method to simultaneously fold and align whole genomes of SARS-CoV-2 variants, the longest known RNA virus (∼30 kilobases). Our work enables unprecedented global structural analysis and captures long-range interactions that are out of reach for existing algorithms but crucial for RNA functions. LinearTurboFold is a general technique for full-length genome studies and can help fight the current and future pandemics.
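To make concrete why window-limited local folding cannot recover long-range interactions such as the end-to-end 5'/3' UTR pairs, the following is a minimal sketch in Python. It is not part of LinearTurboFold. The function names, the toy dot-bracket structure, and the window size are all hypothetical and chosen only for illustration. The sketch parses a secondary structure written in dot-bracket notation and lists the base pairs whose span exceeds a given window, that is, the pairs a folder restricted to that span could not form.

```python
from __future__ import annotations


def parse_pairs(structure: str) -> list[tuple[int, int]]:
    """Return sorted 0-based (i, j) base pairs from a dot-bracket string."""
    stack: list[int] = []
    pairs: list[tuple[int, int]] = []
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)  # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))  # close the most recent open bracket
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def pairs_beyond_window(structure: str, window: int) -> list[tuple[int, int]]:
    """Return pairs whose span (j - i) is larger than `window`."""
    return [(i, j) for i, j in parse_pairs(structure) if j - i > window]


# Toy structure (an assumption, not real SARS-CoV-2 data): two end-to-end
# pairs enclosing long unpaired stretches and one small local hairpin.
toy = "((" + "." * 20 + "(((...)))" + "." * 20 + "))"

all_pairs = parse_pairs(toy)
long_range = pairs_beyond_window(toy, window=30)

print("all pairs:           ", all_pairs)
print("missed with window 30:", long_range)
```

In this toy case the local hairpin fits inside the window, while the two outermost pairs do not. On a real genome the same reasoning applies at scale: a pair spanning ∼29,800 nt lies far outside any short folding window.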