懒惰采样和线性采样：RNA二级结构的快速随机采样及其在SARS-CoV-2中的应用

LazySampling and LinearSampling: Fast Stochastic Sampling of RNA Secondary Structure with Applications to SARS-CoV-2.

作者信息

Zhang He, Zhang Liang, Li Sizhen, Mathews David H, Huang Liang

出版信息

bioRxiv. 2021 Nov 24:2020.12.29.424617. doi: 10.1101/2020.12.29.424617.

DOI:10.1101/2020.12.29.424617

PMID:33398265

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC7781300/

Abstract

Many RNAs fold into multiple structures at equilibrium. The classical stochastic sampling algorithm can sample secondary structures according to their probabilities in the Boltzmann ensemble, and is widely used. However, this algorithm, consisting of a bottom-up partition function phase followed by a top-down sampling phase, suffers from three limitations: (a) the formulation and implementation of the sampling phase are unnecessarily complicated; (b) the sampling phase repeatedly recalculates many redundant recursions already done during the partition function phase; (c) the partition function runtime scales cubically with the sequence length. These issues prevent stochastic sampling from being used for very long RNAs such as the full genomes of SARS-CoV-2. To address these problems, we first adopt a hypergraph framework under which the sampling algorithm can be greatly simplified. We then present three sampling algorithms under this framework, among which the LazySampling algorithm is the fastest by eliminating redundant work in the sampling phase via on-demand caching. Based on LazySampling, we further replace the cubic-time partition function by a linear-time approximate one, and derive LinearSampling, an end-to-end linear-time sampling algorithm that is orders of magnitude faster than the standard one. For instance, LinearSampling is 176Ã- faster (38.9s vs. 1.9h) than Vienna RNAsubopt on the full genome of Ebola virus (18,959 ). More importantly, LinearSampling is the first RNA structure sampling algorithm to scale up to the full-genome of SARS-CoV-2 without local window constraints, taking only 69.2 seconds on its reference sequence (29,903 ). The resulting sample correlates well with the experimentally-guided structures. On the SARS-CoV-2 genome, LinearSampling finds 23 regions of 15 with high accessibilities, which are potential targets for COVID-19 diagnostics and drug design. See code: https://github.com/LinearFold/LinearSampling.

摘要

许多RNA在平衡状态下会折叠成多种结构。经典的随机抽样算法可以根据玻尔兹曼系综中二级结构的概率对其进行抽样，并且被广泛使用。然而，该算法由一个自底向上的配分函数阶段和一个自顶向下的抽样阶段组成，存在三个局限性：（a）抽样阶段的公式化和实现不必要地复杂；（b）抽样阶段会重复重新计算许多在配分函数阶段已经完成的冗余递归；（c）配分函数的运行时间与序列长度的立方成正比。这些问题使得随机抽样无法用于非常长的RNA，如严重急性呼吸综合征冠状病毒2（SARS-CoV-2）的全基因组。为了解决这些问题，我们首先采用一种超图框架，在此框架下抽样算法可以大大简化。然后我们在此框架下提出了三种抽样算法，其中LazySampling算法通过按需缓存消除抽样阶段的冗余工作，是最快的。基于LazySampling，我们进一步用一个线性时间近似算法取代立方时间的配分函数，并推导出LinearSampling，这是一种端到端的线性时间抽样算法，比标准算法快几个数量级。例如，在埃博拉病毒全基因组（18,959个碱基）上，LinearSampling比Vienna RNAsubopt快176倍（38.9秒对1.9小时）。更重要的是，LinearSampling是第一种无需局部窗口约束就能扩展到SARS-CoV-2全基因组的RNA结构抽样算法，在其参考序列（29,9