Tang Wei Yu, Dai Ning, Zhou Tianshuo, Mathews David H, Huang Liang
School of EECS.
Dept. of Quantitative and Computational Biology, University of Southern California.
ArXiv. 2024 Dec 11:arXiv:2412.08751v1.
The task of RNA design given a target structure aims to find a sequence that can fold into that structure. It is a computationally hard problem where some version(s) have been proven to be NP-hard. As a result, heuristic methods such as local search have been popular for this task, but by only exploring a fixed number of candidates. They can not keep up with the exponential growth of the design space, and often perform poorly on longer and harder-to-design structures. We instead formulate these discrete problems as continuous optimization, which starts with a distribution over all possible candidate sequences, and uses gradient descent to improve the expectation of an objective function. We define novel distributions based on coupled variables to rule out invalid sequences given the target structure and to model the correlation between nucleotides. To make it universally applicable to any objective function, we use sampling to approximate the expected objective function, to estimate the gradient, and to select the final candidate. Compared to state-of-the-art methods, our work consistently outperforms them in key metrics such as Boltzmann probability, ensemble defect, and energy gap, especially on long and hard-to-design puzzles in the Eterna100 benchmark. Our code is available at: http://github.com/weiyutang1010/ncrna_design.
给定目标结构的RNA设计任务旨在找到一个能够折叠成该结构的序列。这是一个计算难题,其中某些版本已被证明是NP难问题。因此,诸如局部搜索之类的启发式方法在该任务中很受欢迎,但它们只探索固定数量的候选序列。它们无法跟上设计空间的指数增长,并且在处理更长且更难设计的结构时通常表现不佳。相反,我们将这些离散问题表述为连续优化问题,该问题从所有可能候选序列的分布开始,并使用梯度下降来改进目标函数的期望值。我们基于耦合变量定义新颖的分布,以排除给定目标结构下的无效序列,并对核苷酸之间的相关性进行建模。为了使其普遍适用于任何目标函数,我们使用采样来近似期望的目标函数、估计梯度并选择最终候选序列。与现有方法相比,我们的工作在诸如玻尔兹曼概率、系综缺陷和能隙等关键指标上始终优于它们,特别是在Eterna100基准测试中的长且难设计的谜题上。我们的代码可在以下网址获取:http://github.com/weiyutang1010/ncrna_design 。