Biomedical Engineering Department, Boston University, Boston, Massachusetts, United States of America.
Biological Design Center, Boston University, Boston, Massachusetts, United States of America.
PLoS Comput Biol. 2024 Jul 24;20(7):e1012276. doi: 10.1371/journal.pcbi.1012276. eCollection 2024 Jul.
Dense arrangements of binding sites within nucleotide sequences can collectively influence downstream transcription rates or initiate biomolecular interactions. For example, natural promoter regions can harbor many overlapping transcription factor binding sites that influence the rate of transcription initiation. Despite the prevalence of overlapping binding sites in nature, rapid design of nucleotide sequences with many overlapping sites remains a challenge. Here, we show that this is an NP-hard problem, coined here as the nucleotide String Packing Problem (SPP). We then introduce a computational technique that efficiently assembles sets of DNA-protein binding sites into dense, contiguous stretches of double-stranded DNA. For the efficient design of nucleotide sequences spanning hundreds of base pairs, we reduce the SPP to an Orienteering Problem with integer distances, and then leverage modern integer linear programming solvers. Our method optimally packs sets of 20-100 binding sites into dense nucleotide arrays of 50-300 base pairs in 0.05-10 seconds. Unlike approximation algorithms or meta-heuristics, our approach finds provably optimal solutions. We demonstrate how our method can generate large sets of diverse sequences suitable for library generation, where the frequency of binding site usage across the returned sequences can be controlled by modulating the objective function. As an example, we then show how adding additional constraints, like the inclusion of sequence elements with fixed positions, allows for the design of bacterial promoters. The nucleotide string packing approach we present can accelerate the design of sequences with complex DNA-protein interactions. When used in combination with synthesis and high-throughput screening, this design strategy could help interrogate how complex binding site arrangements impact either gene expression or biomolecular mechanisms in varied cellular contexts.
核苷酸序列中结合位点的密集排列可以共同影响下游转录速率或启动生物分子相互作用。例如,天然启动子区域可以包含许多重叠的转录因子结合位点,这些结合位点影响转录起始的速率。尽管重叠结合位点在自然界中很普遍,但设计具有许多重叠位点的核苷酸序列仍然具有挑战性。在这里,我们表明这是一个 NP 难问题,我们称之为核苷酸字符串打包问题(SPP)。然后,我们引入了一种计算技术,该技术可以有效地将 DNA-蛋白质结合位点集组装成密集、连续的双链 DNA 片段。对于跨越数百个碱基对的核苷酸序列的高效设计,我们将 SPP 简化为具有整数距离的定向旅行问题,然后利用现代整数线性规划求解器。我们的方法可以在 0.05-10 秒内将 20-100 个结合位点集高效地打包到 50-300 个碱基对的密集核苷酸阵列中。与近似算法或启发式算法不同,我们的方法可以找到可证明的最优解。我们展示了如何使用我们的方法生成适合文库生成的大量不同序列,通过调节目标函数,可以控制返回序列中结合位点使用的频率。作为一个例子,我们展示了如何添加其他约束条件,例如包含固定位置的序列元素,从而设计细菌启动子。我们提出的核苷酸字符串打包方法可以加速具有复杂 DNA-蛋白质相互作用的序列设计。当与合成和高通量筛选结合使用时,这种设计策略可以帮助研究复杂的结合位点排列如何影响不同细胞环境中的基因表达或生物分子机制。