Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Department of Genetics, Stanford University, Stanford, CA 94305, USA.
Cell Syst. 2017 Sep 27;5(3):230-236.e5. doi: 10.1016/j.cels.2017.07.006.
Sequence libraries that cover all k-mers enable universal, unbiased measurements of binding to both oligonucleotides and peptides. While the number of k-mers grows exponentially in k, space on all experimental platforms is limited. Here, we shrink k-mer library sizes by using joker characters, which represent all characters in the alphabet simultaneously. We present the JokerCAKE (joker covering all k-mers) algorithm for generating a short sequence such that each k-mer appears at least p times with at most one joker character per k-mer. By running our algorithm on a range of parameters and alphabets, we show that JokerCAKE produces near-optimal sequences. Moreover, through comparison with data from hundreds of DNA-protein binding experiments and with new experimental results for both standard and JokerCAKE libraries, we establish that accurate binding scores can be inferred for high-affinity k-mers using JokerCAKE libraries. JokerCAKE libraries allow researchers to search a significantly larger sequence space using the same number of experimental measurements and at the same cost.
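The coverage criterion stated above (each k-mer appears at least p times, counting only windows with at most one joker character) can be illustrated with a small checker. This is a minimal sketch, not the JokerCAKE algorithm itself: it verifies a given sequence rather than generating one. The joker symbol `N` and the function names `kmer_coverage` and `covers_all` are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter
from itertools import product

JOKER = "N"  # assumed joker symbol; the abstract does not name one


def kmer_coverage(seq, k, alphabet="ACGT", joker=JOKER):
    """Count how many times each k-mer is covered by a window of seq.

    A window with no joker covers exactly one k-mer. A window with one
    joker covers one k-mer per alphabet character. Windows with two or
    more jokers are skipped, matching the "at most one joker character
    per k-mer" constraint.
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        n_jokers = window.count(joker)
        if n_jokers > 1:
            continue
        if n_jokers == 0:
            counts[window] += 1
        else:
            j = window.index(joker)
            for c in alphabet:
                counts[window[:j] + c + window[j + 1:]] += 1
    return counts


def covers_all(seq, k, p=1, alphabet="ACGT", joker=JOKER):
    """Return True if every k-mer over the alphabet is covered at least p times."""
    counts = kmer_coverage(seq, k, alphabet, joker)
    return all(counts["".join(t)] >= p for t in product(alphabet, repeat=k))
```

As a toy case over the two-letter alphabet {A, B} with k = 2, `covers_all("NBNA", 2, alphabet="AB")` holds with four characters, whereas a joker-free sequence such as `"AABBA"` needs five to cover the same four 2-mers.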