Reddy Aniketh Janardhan, Geng Xinyang, Herschl Michael H, Kolli Sathvik, Kumar Aviral, Hsu Patrick D, Levine Sergey, Ioannidis Nilah M
University of California, Berkeley.
bioRxiv. 2024 Jun 23:2024.06.23.600232. doi: 10.1101/2024.06.23.600232.
Gene therapies have the potential to treat disease by delivering therapeutic genetic cargo to disease-associated cells. One limitation to their widespread use is the lack of short regulatory sequences, or promoters, that differentially induce the expression of delivered genetic cargo in target cells, minimizing side effects in other cell types. Such cell-type-specific promoters are difficult to discover using existing methods, requiring either manual curation or access to large datasets of promoter-driven expression from both targeted and untargeted cells. Model-based optimization (MBO) has emerged as an effective method to design biological sequences in an automated manner, and has recently been used in promoter design methods. However, these methods have only been tested using large training datasets that are expensive to collect, and focus on designing promoters for markedly different cell types, overlooking the complexities associated with designing promoters for closely related cell types that share similar regulatory features. Therefore, we introduce a comprehensive framework for utilizing MBO to design promoters in a data-efficient manner, with an emphasis on discovering promoters for similar cell types. We use conservative objective models (COMs) for MBO and highlight practical considerations such as best practices for improving sequence diversity, getting estimates of model uncertainty, and choosing the optimal set of sequences for experimental validation. Using three relatively similar blood cancer cell lines (Jurkat, K562, and THP1), we show that our approach discovers many novel cell-type-specific promoters after experimentally validating the designed sequences. For K562 cells, in particular, we discover a promoter that has 75.85% higher cell-type-specificity than the best promoter from the initial dataset used to train our models.