Moussa Siba, Kilgour Michael, Jans Clara, Hernandez-Garcia Alex, Cuperlovic-Culf Miroslava, Bengio Yoshua, Simine Lena
Department of Chemistry, McGill University, 801 Sherbrooke Street West, Montreal, Quebec H3A 0B8, Canada.
Montreal Institute for Learning Algorithms, 6666 St. Urbain, #200, Montreal, Quebec H2S 3H1, Canada.
J Phys Chem B. 2023 Jan 12;127(1):62-68. doi: 10.1021/acs.jpcb.2c05660. Epub 2022 Dec 27.
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria. Relevant criteria may be, for example, the presence of specific folding motifs, binding to molecular ligands, sensing properties, and so on. Most practical approaches to aptamer design identify a small set of promising candidate sequences using high-throughput experiments (e.g., SELEX) and then optimize performance by introducing only minor modifications to the empirically found candidates. Sequences that possess the desired properties but differ drastically in chemical composition will add diversity to the search space and facilitate the discovery of useful nucleic acid aptamers. Systematic diversification protocols are needed. Here we propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity. We start by training a Potts model using the maximum entropy principle on a small set of empirically identified sequences unified by a common feature. To generate new candidate sequences with a controllable degree of diversity, we take advantage of the model's spectral feature: an "energy" bandgap separating sequences that are similar to the training set from those that are distinct. By controlling the Potts energy range that is sampled, we generate sequences that are distinct from the training set yet still likely to have the encoded features. To demonstrate performance, we apply our approach to design diverse pools of sequences with specified secondary structure motifs in 30-mer RNA and DNA aptamers.
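The energy-window idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard Potts energy E(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j) and uses plain Metropolis sampling as a stand-in for whatever sampler the paper uses. The function names, the randomly drawn parameters (standing in for a model trained by maximum entropy), and the window bounds are all hypothetical.

```python
import math
import random

ALPHABET = "ACGU"  # RNA alphabet; swap in "ACGT" for DNA


def potts_energy(seq, h, J):
    """E(s) = -sum_i h[i][s_i] - sum_{i<j} J[(i, j)][s_i][s_j]."""
    n = len(seq)
    energy = -sum(h[i][seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy -= J[(i, j)][seq[i]][seq[j]]
    return energy


def sample_energy_window(h, J, e_min, e_max, n_steps=5000,
                         temperature=1.0, seed=0):
    """Run a Metropolis chain and keep the distinct sequences it visits
    whose Potts energy lies in [e_min, e_max]."""
    rng = random.Random(seed)
    n, q = len(h), len(ALPHABET)
    seq = [rng.randrange(q) for _ in range(n)]
    energy = potts_energy(seq, h, J)
    kept = set()
    for _ in range(n_steps):
        # Propose a single-site change.
        pos = rng.randrange(n)
        proposal = list(seq)
        proposal[pos] = rng.randrange(q)
        e_new = potts_energy(proposal, h, J)
        # Metropolis rule: always accept downhill moves, accept uphill
        # moves with probability exp(-dE / T).
        if e_new <= energy or rng.random() < math.exp(-(e_new - energy) / temperature):
            seq, energy = proposal, e_new
        if e_min <= energy <= e_max:
            kept.add("".join(ALPHABET[k] for k in seq))
    return sorted(kept)


def random_parameters(n, seed=1):
    """Placeholder fields and couplings (NOT trained on any data)."""
    rng = random.Random(seed)
    q = len(ALPHABET)
    h = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(n)]
    J = {(i, j): [[rng.gauss(0, 0.3) for _ in range(q)] for _ in range(q)]
         for i in range(n) for j in range(i + 1, n)}
    return h, J


if __name__ == "__main__":
    h, J = random_parameters(10)
    # Hypothetical window; in practice it would be chosen relative to the
    # energy gap observed for the trained model.
    pool = sample_energy_window(h, J, e_min=-12.0, e_max=-6.0)
    print(len(pool), pool[:3])
```

In this sketch, narrowing the window toward the energies of the training sequences would be expected to give candidates closer to the training set, while moving it toward the gap would give more divergent ones. Where the useful range lies depends on the trained model.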