School of Computer Science, McGill University, Montreal, Quebec H3A 0G4, Canada.
Bioinformatics. 2020 Jul 1;36(Suppl_1):i353-i361. doi: 10.1093/bioinformatics/btaa447.
Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate sequence evolution is also at the core of many benchmarking strategies. Yet, mutational processes have complex context dependencies that remain poorly modeled and understood.
We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence long short-term memory model trained to predict mutation probabilities at each position of a given sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate mammalian and plant DNA sequence evolution and reveals unexpectedly strong long-range context dependencies in mutation probabilities. EvoLSTM brings modern machine-learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes.
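The idea of context-dependent simulation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function `toy_mutation_probs` is a hypothetical stand-in for a trained model, its rates are invented for illustration, and reading "14 flanking nucleotides" as 7 on each side is an assumption. The sketch also covers substitutions only and samples each position independently of the others.

```python
import random

ALPHABET = "ACGT"
FLANK = 7  # assumption: 14 flanking nucleotides read as 7 on each side


def toy_mutation_probs(context):
    """Hypothetical stand-in for a trained model.

    Takes a window of 2 * FLANK + 1 characters and returns a probability
    distribution over the descendant base at the window's centre.
    The rates below are made up for illustration only.
    """
    centre = context[FLANK]
    # Toy context rule: a C immediately followed by G mutates more often.
    is_cpg = centre == "C" and context[FLANK + 1] == "G"
    rate = 0.30 if is_cpg else 0.02
    probs = {base: rate / 3 for base in ALPHABET}
    probs[centre] = 1.0 - rate
    return probs


def simulate(ancestor, predict=toy_mutation_probs, rng=None):
    """Sample a descendant sequence one position at a time.

    Each position is drawn from the distribution that `predict` returns
    for the window centred on that position. Sequence ends are padded
    with 'N' so every position has a full-length window.
    """
    rng = rng or random.Random(0)
    padded = "N" * FLANK + ancestor + "N" * FLANK
    descendant = []
    for i in range(len(ancestor)):
        context = padded[i : i + 2 * FLANK + 1]
        probs = predict(context)
        bases = list(probs)
        weights = [probs[b] for b in bases]
        descendant.append(rng.choices(bases, weights=weights)[0])
    return "".join(descendant)


if __name__ == "__main__":
    print(simulate("ACGTTACGCGATTACA"))
```

Passing a different `predict` callable, for example a wrapper around a trained network, would leave the sampling loop unchanged; only the source of the per-position probabilities differs.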
Code and dataset are available at https://github.com/DongjoonLim/EvoLSTM.
Supplementary data are available at Bioinformatics online.