Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea.
Bioinformatics. 2020 Apr 1;36(7):2047-2052. doi: 10.1093/bioinformatics/btz873.
Accurate prediction of the effects of genetic variation is a major goal in biological research. Towards this goal, numerous machine learning models have been developed to learn information from evolutionary sequence data. The most effective method so far is a deep generative model based on the variational autoencoder (VAE) that models the distributions using a latent variable. In this study, we propose a deep autoregressive generative model named mutationTCN, which employs dilated causal convolutions and an attention mechanism to model inter-residue correlations in a biological sequence.
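As an illustration of the dilated causal convolution named above, the following is a minimal NumPy sketch, not the mutationTCN implementation. The function names `causal_dilated_conv1d` and `receptive_field`, the array shapes, and the kernel size and dilation schedule in the usage note are assumptions chosen for the example. The input is padded on the left only, so the output at position t depends only on positions up to t, which is the property an autoregressive model over a sequence requires.

```python
import numpy as np


def causal_dilated_conv1d(x, w, dilation):
    """Apply one causal dilated 1D convolution.

    x: array of shape (length, in_channels), e.g. one-hot encoded residues.
    w: array of shape (kernel_size, in_channels, out_channels).
    Returns an array of shape (length, out_channels).
    """
    length, in_channels = x.shape
    kernel_size, _, out_channels = w.shape
    pad = (kernel_size - 1) * dilation
    # Pad on the left only, so position t never sees positions after t.
    x_padded = np.concatenate([np.zeros((pad, in_channels)), x], axis=0)
    y = np.zeros((length, out_channels))
    for k in range(kernel_size):
        # Tap k looks back (kernel_size - 1 - k) * dilation positions.
        start = k * dilation
        y += x_padded[start:start + length] @ w[k]
    return y


def receptive_field(kernel_size, dilations):
    """Number of input positions seen by one output of a stack of layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With a hypothetical kernel size of 2 and dilations 1, 2 and 4, `receptive_field(2, [1, 2, 4])` covers 8 positions, which shows how stacking layers with growing dilation widens the context without adding many layers.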
We show that this model is competitive with the VAE model when tested against a set of 42 high-throughput mutation scan experiments, with a mean improvement in Spearman rank correlation of ∼0.023. In particular, compared with the latent variable model, our model can more efficiently capture information from multiple sequence alignments with a lower effective number of sequences, such as those of viral sequence families. We also extend this architecture to a semi-supervised learning framework, which shows high prediction accuracy. Finally, we show that our model enables direct optimization of the data likelihood and allows for a simple and stable training process.
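The evaluation metric above, Spearman rank correlation between predicted and experimentally measured mutation effects, can be sketched as follows. This is a minimal NumPy version for illustration, not the authors' evaluation script. The function names `average_ranks` and `spearman` are assumptions, and any scores passed to them stand in for real model outputs and assay measurements.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(predicted, measured):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rank_pred = average_ranks(predicted)
    rank_meas = average_ranks(measured)
    return float(np.corrcoef(rank_pred, rank_meas)[0, 1])
```

Because only ranks matter, a predictor is scored on whether it orders mutations correctly rather than on the scale of its scores. A mean improvement such as the one reported would then be the average, over the experiments, of the per-experiment difference between two models' correlations.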
Source code is available at https://github.com/ha01994/mutationTCN.
Supplementary data are available at Bioinformatics online.