Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
Program in Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
Nat Methods. 2018 Oct;15(10):816-822. doi: 10.1038/s41592-018-0138-4. Epub 2018 Sep 24.
The functions of proteins and RNAs are defined by the collective interactions of many residues, and yet most statistical models of biological sequences consider sites nearly independently. Recent approaches have demonstrated benefits of including interactions to capture pairwise covariation, but leave higher-order dependencies out of reach. Here we show how it is possible to capture higher-order, context-dependent constraints in biological sequences via latent variable models with nonlinear dependencies. We found that DeepSequence (https://github.com/debbiemarkslab/DeepSequence), a probabilistic model for sequence families, predicted the effects of mutations across a variety of deep mutational scanning experiments substantially better than existing methods based on the same evolutionary data. The model, learned in an unsupervised manner solely on the basis of sequence information, is grounded with biologically motivated priors, reveals the latent organization of sequence families, and can be used to explore new parts of sequence space.
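The idea of predicting mutation effects from a probabilistic sequence model can be illustrated with a deliberately simplified sketch. One common way to score a mutation is the log-ratio of the probabilities the model assigns to the mutant and wild-type sequences. The code below is not DeepSequence: it assumes a site-independent model (the kind of baseline the abstract contrasts against) in place of the nonlinear latent variable model, so that it stays self-contained. The toy alignment, the function names, and the pseudocount are all hypothetical.

```python
import math
from collections import Counter

# The 20 standard amino acids; gaps and ambiguous symbols are not handled here.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def fit_site_independent(alignment, pseudocount=1.0):
    """Estimate per-column log-probabilities with additive smoothing.

    Assumes every sequence has the same length and uses only ALPHABET symbols.
    """
    length = len(alignment[0])
    total = len(alignment) + pseudocount * len(ALPHABET)
    model = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        model.append(
            {aa: math.log((counts.get(aa, 0) + pseudocount) / total) for aa in ALPHABET}
        )
    return model


def log_prob(model, seq):
    """Log-probability of a sequence under the site-independent model."""
    return sum(column[aa] for column, aa in zip(model, seq))


def mutation_effect(model, wild_type, mutant):
    """Log-ratio score: negative means the mutant is less probable than wild type."""
    return log_prob(model, mutant) - log_prob(model, wild_type)


# Hypothetical toy alignment of a four-residue family.
alignment = ["MKVL", "MKIL", "MRVL", "MKVL"]
model = fit_site_independent(alignment)
wild_type = "MKVL"

for mutant in ["MKIL", "MKVW"]:
    print(mutant, round(mutation_effect(model, wild_type, mutant), 3))
```

Because each column is scored on its own, this sketch cannot express the context-dependent, higher-order constraints the abstract describes. A latent variable model would replace `log_prob` with an estimate of the sequence log-probability that depends on all positions jointly.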