Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA.
Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA.
Nat Commun. 2021 Nov 2;12(1):6302. doi: 10.1038/s41467-021-26529-9.
Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave it unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns that arise from epistasis in natural sequences. Here, we develop a set of sequence statistics to assess the "generative capacity" of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model's generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE's generative capacity lies between those of the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy that emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.
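To make the idea of comparing higher-order mutational statistics concrete, the following is a minimal sketch and not the metric defined in the paper. It assumes two aligned sequence sets of equal length (one natural, one sampled from a model), picks random sets of positions, and correlates the frequencies of the subsequence "words" seen at those positions in each set. The function names (`word_frequencies`, `pearson`, `marginal_agreement`) and the parameters `order`, `n_sets`, and `seed` are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter
from math import sqrt


def word_frequencies(seqs, positions):
    """Frequency of each subsequence 'word' found at the given positions."""
    counts = Counter(tuple(s[p] for p in positions) for s in seqs)
    total = len(seqs)
    return {word: c / total for word, c in counts.items()}


def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def marginal_agreement(ref_seqs, model_seqs, order=3, n_sets=100, seed=0):
    """Average correlation of order-n word frequencies over random position sets.

    Assumption: all sequences are aligned and share the same length.
    """
    rng = random.Random(seed)
    length = len(ref_seqs[0])
    scores = []
    for _ in range(n_sets):
        positions = sorted(rng.sample(range(length), order))
        f_ref = word_frequencies(ref_seqs, positions)
        f_mod = word_frequencies(model_seqs, positions)
        # Compare over every word seen in either set; unseen words count as 0.
        words = sorted(set(f_ref) | set(f_mod))
        r = pearson([f_ref.get(w, 0.0) for w in words],
                    [f_mod.get(w, 0.0) for w in words])
        if r == r:  # skip NaN results from constant frequency vectors
            scores.append(r)
    return sum(scores) / len(scores) if scores else float("nan")
```

Under these assumptions, a model that reproduces the natural word frequencies at a given `order` scores close to 1, and the score would be expected to fall as `order` grows for models that miss higher-order covariation. A realistic evaluation would also need sequence reweighting and a control for finite-sampling error, both of which this sketch omits.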