Department of Physics, University of Florida, Gainesville, Florida, United States of America.
Department of Biomedical Engineering, Yale University, New Haven, Connecticut, United States of America.
PLoS Comput Biol. 2023 Nov 27;19(11):e1011655. doi: 10.1371/journal.pcbi.1011655. eCollection 2023 Nov.
Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face obstacles related to inference, accuracy, and overfitting when modeling moderately sized to large proteins and/or protein families with low sequence coverage. Here, we present a simple-to-learn, tunable, and accurate generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences. GENERALIST accurately captures several higher-order summary statistics of amino acid covariation. GENERALIST also predicts conservative, locally optimal sequences that are likely to fold into stable 3D structures. Importantly, unlike current methods, the density of sequences in GENERALIST-modeled sequence ensembles closely resembles that of the corresponding natural ensembles. Finally, GENERALIST embeds protein sequences in an informative latent space. GENERALIST will be an important tool for studying protein sequence variability.