Learning protein fitness models from evolutionary and assay-labeled data.

Affiliations

Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA.

Center for Computational Biology, University of California, Berkeley, USA.

Publication information

Nat Biotechnol. 2022 Jul;40(7):1114-1122. doi: 10.1038/s41587-021-01146-5. Epub 2022 Jan 17.

Abstract

Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
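
The combination described in the abstract can be illustrated with a minimal sketch. It assumes that the site-specific amino acid features are a one-hot encoding of each position, and that the probability density feature is a log-likelihood score already computed by some evolutionary density model. The function names, the toy sequences, the density scores, the fitness labels, and the penalty values below are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seqs):
    """Site-specific features: one indicator per (position, amino acid) pair."""
    length = len(seqs[0])
    X = np.zeros((len(seqs), length * len(AMINO_ACIDS)))
    for n, seq in enumerate(seqs):
        for pos, aa in enumerate(seq):
            X[n, pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return X


def augment(seqs, density_scores):
    """Append the single density feature as the last column."""
    scores = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([one_hot(seqs), scores])


def fit_ridge(X, y, penalties):
    """Closed-form ridge regression with a per-feature penalty.

    The intercept is left unpenalized by centering X and y first.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + np.diag(penalties), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


# Toy data (entirely made up): variants, density scores, measured fitness.
seqs = ["ACDA", "ACDG", "GCDA", "ACEA", "GCEG"]
density = [-1.0, -2.5, -1.8, -3.0, -4.2]   # hypothetical log-likelihoods
y = np.array([1.0, 0.4, 0.7, 0.2, -0.5])   # hypothetical assay labels

X = augment(seqs, density)

# Assumption: penalize the density coefficient more weakly than the
# one-hot coefficients, so the evolutionary signal is not shrunk away.
penalties = np.full(X.shape[1], 1.0)
penalties[-1] = 1e-3

w, b = fit_ridge(X, y, penalties)
preds = X @ w + b
```

With few labeled variants, the one-hot coefficients are strongly shrunk and the prediction leans mostly on the density feature; as more labels arrive, the site-specific terms can correct it. Any density model that assigns a score to a sequence could supply the last column.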
