Navigating the amino acid sequence space between functional proteins using a deep learning framework.

Author information

Bitard-Feildel Tristan

Affiliations

IBPS, CNRS, Laboratoire de Biologie Computationnelle et Quantitative, Sorbonne Université, Paris, France.

Institut des Sciences du Calcul et des Données (ISCD), Sorbonne Université, Paris, France.

Publication information

PeerJ Comput Sci. 2021 Sep 17;7:e684. doi: 10.7717/peerj-cs.684. eCollection 2021.

Abstract

MOTIVATION

Shedding light on the relationships between protein sequences and functions is a challenging task with many implications for protein evolution, the understanding of diseases, and protein design. The mapping of protein sequence space to specific functions is, however, hard to comprehend because of its complexity. Generative models help to decipher complex systems thanks to their ability to learn and recreate the specific features of data. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and to navigate uncharted areas of molecular evolution.

RESULTS

This study presents an Adversarial Auto-Encoder (AAE) approach, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions: the sulfatase, HUP, and TPP families. Clustering of the encoded sequences in the latent space computed by AAEs displays a high level of homogeneity with respect to protein sequence function. The study also reports and analyzes, for the first time, two sampling strategies based on latent space interpolation and latent space arithmetic to generate intermediate protein sequences that share sequence properties of original sequences linked to known functional properties and drawn from different families and functions. Sequences generated by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce meaningful biological sequences from an evolutionarily uncharted area of the biological sequence space. Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point to the ability of latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all, this study confirms the ability of deep learning frameworks to model biological complexity and to bring new tools to explore amino acid sequence and functional spaces.
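
The two sampling strategies named above can be illustrated with a minimal sketch. The abstract does not give the formulas, so this assumes plain linear interpolation between two latent vectors and a centroid-based form of latent arithmetic (subtract the mean of a source sub-family, add the mean of a target sub-family). The function names `interpolate`, `centroid`, and `transfer` are hypothetical, and the toy vectors stand in for the output of a trained encoder; decoding the resulting points back into amino acid sequences is not shown.

```python
from typing import List, Sequence

Vector = List[float]


def interpolate(z_start: Sequence[float], z_end: Sequence[float], steps: int) -> List[Vector]:
    """Return `steps` evenly spaced points on the segment from z_start to z_end (endpoints included)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        points.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return points


def centroid(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty group of latent vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def transfer(z_query: Sequence[float],
             source_group: Sequence[Sequence[float]],
             target_group: Sequence[Sequence[float]]) -> Vector:
    """Assumed arithmetic: z_query - centroid(source_group) + centroid(target_group)."""
    src = centroid(source_group)
    tgt = centroid(target_group)
    return [q - s + t for q, s, t in zip(z_query, src, tgt)]


# Toy 2-D latent vectors; real ones would come from an encoder (assumption).
path = interpolate([0.0, 0.0], [1.0, 2.0], steps=5)
shifted = transfer([1.0, 1.0],
                   source_group=[[0.0, 0.0], [2.0, 2.0]],
                   target_group=[[4.0, 0.0], [6.0, 2.0]])
print(path[2])   # midpoint of the interpolation path
print(shifted)   # query moved by the difference of the two centroids
```

In a full pipeline, each point in `path` or the vector `shifted` would be passed to the decoder half of the auto-encoder to obtain a candidate sequence; whether the study uses exactly this arithmetic is not stated in the abstract.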

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db26/8459775/65c505051029/peerj-cs-07-684-g001.jpg
