CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, Sorbonne Université, 4 place Jussieu, 75005 Paris, France.
Institut des Sciences du Calcul et des Données, Sorbonne Université, Paris, France.
Mol Biol Evol. 2022 Apr 10;39(4). doi: 10.1093/molbev/msac070.
Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipated. Identifying these activities is critical for a fundamental understanding of the evolution of living organisms and for biotechnological applications. ProfileView is a sequence-based computational method designed to functionally classify sets of homologous sequences. It relies on two main ideas: the use of multiple profile models whose construction explores evolutionary information in available databases, and a novel definition of a representation space in which sequences are analyzed with multiple profile models combined. ProfileView classifies protein families by enriching known functional groups with new sequences and by discovering new groups and subgroups. We validate ProfileView on seven classes of widespread proteins involved in the interaction with nucleic acids, amino acids and small molecules, and in a large variety of functions and enzymatic reactions. ProfileView agrees with the large set of functional data collected for these proteins from the literature, both in the organization into functional subgroups and in the residues that characterize the functions. In addition, ProfileView resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences for accurate experimental design and for the discovery of novel biological functions. On protein families with complex domain architecture, ProfileView functional classification reconciles domain combinations, unlike phylogenetic reconstruction.
ProfileView outperforms the functional classification approach PANTHER, the two k-mer-based methods CUPP and eCAMI, and a neural network approach based on Restricted Boltzmann Machines. It also overcomes the time complexity limitations of the latter.
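The two ideas named in the abstract (scoring sequences against several profile models, then comparing sequences in the space spanned by those scores) can be illustrated with a toy sketch. This is not ProfileView's implementation: the simple log-odds position-specific scoring matrices, the uniform background, the gap-free fixed-length sequences, the single-linkage grouping, the distance threshold, and every function name and sequence below are assumptions chosen only to make the idea concrete.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def build_profile(aligned_seqs, pseudocount=1.0):
    """Build a log-odds scoring matrix from equal-length, gap-free sequences.

    Stand-in for a real profile model; a uniform background is assumed.
    """
    length = len(aligned_seqs[0])
    counts = np.full((length, len(ALPHABET)), pseudocount)
    for seq in aligned_seqs:
        for pos, aa in enumerate(seq):
            counts[pos, INDEX[aa]] += 1.0
    freqs = counts / counts.sum(axis=1, keepdims=True)
    background = 1.0 / len(ALPHABET)
    return np.log2(freqs / background)


def score(profile, seq):
    """Sum the per-position log-odds; assumes len(seq) equals the profile length."""
    return float(sum(profile[pos, INDEX[aa]] for pos, aa in enumerate(seq)))


def embed(seqs, profiles):
    """Represent each sequence as its vector of scores against all profiles."""
    return np.array([[score(p, s) for p in profiles] for s in seqs])


def single_linkage_groups(points, threshold):
    """Merge points closer than `threshold` (Euclidean) using union-find."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if np.linalg.norm(points[i] - points[j]) < threshold:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive group ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]


# Hypothetical seed alignments for two profile models, and four query sequences.
profiles = [
    build_profile(["ACDEF", "ACDEF", "ACDEY"]),
    build_profile(["WKLMN", "WKLMN", "WKLMQ"]),
]
queries = ["ACDEF", "ACDEY", "WKLMN", "WKLMQ"]

vectors = embed(queries, profiles)           # shape: (n_sequences, n_profiles)
labels = single_linkage_groups(vectors, threshold=3.0)
print(labels)
```

In this toy setting, sequences that score similarly across the set of profiles end up close together in the score space and fall into the same group. A realistic pipeline would replace the scoring matrices with proper profile models built from database sequences and would use a richer clustering step.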