Exploring Large Protein Sequence Space through Homology- and Representation-based Hierarchical Clustering.

Author information

Chen John Z, Gall Barnabas, Pulsford Sacha B, Tokuriki Nobuhiko, Jackson Colin J

Affiliations

Research School of Chemistry, Australian National University, Canberra, Australia.

ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra, Australia.

Publication information

Mol Biol Evol. 2025 Jun 4;42(6). doi: 10.1093/molbev/msaf136.

Abstract

Exploration of protein sequence space can offer insight into protein sequence-function relationships, benefitting both basic science and industrial applications. The use of sequence similarity networks is a standard method for exploring large sequence datasets, but is currently limited when scaling to very large datasets and when viewing more than one level (hierarchy) of homology. Here, we present a sequence analysis pipeline with a number of innovations that address some limitations of traditional sequence similarity networks. First, we develop a hierarchical visualization approach that captures the full range of homologies across protein superfamilies. Second, we leverage representations embedded by protein language models as an alternative homology metric to the Basic Local Alignment Search Tool, showing that they produce comparable results when identifying isofunctional protein families. Finally, we demonstrate that unbiased representative sampling of sequences from genetic neighborhoods can be achieved through the use of HMMs or vector representations. The utility of these methods is exemplified by updating the sequence-function analysis of the FMN/F420-binding split barrel superfamily and the nuclear transport factor 2-like superfamily. We also improve the phylogenetic analysis of the FMN/F420-binding split barrel superfamily with more even and diverse sequence sampling across the superfamily. We provide our sequence exploration pipeline as publicly available code (ProteinClusterTools) and show it to be scalable to large datasets (∼445 k sequences) using desktop computers.
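The idea of viewing more than one level of homology over vector representations can be illustrated with a small sketch. This is not the ProteinClusterTools API and not the authors' algorithm: the function names are hypothetical, the toy three-dimensional vectors stand in for protein language model embeddings, and single-linkage grouping on cosine distance is an assumed, deliberately simple choice. It only shows how clustering the same vectors at several distance thresholds yields nested levels.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def single_linkage_clusters(vectors, threshold):
    """Label items so that any pair within `threshold` shares a cluster.

    This is single-linkage clustering: clusters are the connected
    components of the graph whose edges join pairs at distance <= threshold.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if cosine_distance(vectors[i], vectors[j]) <= threshold:
            parent[find(i)] = find(j)

    # Relabel roots as consecutive ids in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(vectors))]


def hierarchy(vectors, thresholds):
    """Cluster at each threshold; looser thresholds merge tighter clusters."""
    return {t: single_linkage_clusters(vectors, t) for t in sorted(thresholds)}


# Toy stand-ins for per-sequence embeddings (assumption: real embeddings
# would come from a protein language model and have many more dimensions).
embeddings = [
    [1.0, 0.0, 0.0],
    [0.95, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]

levels = hierarchy(embeddings, [0.05, 0.9, 0.99])
for threshold, labels in levels.items():
    print(threshold, labels)
```

With these toy vectors, the tightest threshold separates the most clusters and each looser threshold only merges existing ones, which is the nesting property a hierarchical view relies on. The all-pairs loop is quadratic in the number of sequences, so it is suitable for illustration only, not for datasets of the size the abstract describes.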

