Sequence similarity network reveals common ancestry of multidomain proteins.

Authors

Song Nan, Joseph Jacob M, Davis George B, Durand Dannie

Affiliation

Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America.

Publication information

PLoS Comput Biol. 2008 May 16;4(4):e1000063. doi: 10.1371/journal.pcbi.1000063.

DOI:10.1371/journal.pcbi.1000063

PMID:18475320

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2377100/

Abstract

We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. 
In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain architectures, we show that homology can be rationally defined for multidomain families with diverse architectures by considering the genomic context of the genes that encode them. Our study demonstrates the utility of mining network structure for evolutionary information, suggesting this is a fertile approach for investigating evolutionary processes in the post-genomic era.
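
The abstract does not spell out how the local network structure is scored, so the following is only a minimal sketch under an assumption: that two sequences are compared by the Pearson correlation of their similarity profiles, that is, their scores against every sequence in the network. The node names, score values, `SELF_SCORE`, and the helper functions are all hypothetical and chosen for illustration. The toy network has two families, plus one cross-family pair (`A1`, `B1`) that shares only an inserted domain.

```python
from math import sqrt

# Toy symmetric similarity scores (hypothetical values, not from the paper).
# A1, A2, A3 form one family; B1, B2, B3 form another. A1 and B1 also share
# an inserted domain, which gives them a substantial pairwise score.
NODES = ["A1", "A2", "A3", "B1", "B2", "B3"]
EDGES = {
    ("A1", "A2"): 80, ("A1", "A3"): 70, ("A2", "A3"): 75,
    ("B1", "B2"): 80, ("B1", "B3"): 70, ("B2", "B3"): 75,
    ("A1", "B1"): 60,
}
SELF_SCORE = 100  # assumed score of a sequence against itself


def similarity(a, b):
    """Look up the symmetric score; pairs without an edge score 0."""
    if a == b:
        return SELF_SCORE
    return EDGES.get((a, b), EDGES.get((b, a), 0))


def profile(x):
    """Similarity of x to every node in the network, in a fixed order."""
    return [similarity(x, n) for n in NODES]


def neighborhood_correlation(x, y):
    """Pearson correlation between the similarity profiles of x and y."""
    px, py = profile(x), profile(y)
    n = len(NODES)
    mean_x, mean_y = sum(px) / n, sum(py) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(px, py))
    var_x = sum((a - mean_x) ** 2 for a in px)
    var_y = sum((b - mean_y) ** 2 for b in py)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no neighborhood signal
    return cov / sqrt(var_x * var_y)


same_family = neighborhood_correlation("A1", "A2")
domain_only = neighborhood_correlation("A1", "B1")
print(f"A1 vs A2 (same family):        {same_family:.2f}")
print(f"A1 vs B1 (shared domain only): {domain_only:.2f}")
```

In this toy example the raw score between `A1` and `B1` is not far below the within-family scores, yet their profiles barely overlap, so the correlation separates the two cases much more clearly than the pairwise score does. Note that the sketch never consults domain architecture, which matches the abstract's claim that no explicit knowledge of it is required.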
