Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs.

Affiliation

Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia.

Publication information

Nucleic Acids Res. 2012 Mar;40(6):e44. doi: 10.1093/nar/gkr1261. Epub 2011 Dec 30.

DOI:10.1093/nar/gkr1261

PMID:22210858

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3315314/

Abstract

Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify the best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set and identify putative homologs. Second, the availability of complex genomes containing large gene families with a prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy in which the best gene assignments are identified at each iteration, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2.
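The abstract does not give implementation details for afree, so the following is only a minimal sketch of the general idea of a sort-join over k-mers, not the authors' code. It assumes that similarity is scored as the number of distinct k-mers two sequences share; the function names (`kmers`, `shared_kmer_counts`, `candidate_pairs`) and the parameters `k=3` and `min_shared=2` are hypothetical choices made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(set_a, set_b, k=3):
    """Count distinct k-mers shared by each (a, b) pair using a sort-merge join.

    set_a and set_b map sequence IDs to protein sequences (assumed input shape).
    """
    # Build (k-mer, sequence ID) records for each set and sort them by k-mer.
    recs_a = sorted((km, sid) for sid, seq in set_a.items() for km in kmers(seq, k))
    recs_b = sorted((km, sid) for sid, seq in set_b.items() for km in kmers(seq, k))

    counts = Counter()
    i = j = 0
    # Walk both sorted lists in step, as in a database merge join.
    while i < len(recs_a) and j < len(recs_b):
        ka, kb = recs_a[i][0], recs_b[j][0]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Locate the run of records carrying this k-mer on each side.
            i_end = i
            while i_end < len(recs_a) and recs_a[i_end][0] == ka:
                i_end += 1
            j_end = j
            while j_end < len(recs_b) and recs_b[j_end][0] == ka:
                j_end += 1
            # Every pairing across the two runs shares this k-mer.
            for _, a_id in recs_a[i:i_end]:
                for _, b_id in recs_b[j:j_end]:
                    counts[(a_id, b_id)] += 1
            i, j = i_end, j_end
    return counts


def candidate_pairs(set_a, set_b, k=3, min_shared=2):
    """Keep pairs sharing at least min_shared k-mers (illustrative threshold)."""
    counts = shared_kmer_counts(set_a, set_b, k)
    return {pair: n for pair, n in counts.items() if n >= min_shared}
```

Because only sequences that co-occur in a k-mer run are ever paired, unrelated sequences are never compared directly, which is the kind of saving a join-based approach is meant to provide over all-against-all alignment. A real tool would likely also normalise the counts by sequence length, which this sketch omits.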

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c092/3315314/9f8ca0ecabb2/gkr1261f1.jpg

Similar articles

EGM: encapsulated gene-by-gene matching to identify gene orthologs and homologous segments in genomes.

Bioinformatics. 2010 Sep 1;26(17):2076-84. doi: 10.1093/bioinformatics/btq339. Epub 2010 Jun 27.

DODO: an efficient orthologous genes assignment tool based on domain architectures. Domain based ortholog detection.

BMC Bioinformatics. 2010 Oct 15;11 Suppl 7(Suppl 7):S6. doi: 10.1186/1471-2105-11-S7-S6.

Primary orthologs from local sequence context.

BMC Bioinformatics. 2020 Feb 6;21(1):48. doi: 10.1186/s12859-020-3384-2.

Assignment of orthologous genes via genome rearrangement.

IEEE/ACM Trans Comput Biol Bioinform. 2005 Oct-Dec;2(4):302-15. doi: 10.1109/TCBB.2005.48.

Automatic clustering of orthologs and in-paralogs from pairwise species comparisons.

J Mol Biol. 2001 Dec 14;314(5):1041-52. doi: 10.1006/jmbi.2000.5197.

BBH-LS: an algorithm for computing positional homologs using sequence and gene context similarity.

BMC Syst Biol. 2012;6 Suppl 1(Suppl 1):S22. doi: 10.1186/1752-0509-6-S1-S22. Epub 2012 Jul 16.

MSOAR: a high-throughput ortholog assignment system based on genome rearrangement.

J Comput Biol. 2007 Nov;14(9):1160-75. doi: 10.1089/cmb.2007.0048.

MSOAR 2.0: Incorporating tandem duplications into ortholog assignment based on genome rearrangement.

BMC Bioinformatics. 2010 Jan 6;11:10. doi: 10.1186/1471-2105-11-10.

Using shared genomic synteny and shared protein functions to enhance the identification of orthologous gene pairs.

Bioinformatics. 2005 Mar;21(6):703-10. doi: 10.1093/bioinformatics/bti045. Epub 2004 Sep 30.

Cited by

Graph Theory-Based Sequence Descriptors as Remote Homology Predictors.

Biomolecules. 2019 Dec 23;10(1):26. doi: 10.3390/biom10010026.

RAFTSG: an efficient and versatile clustering software to analyses in large protein datasets.

BMC Bioinformatics. 2019 Jul 15;20(1):392. doi: 10.1186/s12859-019-2973-4.

Alternative characterizations of Fitch's xenology relation.

J Math Biol. 2019 Aug;79(3):969-986. doi: 10.1007/s00285-019-01384-x. Epub 2019 May 20.

Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers.

BMC Bioinformatics. 2018 May 3;19(1):166. doi: 10.1186/s12859-018-2148-8.

OrthoGNC: A Software for Accurate Identification of Orthologs Based on Gene Neighborhood Conservation.

Genomics Proteomics Bioinformatics. 2017 Dec;15(6):361-370. doi: 10.1016/j.gpb.2017.07.002. Epub 2017 Nov 11.

A null model for microbial diversification.

Proc Natl Acad Sci U S A. 2017 Jul 3;114(27):E5414-E5423. doi: 10.1073/pnas.1619993114. Epub 2017 Jun 19.

PATtyFams: Protein Families for the Microbial Genomes in the PATRIC Database.

Front Microbiol. 2016 Feb 8;7:118. doi: 10.3389/fmicb.2016.00118. eCollection 2016.

Genome-Wide Identification of Calcium Dependent Protein Kinase Gene Family in Plant Lineage Shows Presence of Novel D-x-D and D-E-L Motifs in EF-Hand Domain.

Front Plant Sci. 2015 Dec 24;6:1146. doi: 10.3389/fpls.2015.01146. eCollection 2015.

Orthology detection combining clustering and synteny for very large datasets.

PLoS One. 2014 Aug 19;9(8):e105015. doi: 10.1371/journal.pone.0105015. eCollection 2014.

PhyloTreePruner: A Phylogenetic Tree-Based Approach for Selection of Orthologous Sequences for Phylogenomics.

Evol Bioinform Online. 2013 Oct 29;9:429-35. doi: 10.4137/EBO.S12813. eCollection 2013.
