Suppr
超能文献

A systematic study of genome context methods: calibration, normalization and combination.

Affiliation

Artificial Intelligence Center, SRI International, Menlo Park, California, USA.

Publication information

BMC Bioinformatics. 2010 Oct 1;11:493. doi: 10.1186/1471-2105-11-493.

DOI:10.1186/1471-2105-11-493

PMID:20920312

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3247869/

Abstract

BACKGROUND

Genome context methods have been introduced in the last decade as automatic methods to predict functional relatedness between genes in a target genome using the patterns of existence and relative locations of the homologs of those genes in a set of reference genomes. Much work has been done in the application of these methods to different bioinformatics tasks, but few papers present a systematic study of the methods and their combination necessary for their optimal use.
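To make the idea concrete, here is a minimal sketch of one genome context signal, the phylogenetic profile: each target gene is represented by a presence/absence vector over the reference genomes, and gene pairs with similar vectors are predicted to be functionally related. The gene names, the toy profiles, and the choice of Jaccard similarity as the score are illustrative assumptions, not the scoring used in the paper.

```python
# Toy phylogenetic profiles: 1 if a homolog of the gene is present in the
# reference genome at that position, 0 otherwise. All values are made up.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1],
    "geneB": [1, 1, 0, 1, 0, 0],
    "geneC": [0, 0, 1, 0, 1, 0],
}


def jaccard(p, q):
    """Jaccard similarity of two presence/absence profiles (an assumed score)."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def rank_pairs(profiles):
    """Return (gene, gene, score) tuples sorted from most to least similar."""
    genes = sorted(profiles)
    pairs = [
        (g, h, jaccard(profiles[g], profiles[h]))
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    ]
    return sorted(pairs, key=lambda t: t[2], reverse=True)
```

In this toy data, geneA and geneB co-occur in most reference genomes and therefore rank first, while geneC shares no genome with either and scores zero.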

RESULTS

We present a thorough study of the four main families of genome context methods found in the literature: phylogenetic profile, gene fusion, gene cluster, and gene neighbor. We find that for most organisms the gene neighbor method outperforms the phylogenetic profile method by as much as 40% in sensitivity, being competitive with the gene cluster method at low sensitivities. Gene fusion is generally the worst performing of the four methods. A thorough exploration of the parameter space for each method is performed and results across different target organisms are presented. We propose the use of normalization procedures like those used on microarray data for the genome context scores. We show that substantial gains can be achieved from the use of a simple normalization technique. In particular, the sensitivity of the phylogenetic profile method is improved by around 25% after normalization, resulting, to our knowledge, in the best-performing phylogenetic profile system in the literature. Finally, we show results from combining the various genome context methods into a single score. When using a cross-validation procedure to train the combiners, with both original and normalized scores as input, a decision tree combiner results in gains of up to 20% with respect to the gene neighbor method. Overall, this represents a gain of around 15% over what can be considered the state of the art in this area: the four original genome context methods combined using a procedure like that used in the STRING database. Unfortunately, we find that these gains disappear when the combiner is trained only with organisms that are phylogenetically distant from the target organism.
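The abstract does not say which normalization technique is used. As a rough illustration of what normalizing genome context scores could look like, the sketch below assumes a plain z-score standardization of the pairwise scores produced by one method, which puts the outputs of different methods on a comparable scale before they are combined. The function name, the example scores, and the choice of z-scoring are all assumptions made for this example.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Standardize raw pair scores to zero mean and unit variance.

    `scores` maps a gene pair to the raw score from one genome context
    method. Z-scoring is an assumed stand-in for the paper's normalization.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All scores identical: nothing to rescale.
        return {pair: 0.0 for pair in scores}
    return {pair: (s - mu) / sigma for pair, s in scores.items()}


# Hypothetical raw scores for three gene pairs from a single method.
raw = {("geneA", "geneB"): 3.0, ("geneA", "geneC"): 1.0, ("geneB", "geneC"): 2.0}
normalized = zscore(raw)
```

After this step the scores have mean zero, so a pair's normalized value says how far it sits above or below the typical score for that method.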

CONCLUSIONS

Our experiments indicate that gene neighbor is the best individual genome context method and that gains from the combination of individual methods are very sensitive to the training data used to obtain the combiner's parameters. If adequate training data is not available, using the gene neighbor score by itself instead of a combined score might be the best choice.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/330e/3247869/0d8cfb376eab/1471-2105-11-493-1.jpg

Similar articles

Effect of reference genome selection on the performance of computational methods for genome-wide protein-protein interaction prediction.

PLoS One. 2012;7(7):e42057. doi: 10.1371/journal.pone.0042057. Epub 2012 Jul 26.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

InPrePPI: an integrated evaluation method based on genomic context for predicting protein-protein interactions in prokaryotic genomes.

BMC Bioinformatics. 2007 Oct 26;8:414. doi: 10.1186/1471-2105-8-414.

The Adaptive Evolution Database (TAED): A New Release of a Database of Phylogenetically Indexed Gene Families from Chordates.

J Mol Evol. 2017 Aug;85(1-2):46-56. doi: 10.1007/s00239-017-9806-8. Epub 2017 Aug 9.

Optimized LOWESS normalization parameter selection for DNA microarray data.

BMC Bioinformatics. 2004 Dec 9;5:194. doi: 10.1186/1471-2105-5-194.

Comparative assessment of performance and genome dependence among phylogenetic profiling methods.

BMC Bioinformatics. 2006 Sep 27;7:420. doi: 10.1186/1471-2105-7-420.

PLAZA: a comparative genomics resource to study gene and genome evolution in plants.

Plant Cell. 2009 Dec;21(12):3718-31. doi: 10.1105/tpc.109.071506. Epub 2009 Dec 29.

Cited by

Genome composition and phylogeny of microbes predict their co-occurrence in the environment.

PLoS Comput Biol. 2017 Feb 2;13(2):e1005366. doi: 10.1371/journal.pcbi.1005366. eCollection 2017 Feb.

The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases.

Nucleic Acids Res. 2016 Jan 4;44(D1):D471-80. doi: 10.1093/nar/gkv1164. Epub 2015 Nov 2.

The power of operon rearrangements for predicting functional associations.

Comput Struct Biotechnol J. 2015 Jul 2;13:402-6. doi: 10.1016/j.csbj.2015.06.002. eCollection 2015.

PLoS One. 2015 Jun 1;10(6):e0129066. doi: 10.1371/journal.pone.0129066. eCollection 2015.

Bioinformatics analysis of bacterial annexins--putative ancestral relatives of eukaryotic annexins.

PLoS One. 2014 Jan 16;9(1):e85428. doi: 10.1371/journal.pone.0085428. eCollection 2014.

Tracing evolutionary footprints to identify novel gene functional linkages.

PLoS One. 2013 Jun 25;8(6):e66817. doi: 10.1371/journal.pone.0066817. Print 2013.

Detection of genomic idiosyncrasies using fuzzy phylogenetic profiles.

PLoS One. 2013;8(1):e52854. doi: 10.1371/journal.pone.0052854. Epub 2013 Jan 14.

Experimental evidence validating the computational inference of functional associations from gene fusion events: a critical survey.

Brief Bioinform. 2014 May;15(3):443-54. doi: 10.1093/bib/bbs072. Epub 2012 Dec 5.

Effect of reference genome selection on the performance of computational methods for genome-wide protein-protein interaction prediction.

PLoS One. 2012;7(7):e42057. doi: 10.1371/journal.pone.0042057. Epub 2012 Jul 26.

The CanOE strategy: integrating genomic and metabolic contexts across multiple prokaryote genomes to find candidate genes for orphan enzymes.

PLoS Comput Biol. 2012 May;8(5):e1002540. doi: 10.1371/journal.pcbi.1002540. Epub 2012 May 31.

References

The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases.

Nucleic Acids Res. 2010 Jan;38(Database issue):D473-9. doi: 10.1093/nar/gkp875. Epub 2009 Oct 22.

Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins.

PLoS Biol. 2009 Apr 28;7(4):e96. doi: 10.1371/journal.pbio.1000096.

EcoCyc: a comprehensive view of Escherichia coli biology.

Nucleic Acids Res. 2009 Jan;37(Database issue):D464-70. doi: 10.1093/nar/gkn751. Epub 2008 Oct 30.

STRING 8--a global view on proteins and their functional interactions in 630 organisms.

Nucleic Acids Res. 2009 Jan;37(Database issue):D412-6. doi: 10.1093/nar/gkn760. Epub 2008 Oct 21.

The relative value of operon predictions.

Brief Bioinform. 2008 Sep;9(5):367-75. doi: 10.1093/bib/bbn019. Epub 2008 Apr 17.

Investigation of factors affecting prediction of protein-protein interaction networks by phylogenetic profiling.

BMC Genomics. 2007 Oct 29;8:393. doi: 10.1186/1471-2164-8-393.

The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases.

Nucleic Acids Res. 2008 Jan;36(Database issue):D623-31. doi: 10.1093/nar/gkm900. Epub 2007 Oct 27.

InPrePPI: an integrated evaluation method based on genomic context for predicting protein-protein interactions in prokaryotic genomes.

BMC Bioinformatics. 2007 Oct 26;8:414. doi: 10.1186/1471-2105-8-414.

An improved method for identifying functionally linked proteins using phylogenetic profiles.

BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S7. doi: 10.1186/1471-2105-8-S4-S7.

Inferring genome-wide functional linkages in E. coli by combining improved genome context methods: comparison with high-throughput experimental data.

Genome Res. 2007 Apr;17(4):527-35. doi: 10.1101/gr.5900607. Epub 2007 Mar 5.
