A systematic study of genome context methods: calibration, normalization and combination.

Affiliation

Artificial Intelligence Center, SRI International, Menlo Park, California, USA.

Publication information

BMC Bioinformatics. 2010 Oct 1;11:493. doi: 10.1186/1471-2105-11-493.

Abstract

BACKGROUND

Genome context methods were introduced in the last decade as automatic methods for predicting functional relatedness between genes in a target genome, using the patterns of existence and the relative locations of the homologs of those genes in a set of reference genomes. Much work has been done on applying these methods to different bioinformatics tasks, but few papers present a systematic study of the methods and of how they must be combined for optimal use.

RESULTS

We present a thorough study of the four main families of genome context methods found in the literature: phylogenetic profile, gene fusion, gene cluster, and gene neighbor. We find that for most organisms the gene neighbor method outperforms the phylogenetic profile method by as much as 40% in sensitivity, and is competitive with the gene cluster method at low sensitivities. Gene fusion is generally the worst performing of the four methods. We thoroughly explore the parameter space for each method and present results across different target organisms. We propose applying to the genome context scores normalization procedures like those used on microarray data, and we show that substantial gains can be achieved with a simple normalization technique. In particular, the sensitivity of the phylogenetic profile method improves by around 25% after normalization, resulting, to our knowledge, in the best-performing phylogenetic profile system in the literature. Finally, we show results from combining the various genome context methods into a single score. When a cross-validation procedure is used to train the combiners, with both original and normalized scores as input, a decision tree combiner yields gains of up to 20% with respect to the gene neighbor method. Overall, this represents a gain of around 15% over what can be considered the state of the art in this area: the four original genome context methods combined using a procedure like that used in the STRING database. Unfortunately, we find that these gains disappear when the combiner is trained only with organisms that are phylogenetically distant from the target organism.
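To make the idea of scoring and then normalizing a genome context method more concrete, the following is a minimal sketch for the phylogenetic profile family. The abstract does not give the scoring function or the normalization formula, so everything here is an assumption chosen for illustration: the toy presence/absence profiles, the gene names, the Jaccard similarity used as the raw score, and the per-gene z-score used as the "simple normalization". It should not be read as the procedure used in the paper.

```python
from statistics import mean, pstdev

# Hypothetical toy data: one presence/absence profile per target gene.
# 1 means a homolog of the gene exists in that reference genome.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 0],
    "geneB": [1, 1, 0, 1, 0, 1],
    "geneC": [0, 0, 1, 0, 1, 1],
    "geneD": [1, 0, 1, 1, 0, 1],
}


def jaccard(p, q):
    """Raw pair score (assumed): shared presences over total presences."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def raw_scores(profiles):
    """Score every unordered gene pair once, keyed by (gene, gene)."""
    genes = sorted(profiles)
    return {
        (g, h): jaccard(profiles[g], profiles[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


def normalize(scores):
    """Assumed normalization: z-score each pair against the score
    distribution of each of its two genes, then average the two z-scores."""
    per_gene = {}
    for (g, h), s in scores.items():
        per_gene.setdefault(g, []).append(s)
        per_gene.setdefault(h, []).append(s)
    # Guard against a zero standard deviation with a fallback of 1.0.
    stats = {g: (mean(v), pstdev(v) or 1.0) for g, v in per_gene.items()}

    def z(s, gene):
        mu, sd = stats[gene]
        return (s - mu) / sd

    return {(g, h): 0.5 * (z(s, g) + z(s, h)) for (g, h), s in scores.items()}


raw = raw_scores(profiles)
norm = normalize(raw)

# Rank candidate functionally related pairs by normalized score.
for pair in sorted(norm, key=norm.get, reverse=True):
    print(pair, round(raw[pair], 3), round(norm[pair], 3))
```

The design point the sketch tries to capture is that a raw score is interpreted relative to how each gene scores against all other genes, so that genes with unusually common or unusually rare homologs do not dominate the ranking. Whether the paper's normalization works this way is not stated in the abstract.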

CONCLUSIONS

Our experiments indicate that gene neighbor is the best individual genome context method and that gains from the combination of individual methods are very sensitive to the training data used to obtain the combiner's parameters. If adequate training data is not available, using the gene neighbor score by itself instead of a combined score might be the best choice.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/330e/3247869/0d8cfb376eab/1471-2105-11-493-1.jpg
