Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses.

Authors

Bayzid Md Shamsuzzoha, Mirarab Siavash, Boussau Bastien, Warnow Tandy

Affiliations

Department of Computer Science, University of Texas at Austin, Austin, Texas, USA.

Laboratoire de Biométrie et Biologie Évolutive, Université de Lyon, France.

Publication information

PLoS One. 2015 Jun 18;10(6):e0129183. doi: 10.1371/journal.pone.0129183. eCollection 2015.

Abstract

Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multispecies coalescent, is considered to be a dominant cause of gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multispecies coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.
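
The weighting step described in the abstract can be illustrated with a minimal sketch. It assumes the genes have already been grouped into bins and that one tree has already been re-calculated per bin, and it realizes the weighting in one simple way: by repeating each bin's tree once per gene in that bin before the list is passed to a summary method. The function name `weighted_gene_trees`, the bin identifiers, the gene names, and the Newick strings are all hypothetical and chosen for illustration; this is not the interface of the software at https://github.com/smirarab/binning.

```python
from typing import Dict, List


def weighted_gene_trees(
    bins: Dict[str, List[str]],
    bin_trees: Dict[str, str],
) -> List[str]:
    """Return one tree per original gene, weighting each bin by its size.

    bins      maps a (hypothetical) bin id to the genes placed in that bin.
    bin_trees maps the same bin id to the Newick string of the tree that
              was re-calculated from the combined genes of that bin.
    """
    weighted: List[str] = []
    for bin_id, genes in bins.items():
        tree = bin_trees[bin_id]
        # A bin holding k genes contributes k copies of its tree, so the
        # summary method sees each bin in proportion to its size.
        weighted.extend([tree] * len(genes))
    return weighted


# Toy input: two bins over four taxa (A, B, C, D); all values are made up.
bins = {
    "bin_0": ["gene_a", "gene_b", "gene_c"],
    "bin_1": ["gene_d"],
}
bin_trees = {
    "bin_0": "((A,B),(C,D));",
    "bin_1": "((A,C),(B,D));",
}

trees = weighted_gene_trees(bins, bin_trees)
for newick in trees:
    print(newick)
```

In the unweighted variant, each bin would contribute a single tree regardless of how many genes it contains; here the output list has the same length as the original set of genes.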

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/625e/4472720/5dd0895e4429/pone.0129183.g001.jpg
