Department of Statistics, University of Kentucky, 725 Rose Street, Lexington, KY 40536-0082, USA.
BMC Bioinformatics. 2012 Aug 21;13:210. doi: 10.1186/1471-2105-13-210.
The growing use of multi-locus data sets for phylogenetic reconstruction has increased the need to determine whether a set of gene trees deviates significantly from the phylogenetic patterns of other genes. Such unusual gene trees may have been influenced by other evolutionary processes such as selection, gene duplication, or horizontal gene transfer.
Motivated by this problem, we propose a nonparametric goodness-of-fit test for two empirical distributions of gene trees, and we have developed the software GeneOut to estimate a p-value for the test. Our approach maps trees into a multi-dimensional vector space and then applies support vector machines (SVMs) to measure the separation between two pre-defined sets of trees. We use a permutation test to assess the significance of the SVM separation. To demonstrate the performance of GeneOut, we applied it to the comparison of gene trees simulated within different species trees across a range of species tree depths. Applied directly to sets of simulated gene trees with large sample sizes, GeneOut was able to detect very small differences between two sets of gene trees generated under different species trees. Our statistical test can also incorporate tree reconstruction into its framework through a variety of phylogenetic optimality criteria. When applied to DNA sequence data simulated from different sets of gene trees, results in the form of receiver operating characteristic (ROC) curves indicated that GeneOut performed well in detecting differences between sets of trees with different distributions in a multi-dimensional space. Furthermore, it controlled false positive and false negative rates very well, indicating a high degree of accuracy.
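The permutation step described above can be illustrated with a minimal sketch. It is not the GeneOut implementation: it assumes the trees have already been mapped to numeric vectors, and it swaps in the Euclidean distance between the two set centroids as a stand-in for the SVM separation so that the example needs only the Python standard library. The function names (`centroid`, `separation`, `permutation_p_value`) and the parameters `n_perm` and `seed` are hypothetical.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def separation(set_a, set_b):
    """Stand-in separation statistic: Euclidean distance between centroids.

    Assumption: the abstract measures separation with an SVM; this simpler
    statistic is used here only to keep the sketch dependency-free.
    """
    return math.dist(centroid(set_a), centroid(set_b))


def permutation_p_value(set_a, set_b, n_perm=999, seed=0, stat=separation):
    """Estimate a p-value for the observed separation by shuffling labels."""
    rng = random.Random(seed)
    observed = stat(set_a, set_b)
    pooled = list(set_a) + list(set_b)
    n_a = len(set_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign the pooled vectors to two groups of the original sizes.
        rng.shuffle(pooled)
        if stat(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Usage: two hypothetical sets of tree vectors drawn around different means.
rng = random.Random(1)
trees_a = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(30)]
trees_b = [[rng.gauss(3.0, 1.0) for _ in range(3)] for _ in range(30)]
p_value = permutation_p_value(trees_a, trees_b)
```

Any other separation measure, including a margin or accuracy obtained from a trained SVM, could be passed through the `stat` argument; the permutation logic itself does not depend on how the separation is computed.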
The nonparametric nature of our statistical test provides fast and efficient analyses and makes the test applicable to any scenario in which evolutionary or other factors can lead to trees with different multi-dimensional distributions. The software GeneOut is freely available under the GNU public license.