Weyenberg Grady, Huggins Peter M, Schardl Christopher L, Howe Daniel K, Yoshida Ruriko
Department of Statistics, University of Kentucky, Lexington, KY 40536; Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213; Plant Pathology Department and Department of Veterinary Science, University of Kentucky, Lexington, KY 40546, USA.
Bioinformatics. 2014 Aug 15;30(16):2280-7. doi: 10.1093/bioinformatics/btu258. Epub 2014 Apr 24.
Although the majority of gene histories found in a clade of organisms are expected to be generated by a common process (e.g. the coalescent process), it is well known that numerous other coexisting processes (e.g. horizontal gene transfers, gene duplication and subsequent neofunctionalization) will cause some genes to exhibit a history distinct from those of the majority of genes. Such 'outlying' gene trees are considered to be biologically interesting, and identifying these genes has become an important problem in phylogenetics.
We propose and implement kdetrees, a non-parametric method for estimating distributions of phylogenetic trees, with the goal of identifying trees that are significantly different from the rest of the trees in the sample. Our method compares favorably with a similar recently published method: it reduces computational complexity by one polynomial order (to quadratic in the number of trees analyzed), and simulation studies suggest only a small penalty to classification accuracy. Application of kdetrees to a set of Apicomplexa genes identified several unreliable sequence alignments that had escaped previous detection, as well as a gene independently reported as a possible case of horizontal gene transfer. We also analyze a set of genes from Epichloë, fungi symbiotic with grasses, and successfully identify a contrived instance of paralogy.
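The general idea of scoring each tree by a kernel density estimate and flagging low-scoring trees can be illustrated with a minimal Python sketch. This is not the kdetrees implementation. It assumes that a pairwise distance matrix between trees is already available, and it uses a Gaussian kernel with a single fixed bandwidth and a Tukey-style cutoff (first quartile minus `k` times the interquartile range). All of these choices, and the names `kde_scores` and `flag_outliers`, are hypothetical. The toy "trees" are points on a line so that the distances are easy to check by hand.

```python
import math
import statistics


def kde_scores(dist, bandwidth):
    """Leave-one-out Gaussian kernel density score for each item.

    dist is a symmetric n x n matrix of pairwise distances (assumed given).
    The double loop makes the cost quadratic in the number of items.
    """
    n = len(dist)
    scores = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j != i:  # leave item i out of its own estimate
                total += math.exp(-0.5 * (dist[i][j] / bandwidth) ** 2)
        scores.append(total / (n - 1))
    return scores


def flag_outliers(scores, k=1.5):
    """Return indices whose score is below Q1 - k * IQR (assumed rule)."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    cutoff = q1 - k * (q3 - q1)
    return [i for i, s in enumerate(scores) if s < cutoff]


# Toy stand-in for a tree sample: six similar items and one distant item.
positions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 5.0]
dist = [[abs(a - b) for b in positions] for a in positions]

scores = kde_scores(dist, bandwidth=1.0)
outliers = flag_outliers(scores)
print(outliers)
```

In this toy sample only the last item (index 6) lies far from the rest, so it receives a density score near zero and is the only one flagged. With real gene trees, the distance matrix would come from a tree metric, and the bandwidth would need to be chosen with more care than the fixed value used here.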
Our method for estimating tree distributions and identifying outlying trees is implemented as the R package kdetrees and is available for download from CRAN.