Filipski Alan, Tamura Koichiro, Billing-Ross Paul, Murillo Oscar, Kumar Sudhir
BMC Genomics. 2015;16 Suppl 1(Suppl 1):S13. doi: 10.1186/1471-2164-16-S1-S13. Epub 2015 Jan 15.
A central problem in computational metagenomics is placing individual reads correctly into an existing phylogenetic tree. The reads are nucleotide sequences of varying lengths, from hundreds to thousands of bases, obtained by next-generation sequencing of DNA samples that contain a mixture of known and unknown species. Correct placement makes it straightforward to identify or classify the sequences in a sample by taxonomic position or function.
Here we propose a novel method (PhyClass), based on the Minimum Evolution (ME) phylogenetic inference criterion, for determining the appropriate phylogenetic position of each read. Given a sequence alignment for the taxa in a reference phylogenetic tree, the new approach efficiently finds the optimal placement of the unknown read in that tree without using heuristics. In short, the total branch length of the resulting tree is computed for every possible placement of the unknown read, and the placement that gives the smallest total is the best (optimal) choice. By taking advantage of computational efficiencies and mathematical formulations, we are able to find the true optimal ME placement for each read rather than an approximation to it. Using computer simulations, we assessed the accuracy of the new approach for different read lengths over a variety of data sets and phylogenetic trees. We found the accuracy of the new method to be good and comparable to that of existing Maximum Likelihood (ML) approaches.
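The exhaustive ME search described above can be illustrated with a minimal brute-force sketch. It is not the PhyClass implementation and does not reproduce its computational shortcuts. It assumes that pairwise distances are already available (the abstract starts from an alignment), that branch lengths are fitted by ordinary least squares, and that the tree is an unrooted edge list. The function names, the node label `w`, and the toy distances are all hypothetical.

```python
import itertools
import numpy as np

def leaf_side(edges, cut, leaves):
    """Leaves reachable from cut[0] once the edge `cut` is removed."""
    adj = {}
    for a, b in edges:
        if (a, b) == cut:
            continue
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen, stack = {cut[0]}, [cut[0]]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen & set(leaves)

def tree_length(edges, leaves, dist):
    """Sum of least-squares branch lengths for a fixed topology."""
    pairs = list(itertools.combinations(leaves, 2))
    sides = [leaf_side(edges, e, leaves) for e in edges]
    # An edge lies on the path between i and j iff it separates them.
    design = np.array([[float((i in s) != (j in s)) for s in sides]
                       for i, j in pairs])
    observed = np.array([dist[i][j] for i, j in pairs])
    lengths, *_ = np.linalg.lstsq(design, observed, rcond=None)
    return float(lengths.sum())

def best_placement(edges, leaves, dist, query, new_node="w"):
    """Try the query on every edge and keep the shortest resulting tree."""
    scores = {}
    for u, v in edges:
        trial = [e for e in edges if e != (u, v)]
        trial += [(u, new_node), (new_node, v), (new_node, query)]
        scores[(u, v)] = tree_length(trial, leaves + [query], dist)
    return min(scores, key=scores.get), scores

# Toy reference tree ((A,B),(C,D)) with internal nodes x and y.
edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
leaves = ["A", "B", "C", "D"]
# Made-up additive distances; the read Q was generated next to A.
raw = {("A", "B"): 2.0, ("A", "C"): 3.0, ("A", "D"): 3.0,
       ("B", "C"): 3.0, ("B", "D"): 3.0, ("C", "D"): 2.0,
       ("Q", "A"): 1.5, ("Q", "B"): 2.5, ("Q", "C"): 3.5, ("Q", "D"): 3.5}
dist = {}
for (a, b), d in raw.items():
    dist.setdefault(a, {})[b] = d
    dist.setdefault(b, {})[a] = d

best, scores = best_placement(edges, leaves, dist, "Q")
print("best edge:", best, "tree length:", round(scores[best], 3))
```

Refitting every branch from scratch for each candidate edge keeps the sketch short but is wasteful for large reference trees, which is presumably where the efficiencies mentioned in the abstract matter.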
In particular, we found that consensus assignments based on both the ME and ML approaches are more accurate than the assignments of either method alone. This holds even when the statistical support for read assignments is low, which is inevitable given that individual reads are often short and come from only one gene.
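One simple way to read "consensus assignment" is to keep a read only when both methods place it at the same position. The abstract does not define the rule, so the strict-agreement criterion, the function name, and the read and taxon labels in this sketch are assumptions made for illustration.

```python
def consensus_assignments(me_calls, ml_calls):
    """Keep reads whose ME and ML placements agree (assumed rule)."""
    return {
        read: placement
        for read, placement in me_calls.items()
        if ml_calls.get(read) == placement
    }

# Hypothetical placements for three reads.
me_calls = {"read1": "taxonA", "read2": "taxonB", "read3": "taxonC"}
ml_calls = {"read1": "taxonA", "read2": "taxonD", "read3": "taxonC"}

agreed = consensus_assignments(me_calls, ml_calls)
print(sorted(agreed))
```

Under this rule, reads on which the methods disagree are left unassigned, which trades coverage for accuracy.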