Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, Nova Scotia, B3H 4R2, Canada.
Bioinformatics. 2013 Aug 1;29(15):1858-64. doi: 10.1093/bioinformatics/btt313. Epub 2013 Jun 3.
Homology-based taxonomic assignment is impeded by differences between the unassigned read and the reference database, which force a rank-specific classification to the closest (and possibly incorrect) reference lineage. Such an assignment may be correct only to a general rank (e.g. order) and incorrect below that rank (e.g. family and genus). Algorithms such as lowest common ancestor (LCA) avoid this by varying the predicted taxonomic rank according to matches to a set of taxonomic references. However, LCA and related approaches can be conservative, especially when best matches are taxonomically widespread because of events such as lateral gene transfer (LGT).
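The conservative behaviour of LCA described above can be illustrated with a minimal sketch. It assumes each best match is given as a root-to-leaf list of taxon names; the function name `lowest_common_ancestor` and the example lineages are hypothetical illustrations, not taken from SPANNER or from any specific LCA implementation.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the root-to-leaf prefix shared by every lineage.

    The last element of the result is the deepest taxon on which all
    matches agree, which is the rank an LCA-style method would report.
    """
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks the lineages rank by rank, from the root down.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break  # the matches disagree from this rank downward
        shared.append(taxa_at_rank[0])
    return shared


# Illustrative lineages (assumed for this example only).
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Salmonella"],
]
# A single distant match, such as one that LGT might produce.
lgt_hit = ["Bacteria", "Firmicutes", "Bacilli",
           "Lactobacillales", "Streptococcaceae", "Streptococcus"]

print(lowest_common_ancestor(hits)[-1])              # Enterobacteriaceae
print(lowest_common_ancestor(hits + [lgt_hit])[-1])  # Bacteria
```

In this sketch, one taxonomically distant match is enough to pull the prediction from family level up to domain level, which is the kind of overly general assignment the abstract attributes to LCA-style methods.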
Our extension to LCA, called SPANNER (similarity profile annotater), takes the set of best homology matches for a given sequence (the LCA Profile) and compares this profile with a set of profiles inferred from taxonomic reference organisms. SPANNER provides an assignment that is less sensitive to LGT and other confounding phenomena. In a series of trials on real and artificial datasets, SPANNER outperformed LCA-style algorithms in terms of taxonomic precision and outperformed best BLAST at certain levels of taxonomic novelty in the dataset. We identify examples where LCA made an overly conservative prediction but SPANNER produced a more precise and correct prediction.
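The idea of comparing a query's match profile with reference profiles can be sketched as follows. The abstract does not specify how SPANNER builds or compares profiles, so everything here is an assumption made for illustration: a profile is taken to be the relative frequency of each taxon among the best matches, and profiles are compared by cosine similarity. The names `build_profile`, `cosine_similarity` and `closest_reference`, and the reference values, are hypothetical and do not reflect SPANNER's actual implementation.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Iterable, Mapping

Profile = Dict[str, float]


def build_profile(hit_taxa: Iterable[str]) -> Profile:
    """Assumed profile: fraction of best matches falling in each taxon."""
    counts = Counter(hit_taxa)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()} if total else {}


def cosine_similarity(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse profiles (assumed measure)."""
    dot = sum(value * b.get(taxon, 0.0) for taxon, value in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_reference(query: Profile, references: Mapping[str, Profile]) -> str:
    """Name of the reference whose profile is most similar to the query."""
    return max(references, key=lambda name: cosine_similarity(query, references[name]))


# Hypothetical reference profiles: each organism's genes mostly match its
# own family, with a minority of matches elsewhere (e.g. through LGT).
references = {
    "Escherichia": {"Enterobacteriaceae": 0.9, "Streptococcaceae": 0.1},
    "Streptococcus": {"Streptococcaceae": 0.95, "Enterobacteriaceae": 0.05},
}

# Query with eight matches in one family and two in a distant one.
query = build_profile(["Enterobacteriaceae"] * 8 + ["Streptococcaceae"] * 2)
print(closest_reference(query, references))  # Escherichia
```

Under these assumptions, the two distant matches do not force a more general prediction: the query profile as a whole still resembles the profile of one reference far more than the other.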
By using profiles of homology matches to represent patterns of genomic similarity that arise from vertical and lateral inheritance, SPANNER offers an effective compromise between taxonomic assignment based on best BLAST scores and the conservative approach of LCA and similar methods.
C++ source code and binaries are freely available at http://kiwi.cs.dal.ca/Software/SPANNER.
Supplementary data are available at Bioinformatics online.