Clemente José C, Jansson Jesper, Valiente Gabriel
Center for Information Biology and DNA Databank of Japan, National Institute of Genetics, Yata 1111, Mishima, Japan.
Pac Symp Biocomput. 2010:3-9. doi: 10.1142/9789814295291_0002.
Ambiguities in the taxonomy-dependent assignment of pyrosequencing reads are usually resolved by mapping each read to the lowest common ancestor, in a reference taxonomy, of all the sequences that match the read. This conservative approach has the drawback of mapping a read to a possibly large clade that may also contain many sequences not matching the read. A more accurate taxonomic assignment of short reads can be made by mapping each read to the node in the reference taxonomy that provides the best precision and recall. We show that, given a suffix array for the sequences in the reference taxonomy, a short read can be mapped to the node of the reference taxonomy with the best combined value of precision and recall in time linear in the size of the taxonomy subtree rooted at the lowest common ancestor of the matching sequences. An accurate taxonomic assignment of short reads can thus be made with about the same efficiency as when mapping each read to the lowest common ancestor of all matching sequences in a reference taxonomy. We demonstrate the effectiveness of our approach on several metagenomic datasets of marine and gut microbiota.
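The node selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the matching sequences and their lowest common ancestor are already known (the suffix array lookup is not shown), the "combined value" of precision and recall is taken to be the F-measure, and the taxonomy is a plain dictionary from each node to its children. The function name `best_fmeasure_node` and the toy taxonomy are hypothetical. For a node v, precision is the fraction of sequences under v that match the read, and recall is the fraction of all matching sequences that lie under v. A single post-order traversal of the subtree rooted at the lowest common ancestor computes both counts for every node, which gives the linear running time.

```python
def best_fmeasure_node(children, lca, matching):
    """Return (node, precision, recall, f) with the highest F-measure.

    children: dict mapping a node to its list of children; leaves
              (reference sequences) have no entry or an empty list.
    lca:      lowest common ancestor of the matching leaves (assumed given).
    matching: set of leaf identifiers that match the read.
    """
    total_hits = len(matching)
    leaves, hits = {}, {}   # per-node leaf count and matching-leaf count
    best = None

    # Iterative post-order traversal: each node is processed once,
    # after all of its children, so the work is linear in the subtree size.
    stack = [(lca, False)]
    while stack:
        node, children_done = stack.pop()
        kids = children.get(node, [])
        if kids and not children_done:
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
            continue
        if kids:
            leaves[node] = sum(leaves[kid] for kid in kids)
            hits[node] = sum(hits[kid] for kid in kids)
        else:
            leaves[node] = 1
            hits[node] = 1 if node in matching else 0
        if hits[node] == 0:
            continue  # a clade with no matching sequence cannot be the best
        precision = hits[node] / leaves[node]
        recall = hits[node] / total_hits
        f = 2 * precision * recall / (precision + recall)
        if best is None or f > best[3]:
            best = (node, precision, recall, f)
    return best


# Toy taxonomy (hypothetical): clade A holds 3 sequences, clade B holds 6.
taxonomy = {
    "root": ["A", "B"],
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2", "b3", "b4", "b5", "b6"],
}
# The read matches every sequence in A and one sequence in B,
# so the lowest common ancestor of the matches is "root".
node, precision, recall, f = best_fmeasure_node(
    taxonomy, "root", {"a1", "a2", "a3", "b1"}
)
print(node, precision, recall, round(f, 3))
```

In this toy case the lowest common ancestor is the root, which contains five non-matching sequences, whereas the F-measure criterion selects clade A, which covers three of the four matches and contains no non-matching sequence.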