Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA.
Genome Biol. 2018 Oct 30;19(1):165. doi: 10.1186/s13059-018-1554-6.
To determine the role of the database in taxonomic sequence classification, we examine how changes to the database over time affect k-mer-based lowest common ancestor (LCA) taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specifically adapted to large databases.
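The second finding follows from how k-mer-based LCA classification labels shared k-mers. The following is a minimal sketch of the general idea, not the classifier or pipeline used in the study: the toy taxonomy, the taxon names, the sequences, and the helper functions (`lineage`, `lca`, `build_index`, `classify`) are all hypothetical and chosen only for illustration. Each k-mer is labeled with the lowest common ancestor of every genome that contains it, so adding a second species from the same genus moves shared k-mers from a species label to a genus label.

```python
from collections import defaultdict

# Hypothetical toy taxonomy (child -> parent); names are illustrative only.
PARENT = {
    "species_A1": "genus_A",
    "species_A2": "genus_A",
    "species_B1": "genus_B",
    "genus_A": "root",
    "genus_B": "root",
    "root": None,
}


def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy tree."""
    ancestors = set(lineage(a))
    for node in lineage(b):
        if node in ancestors:
            return node
    return "root"


def build_index(genomes, k):
    """Map each k-mer to the LCA of all genomes that contain it."""
    index = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def classify(read, index, k):
    """Return the taxon whose lineage collects the most k-mer hits, or None."""
    hits = defaultdict(int)
    for i in range(len(read) - k + 1):
        taxon = index.get(read[i:i + k])
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None  # no k-mer matched: the read stays unclassified
    # Score each hit taxon by summing hits along its path to the root.
    scores = {t: sum(hits.get(a, 0) for a in lineage(t)) for t in hits}
    best = max(scores.values())
    winners = [t for t, s in scores.items() if s == best]
    # Simplifying assumption: ties are resolved by taking the LCA of the winners.
    result = winners[0]
    for t in winners[1:]:
        result = lca(result, t)
    return result


K = 4
# "Older" database: one species per genus (made-up sequences).
OLD_DB = {"species_A1": "ACGTACGGTT", "species_B1": "TTTTGGGGCC"}
# "Newer" database: a second, closely related species joins genus_A.
NEW_DB = dict(OLD_DB, species_A2="ACGTACGGAA")

old_index = build_index(OLD_DB, K)
new_index = build_index(NEW_DB, K)

read = "GTACGG"
print(classify(read, old_index, K))
print(classify(read, new_index, K))
```

In this toy setup the same read is assigned at the species level against the older database and only at the genus level against the newer one, because every k-mer in the read is now shared by two species. Real classifiers use compact k-mer indexes and full taxonomies, so this sketch only mirrors the labeling logic.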