Larsen Mette V, Cosentino Salvatore, Lukjancenko Oksana, Saputra Dhany, Rasmussen Simon, Hasman Henrik, Sicheritz-Pontén Thomas, Aarestrup Frank M, Ussery David W, Lund Ole
Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kongens Lyngby, Denmark.
J Clin Microbiol. 2014 May;52(5):1529-39. doi: 10.1128/JCM.02981-13. Epub 2014 Feb 26.
One of the first questions that arises when a prokaryotic organism of interest is encountered is what it is, that is, which species it belongs to. The 16S rRNA gene formed the basis of the first method for sequence-based taxonomy and has had a tremendous impact on the field of microbiology. Nevertheless, the method has been found to have a number of shortcomings. In the current study, we trained and benchmarked five methods for whole-genome sequence-based prokaryotic species identification on a common data set of complete genomes: (i) SpeciesFinder, which is based on the complete 16S rRNA gene; (ii) Reads2Type, which searches for species-specific 50-mers in either the 16S rRNA gene or the gyrB gene (for the Enterobacteriaceae family); (iii) the ribosomal multilocus sequence typing (rMLST) method, which samples up to 53 ribosomal genes; (iv) TaxonomyFinder, which is based on species-specific functional protein domain profiles; and finally (v) KmerFinder, which examines the number of co-occurring k-mers (substrings of k nucleotides in DNA sequence data). The performance of the methods was subsequently evaluated on three data sets of short sequence reads or draft genomes from public databases. In total, the evaluation sets comprised sequence data from more than 11,000 isolates covering 159 genera and 243 species. Our results indicate that methods that sample only chromosomal core genes have difficulty distinguishing closely related species that diverged only recently. The KmerFinder method had the highest overall accuracy and correctly identified 93% to 97% of the isolates in the evaluation sets.
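To make the idea of counting co-occurring k-mers concrete, the following is a minimal sketch in Python. It is not the KmerFinder implementation: the function names, the toy reference sequences, and the choice of k = 8 are assumptions made purely for illustration, and the abstract does not specify the value of k or the scoring scheme the tool actually uses.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in a DNA sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_species(query, references, k):
    """Rank reference species by the number of k-mers shared with the query.

    `references` maps a species name to a reference sequence (hypothetical
    toy data here; a real database would hold complete genomes).
    """
    query_kmers = kmers(query, k)
    scores = {
        name: len(query_kmers & kmers(ref, k))
        for name, ref in references.items()
    }
    # Highest number of shared k-mers first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy reference "genomes" (invented sequences, for illustration only).
references = {
    "species_A": "ATGCGTACGTTAGCCGATAGCTAGGCTA",
    "species_B": "ATGCGTTCGTAAGCCGTTAGCAAGGCTT",
}

# A query fragment taken from species_A.
query = "CGTACGTTAGCCGATAGC"

ranking = rank_species(query, references, k=8)
best_match, shared = ranking[0]
print(best_match, shared)
```

Because the score is computed over the whole sequence rather than a fixed set of marker genes, a scheme of this kind can in principle use differences anywhere in the genome, which fits the observation above that methods restricted to chromosomal core genes struggle with recently diverged species.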