School of Earth and Space Exploration, Arizona State University, Tempe, Arizona, United States of America.
PLoS One. 2013 Jul 1;8(7):e67337. doi: 10.1371/journal.pone.0067337. Print 2013.
Oligonucleotide signatures, especially tetranucleotide signatures, have been used as a method for homology binning by exploiting an organism's inherent biases towards the use of specific oligonucleotide words. Tetranucleotide signatures have been especially useful in environmental metagenomic samples, as many of these samples contain organisms from poorly classified phyla which cannot be easily identified using traditional homology methods, including NCBI BLAST. This study examines oligonucleotide signatures across 1,424 completed genomes from across the tree of life, substantially expanding upon previous work. A comprehensive analysis of word lengths from mononucleotide through nonanucleotide (1 to 9 bases) suggests that longer word lengths substantially improve the classification of DNA fragments across a range of sizes relevant to high-throughput sequencing. We find that, at present, heptanucleotide signatures represent an optimal balance between prediction accuracy and computational time for resolving taxonomy using both genomic and metagenomic fragments. We directly compare tetranucleotide and heptanucleotide word lengths for taxonomic binning of metagenome reads (tetranucleotide signatures are the current standard for oligonucleotide word usage analyses). We present evidence that heptanucleotide word lengths consistently provide more taxonomic resolving power, particularly in distinguishing between closely related organisms that are often present in metagenomic samples. This implies that longer oligonucleotide word lengths should replace tetranucleotide signatures for most analyses. Finally, we show that applying longer word lengths to metagenomic datasets leads to more accurate taxonomic binning of DNA scaffolds and has the potential to substantially improve taxonomic assignment and assembly of metagenomic data.
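To make the notion of an oligonucleotide signature concrete, the following is a minimal sketch of how a word-frequency vector of length k might be computed and used to assign a fragment to its closest reference. The function names (`kmer_signature`, `euclidean`, `nearest_reference`), the overlapping-window counting, and the choice of Euclidean distance are illustrative assumptions; the abstract does not specify the classifier used in the study, and this sketch is not a reproduction of it.

```python
from collections import Counter
from itertools import product
import math


def kmer_signature(seq, k):
    """Return normalized frequencies of all 4**k words over A, C, G, T.

    Words containing any other character (e.g. N) are ignored.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed word order so that signatures are comparable across sequences.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def euclidean(a, b):
    """Euclidean distance between two equal-length signatures (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_reference(fragment, references, k):
    """Return the name of the reference whose signature is closest to the fragment's.

    `references` maps a label to a sequence string.
    """
    frag_sig = kmer_signature(fragment, k)
    return min(
        references,
        key=lambda name: euclidean(frag_sig, kmer_signature(references[name], k)),
    )
```

With k = 4 the signature has 256 entries and with k = 7 it has 16,384, which illustrates the trade-off between resolving power and computational cost that the abstract describes. In practice the reference signatures would be precomputed once rather than recalculated for every fragment.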