Universitat Autònoma de Barcelona, Cerdanyola del Vallès, Spain.
CREAF, Cerdanyola del Vallès, Spain.
PLoS One. 2022 Oct 25;17(10):e0275790. doi: 10.1371/journal.pone.0275790. eCollection 2022.
High-throughput sequencing of short DNA reads from many species has been widely applied in biodiversity studies, either as amplicon metabarcoding or as shotgun metagenomics. These reads are assigned to taxa using classifiers. However, for various reasons, the results often contain many false positives. Here we focus on reducing the false positive species attributable to the classifiers. We benchmarked two popular classifiers, BLASTn followed by MEGAN6 (BM) and Kraken2 (K2), on shotgun-sequenced artificial single-species samples of insects. To reduce the number of misclassified reads, we combined the output of the two classifiers in two different ways: (1) by keeping only the reads that were attributed to the same species by both classifiers (intersection approach); and (2) by keeping the reads assigned to a species by either classifier (union approach). In addition, we applied an analytical detection limit to further reduce the number of false positive species. As expected, both metagenomic classifiers used with default parameters generated an unacceptably high number of misidentified species (tens with BM, hundreds with K2). The false positive species were not necessarily phylogenetically close to the true species, as some of them belonged to different orders of insects. The union approach failed to reduce the number of false positives, but the intersection approach removed most of them. Adding an analytical detection limit of 0.001 further reduced the number to ca. 0.5 false positive species per sample. The misidentification of species by most classifiers undermines confidence in DNA-based methods for assessing the biodiversity of biological samples. Our approach to alleviating the problem is straightforward and significantly reduced the number of reported false positive species.
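The two combination strategies and the detection limit can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the dictionary input format (read ID mapped to species name), and the toy reads are hypothetical. It also assumes that the detection limit is applied as a minimum proportion of the retained reads per species, and that under the union approach a read on which the classifiers disagree contributes both of its assignments; the abstract does not specify either detail.

```python
from collections import Counter


def intersection(bm, k2):
    """Keep only reads that both classifiers assign to the same species."""
    return {read: sp for read, sp in bm.items() if k2.get(read) == sp}


def union(bm, k2):
    """Keep every (read, species) pair reported by either classifier.

    Assumption: a read on which the classifiers disagree yields two pairs.
    """
    return set(bm.items()) | set(k2.items())


def species_above_limit(species_per_read, limit=0.001):
    """Return species whose share of retained reads reaches the limit.

    Assumption: the limit is a proportion of all retained reads.
    """
    counts = Counter(species_per_read)
    total = sum(counts.values())
    return {sp for sp, n in counts.items() if total and n / total >= limit}


# Hypothetical per-read assignments from the two classifiers.
bm = {"r1": "Apis mellifera", "r2": "Apis mellifera",
      "r3": "Bombus terrestris", "r4": "Apis mellifera"}
k2 = {"r1": "Apis mellifera", "r2": "Apis cerana",
      "r3": "Bombus terrestris", "r5": "Drosophila melanogaster"}

agreed = intersection(bm, k2)   # reads r1 and r3 only
either = union(bm, k2)          # six (read, species) pairs

species_intersection = species_above_limit(agreed.values())
species_union = species_above_limit(sp for _, sp in either)
```

In this toy case the intersection reports two species while the union reports four, which mirrors the direction of the result described above: the union inherits the misassignments of both classifiers, whereas the intersection discards any read on which they disagree.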