Kozlov Alexey M, Zhang Jiajie, Yilmaz Pelin, Glöckner Frank Oliver, Stamatakis Alexandros
The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany
Nucleic Acids Res. 2016 Jun 20;44(11):5022-33. doi: 10.1093/nar/gkw396. Epub 2016 May 10.
Molecular sequences in public databases are mostly annotated by the submitting authors without further validation. This procedure can generate erroneous taxonomic sequence labels. Mislabeled sequences are hard to identify, and they can induce downstream errors because new sequences are typically annotated using existing ones. Furthermore, taxonomic mislabelings in reference sequence databases can bias metagenetic studies which rely on the taxonomy. Despite significant efforts to improve the quality of taxonomic annotations, the curation rate is low because of the labor-intensive manual curation process. Here, we present SATIVA, a phylogeny-aware method to automatically identify taxonomically mislabeled sequences ('mislabels') using statistical models of evolution. We use the Evolutionary Placement Algorithm (EPA) to detect and score sequences whose taxonomic annotation is not supported by the underlying phylogenetic signal, and automatically propose a corrected taxonomic classification for those. Using simulated data, we show that our method attains high accuracy for identification (96.9% sensitivity/91.7% precision) as well as correction (94.9% sensitivity/89.9% precision) of mislabels. Furthermore, an analysis of four widely used microbial 16S reference databases (Greengenes, LTP, RDP and SILVA) indicates that they currently contain between 0.2% and 2.5% mislabels. Finally, we use SATIVA to perform an in-depth evaluation of alternative taxonomies for Cyanobacteria. SATIVA is freely available at https://github.com/amkozlov/sativa.
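The sensitivity and precision figures quoted above follow the standard definitions: sensitivity is the fraction of truly mislabeled sequences that are flagged, and precision is the fraction of flagged sequences that are truly mislabeled. The following is a minimal sketch of that arithmetic on toy data. It is not part of SATIVA; the function name `identification_metrics` and the sequence IDs are hypothetical and chosen only for illustration.

```python
def identification_metrics(true_mislabels, flagged):
    """Return (sensitivity, precision) for a set of flagged sequence IDs.

    true_mislabels: IDs known to be mislabeled (e.g. from a simulation).
    flagged: IDs that a detection method reported as mislabeled.
    Both arguments are hypothetical inputs, not SATIVA output formats.
    """
    true_mislabels, flagged = set(true_mislabels), set(flagged)
    tp = len(true_mislabels & flagged)   # mislabels that were flagged
    fn = len(true_mislabels - flagged)   # mislabels that were missed
    fp = len(flagged - true_mislabels)   # correct labels flagged in error
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


# Toy example: 4 simulated mislabels, 4 flagged sequences, 3 in common.
truth = {"seq1", "seq2", "seq3", "seq4"}
flagged = {"seq1", "seq2", "seq3", "seq9"}
sens, prec = identification_metrics(truth, flagged)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In a simulation study of the kind described, `truth` would be the set of sequences whose labels were deliberately altered, which is what makes both metrics computable.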