Petersen Thomas Nordahl, Lukjancenko Oksana, Thomsen Martin Christen Frølund, Maddalena Sperotto Maria, Lund Ole, Møller Aarestrup Frank, Sicheritz-Pontén Thomas
Department of Bio and Health Informatics, Technical University of Denmark, Kongens Lyngby, Denmark.
National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark.
PLoS One. 2017 May 3;12(5):e0176469. doi: 10.1371/journal.pone.0176469. eCollection 2017.
An increasing number of species and gene identification studies rely on next-generation sequencing analysis of either single-isolate or metagenomic samples. Several methods are available for taxonomic annotation, and a previous metagenomics benchmark study showed that large numbers of false-positive species annotations are a problem unless thresholds or post-processing are applied to distinguish correct from false annotations. MGmapper is a package that processes raw next-generation sequencing data and performs reference-based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species- and strain-level resolution. An in vitro bacterial mock community sample comprising 8 genera, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. Sensitivity and precision were both 75% for strain-level annotations. In a species-level comparison, MGmapper assigned taxonomy using 84.8% of the sequence reads, compared with 70.5% for Kraken, and both methods identified all species with no false positives. Extensive read count statistics are provided as plain text and Excel sheets for both rejected and accepted taxonomy annotations. Custom databases can be used with the command-line version of MGmapper, and the complete pipeline is freely available as a Bitbucket package (https://bitbucket.org/genomicepidemiology/mgmapper). A web version (https://cge.cbs.dtu.dk/services/MGmapper) provides the basic functionality for analysis of small fastq datasets.
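The strain-level figures quoted in the abstract follow the usual definitions: sensitivity is the fraction of expected taxa that were reported, and precision is the fraction of reported taxa that were expected. The Python sketch below is only an illustration of those definitions. The function name and the strain labels are hypothetical, and the counts (9 correct calls, 3 false positives, 12 expected strains) are made up so that both values come out at 0.75; they are not taken from the study's results.

```python
def sensitivity_precision(expected, reported):
    """Return (sensitivity, precision) for a collection of reported taxa."""
    expected, reported = set(expected), set(reported)
    true_pos = len(expected & reported)  # taxa that are both expected and reported
    sensitivity = true_pos / len(expected) if expected else 0.0
    precision = true_pos / len(reported) if reported else 0.0
    return sensitivity, precision


# Hypothetical labels for illustration only; not the real mock community.
expected = {f"strain_{i}" for i in range(1, 13)}            # 12 expected strains
reported = {f"strain_{i}" for i in range(1, 10)} | {        # 9 correct calls
    "false_hit_a", "false_hit_b", "false_hit_c"             # 3 false positives
}

sens, prec = sensitivity_precision(expected, reported)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

With these illustrative sets, three expected strains are missed and three reported strains are false positives, so both ratios equal 9/12.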