Poncheewin Wasin, Hermes Gerben D A, van Dam Jesse C J, Koehorst Jasper J, Smidt Hauke, Schaap Peter J
Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Wageningen, Netherlands.
Laboratory of Microbiology, Wageningen University & Research, Wageningen, Netherlands.
Front Genet. 2020 Jan 23;10:1366. doi: 10.3389/fgene.2019.01366. eCollection 2019.
NG-Tax 2.0 is a semantic framework for FAIR high-throughput analysis and classification of marker gene amplicon sequences, including bacterial and archaeal 16S ribosomal RNA (rRNA), eukaryotic 18S rRNA and ribosomal intergenic transcribed spacer sequences. It can directly use single or merged reads, paired-end reads and unmerged paired-end reads from long-range fragments as input to generate amplicon sequence variants (ASVs). Using the RDF data model, ASVs can be automatically stored in a graph database as objects that link ASV sequences with the full data-wise and element-wise provenance, thereby achieving the level of interoperability required to use such data to its full potential. The graph database can be queried directly, allowing comparative analyses across thousands of samples, and is connected to an interactive R Shiny toolbox for analysis and visualization of (meta)data. Additionally, NG-Tax 2.0 exports an extended BIOM 1.0 (JSON) file as a starting point for further analyses by other means. The extended BIOM file is backwards compatible and adds new attribute types that record the command arguments used, the sequences of the ASVs formed, and classification confidence scores. The performance of NG-Tax 2.0 was compared with that of DADA2, run as a plugin in the QIIME 2 analysis pipeline. Fourteen 16S rRNA gene amplicon mock community samples were obtained from the literature and evaluated. Precision of NG-Tax 2.0 was significantly higher, averaging 0.95 vs 0.58 for QIIME2-DADA2, while recall was comparable, averaging 0.85 and 0.77, respectively. NG-Tax 2.0 is written in Java. The code, the ontology, a Galaxy platform implementation, the analysis toolbox, tutorials and example SPARQL queries are freely available at http://wurssb.gitlab.io/ngtax under the MIT License.
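As an illustration of the precision and recall figures reported above, the following is a minimal sketch of how such scores can be computed for one mock community sample. The abstract does not state the taxonomic level or matching rule used, so this sketch assumes simple set matching of genus names; the function name `precision_recall` and the genus lists are hypothetical and are not taken from NG-Tax 2.0 or the benchmark data.

```python
def precision_recall(expected, observed):
    """Return (precision, recall) for one mock community sample.

    Assumption: a taxon counts as a true positive when its name appears
    in both the known mock composition and the pipeline output.
    """
    expected, observed = set(expected), set(observed)
    true_pos = len(expected & observed)    # correctly detected taxa
    false_pos = len(observed - expected)   # reported but not in the mock
    false_neg = len(expected - observed)   # in the mock but not reported
    precision = true_pos / (true_pos + false_pos) if observed else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    return precision, recall


# Hypothetical example data, for illustration only.
expected_genera = {"Bacteroides", "Escherichia", "Lactobacillus",
                   "Staphylococcus", "Clostridium"}
observed_genera = {"Bacteroides", "Escherichia", "Lactobacillus",
                   "Pseudomonas"}

precision, recall = precision_recall(expected_genera, observed_genera)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Averaging these per-sample values over all samples would give summary figures of the kind quoted in the abstract, under the matching assumption stated above.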