Greenfield Paul, Tran-Dinh Nai, Midgley David
Commonwealth Scientific and Industrial Research Organisation, North Ryde, NSW, Australia.
School of Biological Sciences, Macquarie University, Australia.
PeerJ. 2019 Jan 30;6:e6174. doi: 10.7717/peerj.6174. eCollection 2019.
Whole-metagenome sequencing can be a rich source of information about the structure and function of entire metagenomic communities, but getting accurate and reliable results from these datasets can be challenging. Analysis of these datasets is founded on the mapping of sequencing reads onto known genomic regions from known organisms, but short reads will often map equally well to multiple regions, and to multiple reference organisms. Assembling metagenomic datasets prior to mapping can generate much longer and more precisely mappable sequences but the presence of closely related organisms and highly conserved regions makes metagenomic assembly challenging, and some regions of particular interest can assemble poorly. One solution to these problems is to use specialised tools, such as Kelpie, that can accurately extract and assemble full-length sequences for defined genomic regions from whole-metagenome datasets.
Kelpie is a kMer-based tool that generates full-length amplicon-like sequences from whole-metagenome datasets. It takes a pair of primer sequences and a set of metagenomic reads, and uses a combination of kMer filtering, error correction and assembly techniques to construct sets of full-length inter-primer sequences.
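The kMer filtering step mentioned above can be illustrated with a minimal sketch. This is not Kelpie's implementation: the function names `kmers` and `filter_reads`, the exact-match rule and the `max_rounds` limit are all assumptions made for illustration. The idea shown is only that reads sharing a kMer with a primer can be selected, and that their kMers can then be used to pull in further overlapping reads.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_reads(reads, seed, k, max_rounds=10):
    """Iteratively collect reads that share a k-mer with the seed.

    Hypothetical helper: each round adds the k-mers of the newly
    selected reads, so the selection extends outward from the seed
    (for example a primer) one overlapping read at a time.
    """
    known = kmers(seed, k)
    selected = []
    remaining = list(reads)
    for _ in range(max_rounds):
        # A read is a hit if any of its k-mers is already known.
        hits = [r for r in remaining if kmers(r, k) & known]
        if not hits:
            break
        selected.extend(hits)
        remaining = [r for r in remaining if r not in hits]
        for r in hits:
            known |= kmers(r, k)
    return selected


reads = ["CGTACGGTTC", "GGTTCAGGCT", "TTTTTTTTTT"]
print(filter_reads(reads, seed="ACGTAC", k=4))
```

A real tool would also need to handle sequencing errors, reverse-complement reads and a stopping condition at the second primer, none of which this sketch attempts.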
The effectiveness of Kelpie is demonstrated here through the extraction and assembly of full-length ribosomal marker gene regions, as this allows comparisons with conventional amplicon sequencing and published metagenomic benchmarks. The results show that the Kelpie-generated sequences and community profiles closely match those produced by amplicon sequencing, down to low abundance levels, and running Kelpie on the synthetic CAMI metagenomic benchmarking datasets shows similarly high levels of both precision and recall.
Kelpie can be thought of as being somewhat like an in silico PCR tool, taking a primer pair and producing the resulting 'amplicons' from a whole-metagenome dataset. Marker regions from the 16S rRNA gene were used here as an example because this allowed the overall accuracy of Kelpie to be evaluated through comparisons with other datasets, approaches and benchmarks. Kelpie is not limited to this application, though, and can be used to extract and assemble any genomic region present in a whole-metagenome dataset, as long as it is bounded by a pair of highly conserved primer sequences.
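The PCR analogy can be made concrete with a small sketch that finds the inter-primer region in a single, already assembled sequence. This only illustrates the analogy and is not how Kelpie operates, since Kelpie works from unassembled reads. The function name `in_silico_amplicon`, the exact-match search and the toy sequences are assumptions; real primers are often degenerate and would need fuzzy matching.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_amplicon(template, forward, reverse):
    """Return the region between two primers, or None if either is absent.

    Hypothetical helper: the forward primer is matched as given, and the
    reverse primer is matched as its reverse complement downstream of it.
    Only exact matches are considered.
    """
    start = template.find(forward)
    if start == -1:
        return None
    inner_start = start + len(forward)
    end = template.find(reverse_complement(reverse), inner_start)
    if end == -1:
        return None
    return template[inner_start:end]


template = "TTACGTACGGTTCAGGCTAAGCTTGG"
print(in_silico_amplicon(template, forward="ACGTAC", reverse="CTTAGC"))
```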