Riquier Sébastien, Bessiere Chloé, Guibert Benoit, Bouge Anne-Laure, Boureux Anthony, Ruffle Florence, Audoux Jérôme, Gilbert Nicolas, Xue Haoliang, Gautheret Daniel, Commes Thérèse
IRMB, University of Montpellier, INSERM, 80 rue Augustin Fliche, 34295, Montpellier, France.
SeqOne, 34000, Montpellier, France.
NAR Genom Bioinform. 2021 Jun 23;3(3):lqab058. doi: 10.1093/nargab/lqab058. eCollection 2021 Sep.
The huge body of publicly available RNA-sequencing (RNA-seq) libraries is a trove of functional information that allows the expression of known or novel transcripts to be quantified in tissues. However, transcript quantification commonly relies on alignment methods that require substantial computational resources and processing time, and so does not scale easily to large datasets. k-mer decomposition constitutes a new way to process RNA-seq data for the identification of transcriptional signatures, as k-mers can be used to quantify gene expression accurately in a less resource-consuming way. We present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers in RNA-seq datasets and quickly visualize large dataset characteristics. The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling gene expression to be measured with high accuracy in simulated datasets. KmerExploR, a direct application of Kmerator, uses a set of predictor gene-specific k-mers to infer metadata, including library protocol, sample features or contaminations, from RNA-seq datasets. KmerExploR results are visualized through a user-friendly interface. Moreover, we demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers such as mutations, gene fusions or long non-coding RNAs for human health applications.
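The idea of a gene-specific k-mer signature can be illustrated with a minimal sketch. It is not the Kmerator implementation: the function names, the toy sequences and the choice of k = 5 are assumptions made for illustration only, and the sketch ignores reverse complements, sequencing errors and any check against a full reference genome or transcriptome.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def specific_kmers(transcripts, k):
    """Map each gene to the k-mers that occur in no other gene.

    `transcripts` is a hypothetical dict {gene_name: sequence}.
    """
    owners = defaultdict(set)  # k-mer -> set of genes containing it
    for gene, seq in transcripts.items():
        for km in kmers(seq, k):
            owners[km].add(gene)
    result = {gene: set() for gene in transcripts}
    for km, genes in owners.items():
        if len(genes) == 1:  # k-mer seen in exactly one gene
            result[next(iter(genes))].add(km)
    return result


def count_in_reads(reads, signature, k):
    """Count occurrences of signature k-mers across a list of reads."""
    return sum(1 for read in reads for km in kmers(read, k) if km in signature)


# Toy data (hypothetical sequences, far shorter than real transcripts).
transcripts = {"geneA": "ACGTACGGA", "geneB": "TTACGTACC"}
signatures = specific_kmers(transcripts, k=5)
reads = ["GTACGGA", "ACGTAC"]
hits_a = count_in_reads(reads, signatures["geneA"], k=5)
hits_b = count_in_reads(reads, signatures["geneB"], k=5)
```

In this toy case the second read consists only of k-mers shared by both genes, so it contributes to neither count; only k-mers unique to one gene are counted as evidence of its expression.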