Department of Anaesthesiology, HELIOS University Hospital Wuppertal, University of Witten/Herdecke, Heusnerstr. 40, 42283 Wuppertal, Germany.
Institut für Virologie, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany.
Int J Mol Sci. 2018 Nov 21;19(11):3687. doi: 10.3390/ijms19113687.
We apply hierarchical clustering (HC) of DNA k-mer counts to multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, whereas clustering by preparation group indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as a diagnostic device. To provide a simple, readily applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicated the presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and in reads filtered for high quality (Phred > 30). We propose that hierarchical clustering of DNA k-mer counts provides a nonspecific diagnostic tool for RNAseq experiments. Further exploration is required once samples are identified as outliers in HC-derived trees.
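The idea of clustering samples by their k-mer counts can be illustrated with a minimal Python sketch. It is not the seqTools implementation or its API: the function names, the Canberra distance, the average-linkage rule, the toy reads and the choice k = 2 are all assumptions made for illustration, and the reader assumes plain, uncompressed Fastq files with four lines per record.

```python
from collections import Counter
from itertools import product


def read_fastq_sequences(path):
    """Yield read sequences one at a time (assumes plain 4-line Fastq records)."""
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 1:  # second line of each record holds the bases
                yield line.strip().upper()


def count_kmers(reads, k):
    """Count overlapping k-mers over all reads; k-mers with non-ACGT bases are skipped."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def profile(counts, k):
    """Relative k-mer frequencies in a fixed (lexicographic) k-mer order."""
    total = sum(counts.values()) or 1
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def canberra(x, y):
    """Canberra distance; positions where both values are zero contribute nothing."""
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y) if a + b > 0)


def average_linkage(names, profiles):
    """Naive agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    n = len(names)
    dist = {(i, j): canberra(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}

    def avg(a, b):
        # Mean pairwise distance between two clusters of sample indices.
        return sum(dist[min(i, j), max(i, j)] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        pairs = [(a, b) for ai, a in enumerate(clusters) for b in clusters[ai + 1:]]
        a, b = min(pairs, key=lambda p: avg(*p))
        merges.append((sorted(names[i] for i in a), sorted(names[i] for i in b), avg(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy data standing in for four Fastq files from two hypothetical groups, A and B.
samples = {
    "A1": ["ACGTACGTAC", "ACGTACGTTT"],
    "A2": ["ACGTACGTAC", "ACGTACGTAA"],
    "B1": ["GGGCCCGGGC", "GGCCGGCCGG"],
    "B2": ["GGGCCCGGGC", "GGCCGGCCGC"],
}
k = 2
names = list(samples)
profiles = [profile(count_kmers(samples[name], k), k) for name in names]
merges = average_linkage(names, profiles)
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 3))
```

For real files, `count_kmers(read_fastq_sequences(path), k)` processes one read at a time, so memory use depends on the number of distinct k-mers rather than on file size. In the resulting merge order, samples that join late or at a large distance are the candidates for closer inspection; whether they reflect an experimental effect or a batch effect has to be judged against the known experimental and preparation groups.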