Chan Chon-Kit Kenneth, Rosic Nedeljka, Lorenc Michał T, Visendi Paul, Lin Meng, Kaniewska Paulina, Ferguson Brett J, Gresshoff Peter M, Batley Jacqueline, Edwards David
School of Biological Sciences and Institute of Agriculture, The University of Western Australia, Perth, WA, 6009, Australia.
Australian Genome Research Facility, Melbourne, VIC, Australia.
Funct Integr Genomics. 2019 Mar;19(2):363-371. doi: 10.1007/s10142-018-0647-3. Epub 2018 Nov 27.
Next-generation DNA sequencing technologies, such as RNA-Seq, currently dominate genome-wide gene expression studies. The standard approach to analysing these data is to map sequence reads to a reference and count the number of reads that map to each gene. However, for many transcriptome studies a suitable reference genome is unavailable, especially for meta-transcriptome studies, which assay gene expression from mixed populations of organisms. Where a reference is unavailable, one can be generated by de novo assembly of the sequence reads. However, the high cost of generating high-coverage data for de novo assembly hinders this approach and, more importantly, accurate assembly of such data is challenging, especially for meta-transcriptome data; the resulting assemblies frequently suffer from collapsed regions or chimeric sequences. As an alternative to the standard reference mapping approach, we have developed a k-mer-based analysis pipeline (DiffKAP) to identify differentially expressed reads between RNA-Seq datasets without requiring a reference. We compared the DiffKAP approach with the traditional Tophat/Cuffdiff method using RNA-Seq data from soybean, which has a suitable reference genome. We subsequently examined differential gene expression in a coral meta-transcriptome, for which no reference is available, and validated the results using qRT-PCR. We conclude that DiffKAP is an accurate method for studying differential gene expression in complex meta-transcriptomes without requiring a reference genome.
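The general idea of comparing k-mer abundance between two read sets without a reference can be illustrated with a minimal sketch. This is not the DiffKAP implementation: the function names, the pseudocount, the library-size normalisation, the median-based read score and the log2 cutoff are all assumptions chosen for illustration, and the abstract does not specify how DiffKAP performs any of these steps.

```python
from collections import Counter
from math import log2
from statistics import median


def count_kmers(reads, k):
    """Count every k-length substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_log2_ratios(counts_a, counts_b, pseudocount=1.0):
    """Log2 ratio (B over A) of k-mer abundance, scaled by total k-mer count.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for k-mers seen in only one dataset.
    """
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    ratios = {}
    for kmer in counts_a.keys() | counts_b.keys():
        freq_a = (counts_a[kmer] + pseudocount) / total_a
        freq_b = (counts_b[kmer] + pseudocount) / total_b
        ratios[kmer] = log2(freq_b / freq_a)
    return ratios


def differential_reads(reads, ratios, k, min_abs_log2=1.0):
    """Return (read, score) pairs whose median k-mer log2 ratio passes the cutoff.

    Scoring a read by the median of its k-mer ratios is a hypothetical choice
    made here for simplicity.
    """
    hits = []
    for read in reads:
        scores = [
            ratios[read[i:i + k]]
            for i in range(len(read) - k + 1)
            if read[i:i + k] in ratios
        ]
        if scores and abs(median(scores)) >= min_abs_log2:
            hits.append((read, median(scores)))
    return hits


# Toy data: one read more abundant in B, one more abundant in A, one unchanged.
condition_a = ["ACGTACGT"] * 1 + ["TTGGCCAA"] * 9 + ["GGGGAAAA"] * 5
condition_b = ["ACGTACGT"] * 9 + ["TTGGCCAA"] * 1 + ["GGGGAAAA"] * 5

k = 4
ratios = kmer_log2_ratios(count_kmers(condition_a, k), count_kmers(condition_b, k))
for read, score in differential_reads(sorted(set(condition_a + condition_b)), ratios, k):
    print(read, round(score, 2))
```

In this toy example the read that is equally abundant in both conditions is not reported, while the other two are reported with scores of opposite sign. Real RNA-Seq data would additionally call for handling of reverse complements, sequencing errors and low-count k-mers, none of which this sketch attempts.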