Information and Computational Sciences, James Hutton Institute, Dundee, DD2 5DA, Scotland, UK.
The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, EH25 9RG, UK.
Genome Biol. 2022 Jul 7;23(1):149. doi: 10.1186/s13059-022-02711-0.
Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single-molecule long-read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation, or incomplete cDNA synthesis.
We present a new and comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3). AtRTD3 contains over 169,000 transcripts, twice as many as the best current Arabidopsis transcriptome, and includes over 1500 novel genes. Seventy-eight percent of transcripts are from Iso-seq with accurately defined splice junctions and transcription start and end sites. We develop novel methods to determine splice junctions and transcription start and end sites accurately. Mismatch profiles around splice junctions provide a powerful feature to distinguish correct splice junctions and remove false splice junctions. Stratified approaches identify high-confidence transcription start and end sites and remove fragmentary transcripts due to degradation. AtRTD3 is a major improvement over existing transcriptomes, as demonstrated by analysis of an Arabidopsis cold response RNA-seq time-series. AtRTD3 provides higher resolution of transcript expression profiling and identifies cold-induced differential transcription start and polyadenylation site usage.
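The idea of using mismatches near splice junctions as a filter can be illustrated with a minimal sketch. It is not the AtRTD3 pipeline: the abstract gives no parameters, so the 10-base window, the rule that one clean read is enough to keep a junction, and the names `Junction`, `has_clean_flanks`, and `filter_junctions` are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record: genomic coordinates of the donor and acceptor sites.
Junction = namedtuple("Junction", ["chrom", "donor", "acceptor"])


def has_clean_flanks(mismatch_positions, junction, window=10):
    """Return True if no mismatch lies within `window` bases of either splice site.

    `mismatch_positions` holds genomic coordinates of mismatches in one read
    alignment. The window size is an assumed value, not one from the source.
    """
    for pos in mismatch_positions:
        near_donor = abs(pos - junction.donor) <= window
        near_acceptor = abs(pos - junction.acceptor) <= window
        if near_donor or near_acceptor:
            return False
    return True


def filter_junctions(read_junctions, window=10):
    """Keep junctions supported by at least one read with clean flanks.

    `read_junctions` is an iterable of (Junction, mismatch_positions) pairs,
    one pair per junction observed in a read.
    """
    supported = set()
    for junction, mismatches in read_junctions:
        if has_clean_flanks(mismatches, junction, window):
            supported.add(junction)
    return supported


# Toy data: j1 has a read with no mismatches near its splice sites;
# every read supporting j2 has a mismatch within 10 bases of a splice site.
j1 = Junction("Chr1", 1000, 1200)
j2 = Junction("Chr1", 5000, 5300)
reads = [(j1, [950, 1400]), (j2, [4995]), (j2, [5302, 6000])]
kept = filter_junctions(reads, window=10)
```

In this toy input only `j1` is retained. A real analysis would derive mismatch positions from the read alignments and would need to choose the window and support threshold empirically.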
AtRTD3 is currently the most comprehensive Arabidopsis transcriptome. It improves the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage analysis from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single-molecule sequencing analysis from any species.