Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, Genetics Institute, University of Florida, Gainesville, Florida 32611, USA.
Genomics of Gene Expression Laboratory, Centro de Investigaciones Principe Felipe (CIPF), 46012 Valencia, Spain.
Genome Res. 2018 Mar 1;28(3):396-411. doi: 10.1101/gr.222976.117.
High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that a substantial number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.
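The idea of a novel transcript being a "novel combination of existing splice sites" can be illustrated with a small sketch. This is not SQANTI's implementation: the function name `classify_transcript`, the label strings, and the toy coordinates are all hypothetical, and the sketch assumes every junction lies on a single chromosome and strand and is given as a (donor, acceptor) position pair.

```python
from typing import List, Tuple

# Hypothetical representation: one splice junction as (donor, acceptor)
# positions, all on the same chromosome and strand.
Junction = Tuple[int, int]


def classify_transcript(query: List[Junction],
                        reference: List[List[Junction]]) -> str:
    """Label a transcript by how its junctions relate to a reference set.

    The labels are illustrative only, not SQANTI's category names.
    """
    if not query:
        return "mono_exonic"
    # The full junction chain matches a reference transcript exactly.
    if any(query == ref for ref in reference):
        return "known_junction_chain"
    # Collect every annotated donor and acceptor site.
    known_donors = {d for ref in reference for d, _ in ref}
    known_acceptors = {a for ref in reference for _, a in ref}
    # Every site is annotated, but the chain itself is new.
    if all(d in known_donors and a in known_acceptors for d, a in query):
        return "novel_combination_of_known_sites"
    # At least one donor or acceptor is absent from the reference.
    return "contains_novel_site"


# Toy reference annotation with two transcripts (made-up coordinates).
reference = [
    [(100, 200), (300, 400)],
    [(100, 250), (300, 400)],
]
labels = [
    classify_transcript([(100, 200), (300, 400)], reference),
    classify_transcript([(100, 400)], reference),
    classify_transcript([(100, 200), (310, 400)], reference),
]
print(labels)
```

In this toy example the second query joins an annotated donor to an annotated acceptor that are never paired in the reference, which is the kind of transcript the abstract describes as the most common novel class.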