Department of Statistics and Department of Health Research and Policy, Stanford University, Stanford, CA 94305.
Proc Natl Acad Sci U S A. 2013 Dec 10;110(50):E4821-30. doi: 10.1073/pnas.1320101110. Epub 2013 Nov 26.
Although transcriptional and posttranscriptional events are detected in RNA-Seq data from second-generation sequencing, full-length mRNA isoforms are not captured. On the other hand, third-generation sequencing, which yields much longer reads, has current limitations of lower raw accuracy and throughput. Here, we combine second-generation sequencing and third-generation sequencing with a custom-designed method for isoform identification and quantification to generate a high-confidence isoform dataset for human embryonic stem cells (hESCs). We report 8,084 RefSeq-annotated isoforms detected as full-length and an additional 5,459 isoforms predicted through statistical inference. Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified. Further characterization of the novel loci indicates that a subset is expressed in pluripotent cells but not in diverse fetal and adult tissues; moreover, their reduced expression perturbs the network of pluripotency-associated genes. Results suggest that gene identification, even in well-characterized human cell lines and tissues, is likely far from complete.