Isoform abundance inference provides a more accurate estimation of gene expression levels in RNA-seq.

Author information

Wang Xi, Wu Zhengpeng, Zhang Xuegong

Affiliations

MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing, P R China.

Publication information

J Bioinform Comput Biol. 2010 Dec;8 Suppl 1:177-92. doi: 10.1142/s0219720010005178.

Abstract

Owing to its unprecedented resolution and level of detail, RNA-seq technology based on next-generation high-throughput sequencing significantly boosts the ability to study transcriptomes. Estimating the transcript abundance of genes, that is, gene expression levels, has always been an important question in research on transcriptional regulation and gene function. Building on the concept of Reads Per Kilobase per Million reads (RPKM), there are currently two strategies for estimating gene expression levels: taking the union-intersection genes (UI-based) and summing up inferred isoform abundances (isoform-based). The two strategies produce different estimates. In this paper, we made the first attempt to compare their performance through a series of simulation studies. Our results showed that the isoform-based strategy not only gives more accurate estimates but also has less uncertainty than the UI-based strategy. When the non-uniformity of read distribution is taken into account, the isoform-based method can further reduce estimation errors. We applied both strategies to real RNA-seq datasets of technical replicates and found that the isoform-based strategy also performs better. For a more accurate estimation of gene expression levels from RNA-seq data, even if the abundance levels of isoforms are not of interest, it is still better to first infer the isoform abundances and then sum them to obtain the expression level of the gene as a whole.
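To make the two strategies concrete, the following is a minimal Python sketch, not the authors' implementation. It uses the standard RPKM definition (reads × 10^9 / (length in bp × total mapped reads)). Everything else is an assumption made for illustration: the function names, the two-isoform toy gene, the read counts, the reading of the UI region as the part of the gene shared by all isoforms, and the naive subtraction used in place of a real isoform inference method.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads (standard definition)."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def ui_based_expression(reads_in_ui_region, ui_length_bp, total_mapped_reads):
    """UI-based estimate: RPKM over the region assumed shared by all isoforms."""
    return rpkm(reads_in_ui_region, ui_length_bp, total_mapped_reads)


def isoform_based_expression(isoform_abundances):
    """Isoform-based estimate: sum of the inferred isoform abundances (RPKM)."""
    return sum(isoform_abundances)


# Hypothetical toy gene: isoform A = exon 1 + exon 2, isoform B = exon 1 only.
TOTAL_READS = 1_000_000            # assumed sequencing depth
EXON1_LEN, EXON2_LEN = 1000, 1000  # assumed exon lengths in bp
EXON1_READS, EXON2_READS = 40, 10  # assumed noise-free read counts

# UI-based: only exon 1 is common to both isoforms.
ui_estimate = ui_based_expression(EXON1_READS, EXON1_LEN, TOTAL_READS)

# Isoform-based: a naive stand-in for isoform inference.
# Exon 2 is covered only by isoform A; exon 1 is covered by A and B.
density_exon1 = rpkm(EXON1_READS, EXON1_LEN, TOTAL_READS)
density_exon2 = rpkm(EXON2_READS, EXON2_LEN, TOTAL_READS)
abundance_a = density_exon2
abundance_b = density_exon1 - density_exon2
isoform_estimate = isoform_based_expression([abundance_a, abundance_b])

print(f"UI-based gene expression:      {ui_estimate:.1f} RPKM")
print(f"Isoform-based gene expression: {isoform_estimate:.1f} RPKM")
```

In this noise-free toy case the two estimates coincide. The sketch only shows how each quantity is assembled; it does not reproduce the simulation studies or the real-data comparison described in the abstract.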
