Improving RNA-Seq expression estimation by modeling isoform- and exon-specific read sequencing rate.

Authors

Liu Xuejun, Shi Xinxin, Chen Chunlin, Zhang Li

Affiliation

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, 29 Jiangjun Rd., Nanjing, 211106, China.

Publication

BMC Bioinformatics. 2015 Oct 16;16:332. doi: 10.1186/s12859-015-0750-6.

Abstract

BACKGROUND

RNA-Seq, a high-throughput sequencing technology, has been widely used in recent years to quantify gene and isoform expression in transcriptome studies. Accurate expression measurement from the millions or billions of short reads generated is hindered by two main difficulties. The first is the ambiguous mapping of reads to the reference transcriptome caused by alternative splicing, which increases the uncertainty in estimating isoform expression. The second is the non-uniformity of the read distribution along the reference transcriptome, due to positional, sequencing, mappability and other undiscovered sources of bias. This violates the uniform read distribution assumed by many expression calculation approaches, such as the direct RPKM calculation and Poisson-based models. Many methods have been proposed to address these difficulties, and some employ latent variable models to discover the underlying pattern of read sequencing. However, most of these methods correct bias based on the surrounding sequence content and share a single bias model across all genes. They therefore cannot estimate the gene- and isoform-specific biases revealed by recent studies.
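For readers unfamiliar with the direct RPKM calculation mentioned above, the following is a minimal sketch of the standard definition (reads per kilobase of transcript per million mapped reads). The function name `rpkm` and the example numbers are illustrative assumptions, not taken from the paper. The formula treats every position of the transcript as equally likely to produce a read, which is exactly the uniformity assumption that the biases described above violate.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Hypothetical helper for illustration; it assumes reads are spread
    uniformly along the transcript and applies no bias correction.
    """
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length_bp and total_mapped_reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
example_value = rpkm(500, 2000, 10_000_000)
print(example_value)
```

Because the count is only normalized by length and sequencing depth, any position-dependent or sequence-dependent bias in read sampling carries straight through to the estimate.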

RESULTS

We propose a latent variable model, NLDMseq, to estimate gene and isoform expression. Our method uses latent variables to model the unknown isoforms from which reads originate and the underlying proportions of the multiple spliced variants. Isoform- and exon-specific read sequencing biases are modeled to account for the non-uniformity of the read distribution, and are identified by using the replicate information from multiple lanes of a single library run. We use simulated and real data to verify the accuracy of our method in calculating gene and isoform expression. The results show that NLDMseq obtains gene and isoform expression estimates that are competitive with those of popular alternatives. Finally, the proposed method is applied to the detection of differential expression (DE) to show its usefulness in downstream analysis.
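To illustrate the general idea of treating a read's isoform of origin as a latent variable, here is a minimal sketch of a generic expectation-maximization (EM) mixture estimate. It is not the NLDMseq model: it assumes uniform read sampling along each isoform and omits the isoform- and exon-specific biases and the multi-lane replicate information that the abstract describes. The function name, the compatibility-matrix input and the toy data are all assumptions made for this example.

```python
def em_isoform_fractions(compat, lengths, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    compat[n][k] is 1 if read n is compatible with isoform k, else 0.
    lengths[k] is the (effective) length of isoform k.
    Generic mixture-model sketch under a uniform-sampling assumption;
    NOT the NLDMseq algorithm.
    """
    n_iso = len(lengths)
    theta = [1.0 / n_iso] * n_iso  # start from equal fractions
    for _ in range(n_iter):
        expected = [0.0] * n_iso
        # E-step: split each read across its compatible isoforms.
        for row in compat:
            weights = [theta[k] * row[k] / lengths[k] for k in range(n_iso)]
            total = sum(weights)
            if total == 0.0:
                continue  # read compatible with no isoform: ignore it
            for k in range(n_iso):
                expected[k] += weights[k] / total
        assigned = sum(expected)
        if assigned == 0.0:
            raise ValueError("no read is compatible with any isoform")
        # M-step: new fractions are the normalized expected read counts.
        theta = [e / assigned for e in expected]
    return theta


# Toy data: two isoforms of equal length; 30 reads unique to isoform 0,
# 10 unique to isoform 1, and 60 reads that map ambiguously to both.
toy_compat = [[1, 0]] * 30 + [[0, 1]] * 10 + [[1, 1]] * 60
fractions = em_isoform_fractions(toy_compat, lengths=[1000, 1000])
print(fractions)
```

In this toy case the ambiguous reads end up divided in the same 3:1 ratio as the uniquely mapped reads. A bias-aware model would replace the uniform `1 / lengths[k]` term with a position-specific sequencing rate.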

CONCLUSIONS

The proposed NLDMseq method provides an approach to accurately estimate gene and isoform expression from RNA-Seq data by modeling the isoform- and exon-specific read sequencing biases. It uses a latent variable model to discover the hidden pattern of read sequencing. We have shown that it works well on both simulated and real datasets, with performance competitive with popular methods. The method has been implemented as freely available software, which can be found at https://github.com/PUGEA/NLDMseq.
