Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data.

Affiliations

Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, United States.

Department of Medicine, University of California San Diego, La Jolla, CA, United States.

Publication information

Sci Rep. 2019 Nov 8;9(1):16342. doi: 10.1038/s41598-019-52584-w.

Abstract

RNA-seq methods are widely used for transcriptomic profiling of biological samples. However, the technology has known caveats that can skew gene expression estimates. Specifically, if the library preparation protocol does not retain RNA strand information, some genes can be quantified erroneously. Although strand-specific protocols have been established, a significant portion of RNA-seq data is generated in a non-strand-specific manner. We used a comprehensive stranded RNA-seq dataset of 15 blood cell types to identify genes whose expression would be erroneously estimated if strand information were not available. We found that about 10% of all genes and 2.5% of protein-coding genes show a two-fold or greater difference in estimated expression when the strand information of the reads is ignored. We used read alignment parameters of these genes to construct a machine learning model that identifies which genes in an unstranded dataset are likely to have incorrect expression estimates and which are not. We also show that differential expression analysis of genes with biased expression estimates in unstranded read data can be recovered by limiting the reads considered to those that span exonic boundaries. The resulting approach is implemented as a package available at https://github.com/mikpom/uslcount.
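The two-fold criterion and the junction-read idea described in the abstract can be illustrated with a minimal Python sketch. This is not the uslcount implementation or its API. The gene names and counts are made up, the pseudocount of 1 is an assumption added to avoid division by zero, and the helper names `flag_biased_genes` and `spans_exon_junction` are hypothetical. The only external fact relied on is that the SAM CIGAR operation `N` marks a region skipped from the reference, which for RNA-seq alignments corresponds to an intron.

```python
import math

# Hypothetical per-gene read counts for the same sample, counted two ways:
# with strand information honoured, and with it ignored.
stranded = {"GENE_A": 99, "GENE_B": 9, "GENE_C": 0}
unstranded = {"GENE_A": 109, "GENE_B": 39, "GENE_C": 3}


def log2_fold_difference(stranded_count, unstranded_count, pseudocount=1.0):
    """log2 ratio of unstranded to stranded counts (pseudocount is an assumption)."""
    return math.log2((unstranded_count + pseudocount) / (stranded_count + pseudocount))


def flag_biased_genes(stranded_counts, unstranded_counts, threshold=2.0, pseudocount=1.0):
    """Return genes whose estimate changes by at least `threshold`-fold
    in either direction when strand information is ignored."""
    limit = math.log2(threshold)
    flagged = {}
    for gene, s_count in stranded_counts.items():
        u_count = unstranded_counts.get(gene, 0)
        lfd = log2_fold_difference(s_count, u_count, pseudocount)
        if abs(lfd) >= limit:
            flagged[gene] = lfd
    return flagged


def spans_exon_junction(cigar):
    """True if a SAM CIGAR string contains an 'N' (skipped region) operation,
    i.e. the read is spliced across an exon-exon boundary."""
    return "N" in cigar


if __name__ == "__main__":
    print(flag_biased_genes(stranded, unstranded))
    print(spans_exon_junction("30M1200N46M"), spans_exon_junction("76M"))
```

In this toy input, GENE_A changes by about 10% and is left alone, while GENE_B and GENE_C each change four-fold and are flagged. Restricting counting to reads for which `spans_exon_junction` is true is one simple way to read the abstract's "reads that span exonic boundaries"; the package may apply different or additional criteria.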

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/02f1/6841694/324e9b372274/41598_2019_52584_Fig1_HTML.jpg
