Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data.

Affiliations

Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, CA, United States.

Department of Medicine, University of California San Diego, La Jolla, CA, United States.

Publication information

Sci Rep. 2019 Nov 8;9(1):16342. doi: 10.1038/s41598-019-52584-w.

DOI:10.1038/s41598-019-52584-w

PMID:31704962

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6841694/

Abstract

RNA-seq methods are widely used for transcriptomic profiling of biological samples. However, the technology has known caveats that can skew gene expression estimates. Specifically, if the library preparation protocol does not retain RNA strand information, some genes can be erroneously quantitated. Although strand-specific protocols have been established, a significant portion of RNA-seq data is generated in a non-strand-specific manner. We used a comprehensive stranded RNA-seq dataset of 15 blood cell types to identify genes whose expression would be erroneously estimated if strand information were not available. We found that about 10% of all genes and 2.5% of protein-coding genes have a two-fold or higher difference in estimated expression when the strand information of the reads is ignored. We used parameters of the read alignments of these genes to construct a machine learning model that can identify which genes in an unstranded dataset might have incorrect expression estimates and which do not. We also show that differential expression analysis of genes with biased expression estimates in unstranded read data can be recovered by limiting the reads considered to those that span exonic boundaries. The resulting approach is implemented as a package available at https://github.com/mikpom/uslcount.
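As a rough illustration of the two ideas in the abstract, the sketch below flags genes whose count changes two-fold or more when strand information is ignored, and checks whether a read spans an exon boundary by looking for the `N` (skipped region) operation in its SAM CIGAR string. This is not the uslcount implementation or its API. The function names, the pseudocount, and the toy counts are all assumptions made for the example.

```python
import math
import re


def flag_strand_sensitive(stranded, unstranded, min_fold=2.0, pseudocount=1.0):
    """Return sorted gene IDs whose unstranded count differs from the
    stranded count by at least `min_fold` in either direction.

    `stranded` and `unstranded` map gene ID -> read count (hypothetical
    inputs). The pseudocount is an assumption that avoids division by zero.
    """
    threshold = math.log2(min_fold)
    flagged = []
    for gene, s_count in stranded.items():
        u_count = unstranded.get(gene, 0)
        log2_ratio = math.log2((u_count + pseudocount) / (s_count + pseudocount))
        if abs(log2_ratio) >= threshold:
            flagged.append(gene)
    return sorted(flagged)


def spans_splice_junction(cigar):
    """Return True if a SAM CIGAR string contains an N operation,
    i.e. the alignment skips a reference region such as an intron."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return any(op == "N" for _, op in operations)


# Toy counts, invented for illustration only.
stranded_counts = {"GENE_A": 100, "GENE_B": 10, "GENE_C": 50}
unstranded_counts = {"GENE_A": 105, "GENE_B": 45, "GENE_C": 60}

flagged_genes = flag_strand_sensitive(stranded_counts, unstranded_counts)
print(flagged_genes)
print(spans_splice_junction("50M1000N50M"), spans_splice_junction("100M"))
```

The fold-change check needs matched stranded and unstranded counts, so it only applies when a stranded reference dataset exists. The CIGAR check works on unstranded alignments alone, which is why restricting the count to junction-spanning reads is useful for data that lacks strand information.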

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/02f1/6841694/324e9b372274/41598_2019_52584_Fig1_HTML.jpg
