
Trimming of sequence reads alters RNA-Seq gene expression estimates.

Authors

Williams Claire R, Baccarella Alyssa, Parrish Jay Z, Kim Charles C

Affiliations

Department of Biology, University of Washington, Seattle, WA, 98195, USA.

Division of Experimental Medicine, Department of Medicine, University of California San Francisco, San Francisco, CA, 94110, USA.

Publication information

BMC Bioinformatics. 2016 Feb 25;17:103. doi: 10.1186/s12859-016-0956-2.

DOI:10.1186/s12859-016-0956-2
PMID:26911985
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4766705/
Abstract

BACKGROUND

High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias.
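As a rough illustration of what "identified by the probability that they are called incorrectly" means in practice, the sketch below converts Phred quality scores to error probabilities and trims low-quality bases from the 3' end of a read. It assumes the common Phred+33 FASTQ encoding and a hypothetical cutoff of Q20; neither is stated in the abstract. The function names are illustrative, and this is a generic simplification rather than the algorithm used by SolexaQA, Trimmomatic, or ConDeTri.

```python
def phred_to_error_prob(q: int) -> float:
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> tuple[str, str]:
    """Drop bases from the 3' end until a base with score >= min_q is reached.

    min_q=20 is an illustrative threshold, not a value from the paper.
    """
    scores = decode_phred33(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(phred_to_error_prob(20))
    print(trim_3prime("ACGTAC", "IIII##"))
```

A higher `min_q` corresponds to more stringent trimming and therefore to shorter surviving reads, which is the regime in which the abstract reports the largest changes in expression estimates.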

RESULTS

To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithms (SolexaQA, Trimmomatic, and ConDeTri) to perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. Finally, an analysis of paired RNA-seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates.

CONCLUSIONS

We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and if used, should be used in conjunction with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates.
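The minimum read length filter recommended above can be sketched as a simple post-trimming step. The `Read` record, the function name, and the default cutoff of 36 bases are assumptions made for illustration; the abstract does not state which threshold the authors used.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    """A single already-trimmed read (hypothetical record type)."""
    name: str
    seq: str
    qual: str


def filter_min_length(reads: Iterable[Read], min_len: int = 36) -> Iterator[Read]:
    """Yield only the reads that are still at least min_len bases long.

    min_len=36 is an illustrative default, not a value from the paper.
    """
    for read in reads:
        if len(read.seq) >= min_len:
            yield read


if __name__ == "__main__":
    trimmed = [
        Read("r1", "ACGT" * 10, "I" * 40),  # 40 bases survive trimming
        Read("r2", "ACGTACGT", "I" * 8),    # only 8 bases survive trimming
    ]
    kept = list(filter_min_length(trimmed, min_len=36))
    print([read.name for read in kept])
```

Discarding reads such as `r2` before alignment removes the short fragments that the abstract identifies as the main source of spurious mapping.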

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45c5/4766705/a8e80a29df1e/12859_2016_956_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45c5/4766705/5349da7d01ac/12859_2016_956_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45c5/4766705/fe04833d6e1e/12859_2016_956_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45c5/4766705/d7459e25cf89/12859_2016_956_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45c5/4766705/1f495fbc79c4/12859_2016_956_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45c5/4766705/1db09754f012/12859_2016_956_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45c5/4766705/d72aeb1b18c7/12859_2016_956_Fig7_HTML.jpg
