Detrimental effects of duplicate reads and low complexity regions on RNA- and ChIP-seq data.

Author information

Dozmorov Mikhail G, Adrianto Indra, Giles Cory B, Glass Edmund, Glenn Stuart B, Montgomery Courtney, Sivils Kathy L, Olson Lorin E, Iwayama Tomoaki, Freeman Willard M, Lessard Christopher J, Wren Jonathan D

Publication information

BMC Bioinformatics. 2015;16 Suppl 13(Suppl 13):S10. doi: 10.1186/1471-2105-16-S13-S10. Epub 2015 Sep 25.

DOI:10.1186/1471-2105-16-S13-S10

PMID:26423047

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4597324/

Abstract

BACKGROUND

Adapter trimming and removal of duplicate reads are common practices in next-generation sequencing pipelines. Sequencing reads ambiguously mapped to repetitive and low complexity regions can also be problematic for accurate assessment of the biological signal, yet their impact on sequencing data has not received much attention. We investigate how trimming the adapters, removing duplicates, and filtering out reads overlapping low complexity regions influence the significance of biological signal in RNA- and ChIP-seq experiments.

METHODS

We assessed the effect of data processing steps on the alignment statistics and the functional enrichment analysis results of RNA- and ChIP-seq data. We compared differentially processed RNA-seq data with matching microarray data from the same patient samples to determine whether changes in pre-processing improved the correlation between the two. We developed a simple tool, RepeatSoaker, to remove reads overlapping low complexity regions (available at https://github.com/mdozmorov/RepeatSoaker), and tested its effect on the alignment statistics and the results of the enrichment analyses.
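The filtering step described above can be illustrated with a minimal sketch. This is not RepeatSoaker's implementation; it only shows the general idea of dropping reads that overlap a set of annotated regions. The function names (`build_index`, `overlaps`, `filter_reads`), the tuple layouts, and the half-open coordinate convention are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(regions):
    """Merge half-open (chrom, start, end) regions per chromosome and sort them.

    Hypothetical helper: returns {chrom: (starts, merged_intervals)}.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def overlaps(index, chrom, start, end):
    """Return True if the half-open read [start, end) overlaps any region."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    # Last merged region that begins before the read ends.
    i = bisect_left(starts, end) - 1
    return i >= 0 and merged[i][1] > start


def filter_reads(reads, index):
    """Keep only (name, chrom, start, end) reads that overlap no region."""
    return [r for r in reads if not overlaps(index, r[1], r[2], r[3])]


# Toy data, invented for illustration only.
regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
reads = [
    ("r1", "chr1", 10, 60),
    ("r2", "chr1", 190, 240),
    ("r3", "chr1", 300, 350),
    ("r4", "chr2", 55, 105),
    ("r5", "chr3", 0, 50),
]
kept = filter_reads(reads, build_index(regions))
kept_names = [r[0] for r in kept]
```

In practice the regions would come from an annotation such as a RepeatMasker track and the reads from an alignment file; both are replaced here by small in-memory lists.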

RESULTS

Both adapter trimming and duplicate removal moderately improved the strength of biological signals in RNA-seq and ChIP-seq data. Aggressive filtering of reads overlapping with low complexity regions, as defined by RepeatMasker, further improved the strength of biological signals, and the correlation between RNA-seq and microarray gene expression data.
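As a rough illustration of what duplicate removal means, the sketch below treats reads sharing the same chromosome, start position, and strand as duplicates and keeps the one with the highest mapping quality. This position-based definition is an assumption chosen for simplicity; the abstract does not state which tool or criterion the authors used, and `remove_duplicates` is a hypothetical name.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, pos, strand), preferring higher mapping quality.

    Each read is a (name, chrom, pos, strand, mapq) tuple (assumed layout).
    """
    best = {}
    for read in reads:
        _, chrom, pos, strand, mapq = read
        key = (chrom, pos, strand)
        # Replace the stored read only if this one has strictly higher quality.
        if key not in best or mapq > best[key][4]:
            best[key] = read
    return list(best.values())


# Toy data, invented for illustration only.
reads = [
    ("a", "chr1", 100, "+", 30),
    ("b", "chr1", 100, "+", 40),  # duplicate of "a" with higher quality
    ("c", "chr1", 100, "-", 20),  # same position, opposite strand: not a duplicate
    ("d", "chr2", 5, "+", 10),
]
deduplicated = remove_duplicates(reads)
dedup_names = [r[0] for r in deduplicated]
```

Paired-end data and unique molecular identifiers would require a richer key than the one used here.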

CONCLUSIONS

Adapter trimming and duplicate removal, coupled with filtering out reads overlapping low complexity regions, are shown to increase the quality and reliability of detecting biological signals in RNA-seq and ChIP-seq data.

Similar articles

Trimming of sequence reads alters RNA-Seq gene expression estimates.

BMC Bioinformatics. 2016 Feb 25;17:103. doi: 10.1186/s12859-016-0956-2.

dupRadar: a Bioconductor package for the assessment of PCR artifacts in RNA-Seq data.

BMC Bioinformatics. 2016 Oct 21;17(1):428. doi: 10.1186/s12859-016-1276-2.

SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis.

BMC Bioinformatics. 2016 Feb 4;17:66. doi: 10.1186/s12859-016-0923-y.

Incorporation of unique molecular identifiers in TruSeq adapters improves the accuracy of quantitative sequencing.

Biotechniques. 2017 Nov 1;63(5):221-226. doi: 10.2144/000114608.

Gencore: an efficient tool to generate consensus reads for error suppressing and duplicate removing of NGS data.

BMC Bioinformatics. 2019 Dec 27;20(Suppl 23):606. doi: 10.1186/s12859-019-3280-9.

PEAT: an intelligent and efficient paired-end sequencing adapter trimming algorithm.

BMC Bioinformatics. 2015;16 Suppl 1(Suppl 1):S2. doi: 10.1186/1471-2105-16-S1-S2. Epub 2015 Jan 21.

QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization.

BMC Genomics. 2016 Jan 8;17:39. doi: 10.1186/s12864-015-2356-9.

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads.

Gigascience. 2015 Oct 19;4:48. doi: 10.1186/s13742-015-0089-y. eCollection 2015.

Differential expression analysis of RNA sequencing data by incorporating non-exonic mapped reads.

BMC Genomics. 2015;16 Suppl 7(Suppl 7):S14. doi: 10.1186/1471-2164-16-S7-S14. Epub 2015 Jun 11.

Cited by

Identification of transcription factor co-binding patterns with non-negative matrix factorization.

Nucleic Acids Res. 2024 Oct 14;52(18):e85. doi: 10.1093/nar/gkae743.

Improved analysis of (e)CLIP data with RCRUNCH yields a compendium of RNA-binding protein binding sites and motifs.

Genome Biol. 2023 Apr 17;24(1):77. doi: 10.1186/s13059-023-02913-0.

The life cycle-dependent transcriptional profile of the obligate intracellular amoeba symbiont Amoebophilus asiaticus.

FEMS Microbiol Ecol. 2022 Feb 10;98(1). doi: 10.1093/femsec/fiac001.

Transcriptome-Wide Analyses Identify Dominant as the Predominantly Non-Conservative Alternative Splicing Inheritance Patterns in F1 Chickens.

Front Genet. 2021 Dec 3;12:774240. doi: 10.3389/fgene.2021.774240. eCollection 2021.

Conserved DNA sequence features underlie pervasive RNA polymerase pausing.

Nucleic Acids Res. 2021 May 7;49(8):4402-4420. doi: 10.1093/nar/gkab208.

Direct Nanopore Sequencing of mRNA Reveals Landscape of Transcript Isoforms in Apicomplexan Parasites.

mSystems. 2021 Mar 9;6(2):e01081-20. doi: 10.1128/mSystems.01081-20.

Inheritance patterns of the transcriptome in hybrid chickens and their parents revealed by expression analysis.

Sci Rep. 2019 Apr 8;9(1):5750. doi: 10.1038/s41598-019-42019-x.

RNA-sequencing in ophthalmology research: considerations for experimental design and analysis.

Ther Adv Ophthalmol. 2019 Mar 15;11:2515841419835460. doi: 10.1177/2515841419835460. eCollection 2019 Jan-Dec.

unitas: the universal tool for annotation of small RNAs.

BMC Genomics. 2017 Aug 22;18(1):644. doi: 10.1186/s12864-017-4031-9.

Allele-Specific Expression Analysis Does Not Support Sex Chromosome Inactivation on the Chicken Z Chromosome.

Genome Biol Evol. 2017 Mar 1;9(3):619-626. doi: 10.1093/gbe/evx031.
