DAFS: a data-adaptive flag method for RNA-sequencing data to differentiate genes with low and high expression.

Author information

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, FDA, Jefferson, AR 72079, USA.

Publication information

BMC Bioinformatics. 2014 Mar 31;15:92. doi: 10.1186/1471-2105-15-92.

DOI: 10.1186/1471-2105-15-92

PMID: 24685233

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4098771/

Abstract

BACKGROUND

Next-generation sequencing (NGS) has advanced the application of high-throughput sequencing technologies in genetic and genomic variation analysis. Because RNA-seq covers a large dynamic range of expression levels, it is more likely to detect transcripts with low expression. It is clear that genes with no mapped reads are not expressed; however, there is ongoing debate about the level of abundance that constitutes biologically meaningful expression. To date, there is no consensus on the definition of low expression. Since random variation is high in regions with low expression and distributions of transcript expression are affected by numerous experimental factors, methods to differentiate low- and high-expression data in a sample are critical to interpreting classes of abundance levels in RNA-seq data.

RESULTS

A data-adaptive approach was developed to estimate the lower bound of high expression for RNA-seq data. The Kolmogorov-Smirnov statistic and multivariate adaptive regression splines were used to determine the optimal cutoff value for separating transcripts with high and low expression. Results from the proposed method were compared to results obtained by estimating the theoretical cutoff of a fitted two-component mixture distribution. The robustness of the proposed method was demonstrated by analyzing different RNA-seq datasets that varied by sequencing depth, species, scale of measurement, and empirical density shape.
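The idea of scanning candidate cutoffs with a Kolmogorov-Smirnov statistic can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes log-scale expression values, fits a normal to the values above each candidate cutoff by sample mean and standard deviation, and picks the cutoff with the smallest KS distance. The smoothing step with multivariate adaptive regression splines is omitted, and the raw minimum is used instead. The function names, the simulated data, and the candidate grid are all hypothetical.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_distance(values):
    """KS distance between the sample and a normal fitted to that sample."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


def adaptive_cutoff(log_expr, candidates, min_size=30):
    """Scan candidate cutoffs; return (best_cutoff, [(cutoff, ks), ...])."""
    curve = []
    for c in candidates:
        upper = [v for v in log_expr if v > c]
        if len(upper) >= min_size:  # skip cutoffs that leave too few genes
            curve.append((c, ks_distance(upper)))
    best = min(curve, key=lambda pair: pair[1])[0]
    return best, curve


# Simulated log-scale expression (hypothetical): a low and a high class.
rng = random.Random(0)
log_expr = [rng.gauss(0.0, 1.0) for _ in range(2000)]
log_expr += [rng.gauss(7.0, 1.5) for _ in range(3000)]

candidates = [i * 0.25 for i in range(-8, 29)]  # -2.0 to 7.0 in steps of 0.25
best, curve = adaptive_cutoff(log_expr, candidates)
print(f"estimated lower bound of high expression: {best:.2f}")
```

A cutoff that is too low leaves low-expression values in the upper set, and one that is too high truncates the high-expression class. Both push the KS distance up, which is why the minimum of the curve is a plausible data-driven boundary in this sketch.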

CONCLUSIONS

The analysis of real and simulated data presented here illustrates the need to employ data-adaptive methodology in lieu of arbitrary cutoffs to distinguish low-expression RNA-seq data from high-expression data. Our results also show the drawbacks of characterizing the data by a two-component mixture distribution when classes of gene expression are not well separated. The ability to ascertain stably expressed RNA-seq data is essential in the filtering process of data analysis, and methodologies that consider the underlying data structure demonstrate superior performance in preserving most of the interpretable and meaningful data. The proposed algorithm for classifying low and high regions of transcript abundance promises wide-ranging application in the continuing development of RNA-seq analysis.
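For comparison, the theoretical cutoff of a two-component mixture can be sketched as the point between the component means where the two weighted densities are equal. The code below is an illustrative assumption, not the procedure used in the paper. It fits a two-component Gaussian mixture with a plain EM loop and locates the crossing by bisection. The function names and the simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def fit_two_gaussians(xs, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture: (weights, means, sds)."""
    xs = sorted(xs)
    half = len(xs) // 2
    # Crude start: split the sorted data at the median (component 0 = low).
    means = [fmean(xs[:half]), fmean(xs[half:])]
    sds = [pstdev(xs[:half]), pstdev(xs[half:])]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        comps = [NormalDist(m, s) for m, s in zip(means, sds)]
        # E-step: posterior probability that each point is in component 0.
        resp = []
        for x in xs:
            p0 = weights[0] * comps[0].pdf(x)
            p1 = weights[1] * comps[1].pdf(x)
            resp.append(p0 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: re-estimate weight, mean, and sd of each component.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - v for v in resp]
            total = sum(r)
            weights[k] = total / len(xs)
            means[k] = sum(ri * x for ri, x in zip(r, xs)) / total
            var = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, xs)) / total
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


def mixture_cutoff(weights, means, sds):
    """Bisection for the point between the means where weighted densities match."""
    low = NormalDist(means[0], sds[0])
    high = NormalDist(means[1], sds[1])
    lo, hi = means[0], means[1]
    for _ in range(60):
        mid = (lo + hi) / 2
        if weights[0] * low.pdf(mid) > weights[1] * high.pdf(mid):
            lo = mid  # low component still dominates; move right
        else:
            hi = mid
    return (lo + hi) / 2


# Simulated, well-separated classes (hypothetical data).
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(800)]
data += [rng.gauss(7.0, 1.5) for _ in range(1200)]

weights, means, sds = fit_two_gaussians(data)
cutoff = mixture_cutoff(weights, means, sds)
print(f"mixture-based cutoff: {cutoff:.2f}")
```

The bisection assumes a single crossing between the two means, which holds when the classes are well separated. When the components overlap heavily, the fitted parameters become unstable and the crossing can shift or fail to exist between the means, which is consistent with the drawback described above.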
