Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, FDA, Jefferson, AR 72079, USA.
BMC Bioinformatics. 2014 Mar 31;15:92. doi: 10.1186/1471-2105-15-92.
Next-generation sequencing (NGS) has advanced the application of high-throughput sequencing technologies in the analysis of genetic and genomic variation. Because of its large dynamic range of expression levels, RNA-seq readily detects transcripts with low expression. Genes with no mapped reads are clearly not expressed; however, there is ongoing debate about the level of abundance that constitutes biologically meaningful expression, and to date there is no consensus on the definition of low expression. Random variation is high in regions of low expression, and distributions of transcript expression are affected by numerous experimental factors. Methods that differentiate low- and high-expression data within a sample are therefore critical to interpreting classes of abundance levels in RNA-seq data.
A data-adaptive approach was developed to estimate the lower bound of high expression for RNA-seq data. The Kolmogorov-Smirnov statistic and multivariate adaptive regression splines were used to determine the optimal cutoff value for separating transcripts with high and low expression. Results from the proposed method were compared with results obtained by estimating the theoretical cutoff of a fitted two-component mixture distribution. The robustness of the proposed method was demonstrated by analyzing RNA-seq datasets that varied in sequencing depth, species, scale of measurement, and empirical density shape.
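To make the comparison baseline concrete, the following is a minimal sketch of how a theoretical cutoff could be read off a fitted two-component mixture. It is not the authors' implementation: it assumes Gaussian components on a log-expression scale, fits them with a plain EM loop, and takes the cutoff as the point between the two component means where the weighted densities are equal. The function names `fit_two_gaussians` and `mixture_cutoff`, the initialisation, and the iteration count are all illustrative choices.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, n_iter=100):
    """Fit a 1-D two-component Gaussian mixture by EM (illustrative helper).

    Returns (weights, means, sds) with the lower-mean component first.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Assumption: initialise from the lower and upper halves of the sorted data.
    groups = [xs[:half], xs[half:]]
    w = [0.5, 0.5]
    mu = [sum(g) / len(g) for g in groups]
    sd = [max(1e-6, math.sqrt(sum((v - m) ** 2 for v in g) / len(g)))
          for g, m in zip(groups, mu)]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 0.
        resp = []
        for x in xs:
            p0 = w[0] * normal_pdf(x, mu[0], sd[0])
            p1 = w[1] * normal_pdf(x, mu[1], sd[1])
            total = p0 + p1
            resp.append(p0 / total if total > 0.0 else 0.5)
        # M-step: update weights, means, and standard deviations.
        n0 = max(sum(resp), 1e-12)
        n1 = max(n - n0, 1e-12)
        w = [n0 / n, n1 / n]
        mu = [sum(r * x for r, x in zip(resp, xs)) / n0,
              sum((1.0 - r) * x for r, x in zip(resp, xs)) / n1]
        sd = [max(1e-6, math.sqrt(sum(r * (x - mu[0]) ** 2
                                      for r, x in zip(resp, xs)) / n0)),
              max(1e-6, math.sqrt(sum((1.0 - r) * (x - mu[1]) ** 2
                                      for r, x in zip(resp, xs)) / n1))]
    if mu[0] > mu[1]:  # keep the low-expression component first
        w, mu, sd = w[::-1], mu[::-1], sd[::-1]
    return w, mu, sd


def mixture_cutoff(w, mu, sd, tol=1e-9):
    """Point between the two means where the weighted densities are equal.

    Returns None when no crossing lies between the means, which can happen
    when the two components are not well separated.
    """
    def log_ratio(x):
        # log(w0 * f0(x)) - log(w1 * f1(x)); the shared constant cancels.
        g0 = math.log(w[0]) - math.log(sd[0]) - 0.5 * ((x - mu[0]) / sd[0]) ** 2
        g1 = math.log(w[1]) - math.log(sd[1]) - 0.5 * ((x - mu[1]) / sd[1]) ** 2
        return g0 - g1

    lo, hi = mu[0], mu[1]
    if not (log_ratio(lo) > 0.0 > log_ratio(hi)):
        return None
    while hi - lo > tol:  # bisection on the sign change
        mid = 0.5 * (lo + hi)
        if log_ratio(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Simulated log-expression values; not data from the study.
    random.seed(0)
    sample = ([random.gauss(-2.0, 1.0) for _ in range(1000)]
              + [random.gauss(4.0, 1.0) for _ in range(1500)])
    weights, means, sds = fit_two_gaussians(sample)
    print("estimated cutoff:", mixture_cutoff(weights, means, sds))
```

In this sketch the cutoff is only defined when the weighted densities cross between the two component means; returning `None` otherwise is one way to flag a sample in which the two classes of expression overlap too heavily for a mixture-based cutoff to be meaningful.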
The analysis of real and simulated data presented here illustrates the need for data-adaptive methodology, rather than arbitrary cutoffs, to distinguish low-expression from high-expression RNA-seq data. Our results also show the drawbacks of characterizing the data by a two-component mixture distribution when classes of gene expression are not well separated. The ability to identify stably expressed RNA-seq data is essential to the filtering step of data analysis, and methodologies that consider the underlying data structure perform better at preserving most of the interpretable and meaningful data. The proposed algorithm for classifying low and high regions of transcript abundance promises wide application in the continuing development of RNA-seq analysis.