Kruppa Jochen, Kramer Frank, Beißbarth Tim, Jung Klaus
Stat Appl Genet Mol Biol. 2016 Oct 1;15(5):401-414. doi: 10.1515/sagmb-2015-0082.
Processing data from high-throughput sequencing experiments yields count data that represent the number of reads mapping to specific genomic regions. Count data also arise in mass spectrometry experiments for the detection of protein-protein interactions. Evaluating new computational methods for the analysis of sequencing count data, or of spectral count data from proteomics experiments, therefore requires artificial count data. Although some methods for generating artificial sequencing count data have been proposed, they either simulate single sequencing runs, thereby omitting the correlation structure between individual genomic features, or are limited to specific structures. We propose to draw correlated data from the multivariate normal distribution and to round these continuous values to obtain discrete counts. In our approach, the required distribution parameters can either be constructed in different ways or estimated from real count data. Because rounding affects the correlation structure, we evaluate the use of shrinkage estimators that have already been applied to artificial expression data from DNA microarrays. Our approach proved useful for simulating counts for defined subsets of features, such as individual pathways or GO categories.
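The core idea of the abstract, drawing correlated values from a multivariate normal distribution and rounding them to counts, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `simulate_counts` and `shrink_covariance`, the example means, standard deviations, and correlations, the clipping of negative values to zero, and the fixed shrinkage weight `lam` are all assumptions made for illustration. The shrinkage step shown is a generic shrinkage toward the diagonal of the sample covariance matrix, which may differ from the estimators evaluated in the paper.

```python
import numpy as np


def simulate_counts(mean, cov, n_samples, rng):
    """Draw correlated normal values and round them to non-negative integers."""
    draws = rng.multivariate_normal(mean, cov, size=n_samples)
    # Rounding gives discrete values; clipping at zero is an assumption of
    # this sketch so that no negative "counts" are produced.
    return np.clip(np.rint(draws), 0, None).astype(np.int64)


def shrink_covariance(counts, lam):
    """Shrink the sample covariance toward its diagonal with weight lam in [0, 1]."""
    sample_cov = np.cov(counts, rowvar=False)
    target = np.diag(np.diag(sample_cov))  # keeps variances, drops covariances
    return (1.0 - lam) * sample_cov + lam * target


# Hypothetical parameters for three features (e.g. genes in one pathway).
rng = np.random.default_rng(0)
mean = np.array([100.0, 250.0, 40.0])
sd = np.array([10.0, 25.0, 5.0])
corr = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.2],
    [0.3, 0.2, 1.0],
])
cov = corr * np.outer(sd, sd)  # covariance from correlations and standard deviations

counts = simulate_counts(mean, cov, n_samples=500, rng=rng)
sample_cov = np.cov(counts, rowvar=False)
cov_shrunk = shrink_covariance(counts, lam=0.2)  # lam chosen arbitrarily here
```

In this sketch the parameters are constructed by hand. Following the abstract, `mean` and `cov` could instead be estimated from real count data, with the shrunken covariance matrix then passed to `simulate_counts` in place of the plain sample covariance.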