Chilimoniuk Jarosław, Gosiewska Alicja, Słowik Jadwiga, Weiss Romano, Deckert P Markus, Rödiger Stefan, Burdukiewicz Michał
Department of Bioinformatics and Genomics, Faculty of Biotechnology, University of Wrocław, Wrocław, Poland.
Faculty of Natural Sciences, Brandenburg University of Technology Cottbus-Senftenberg, Senftenberg, Germany.
Ann Transl Med. 2021 Apr;9(7):528. doi: 10.21037/atm-20-6363.
DNA double-strand breaks can be counted as discrete foci by imaging techniques. In personalized medicine and pharmacology, the analysis of such count data is relevant for numerous applications, e.g., cancer and aging research and the evaluation of drug efficacy. By default, count data are assumed to follow the Poisson distribution. This assumption, however, may lead to biased results and faulty conclusions in datasets with excess zero values (zero-inflation), a variance larger than the mean (overdispersion), or both. In such cases, assuming a Poisson distribution skews the estimates of mean and variance, and other models such as the negative binomial (NB), zero-inflated Poisson, or zero-inflated NB distributions should be employed. The chosen model influences the parameter estimates (mean value and confidence interval). Yet choosing a suitable distribution model is not trivial.
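The two warning signs named above, overdispersion and zero-inflation, can be checked with a few lines of code before any model is fitted. The following is a minimal Python sketch, not part of countfitteR; the function name `poisson_diagnostics` and the example counts are hypothetical and chosen only for illustration. It compares the sample variance with the mean and compares the observed fraction of zeros with the fraction exp(-mean) that a Poisson distribution with the same mean would predict.

```python
import math
from statistics import mean, variance


def poisson_diagnostics(counts):
    """Compare a sample of counts with what a Poisson of equal mean predicts.

    Hypothetical helper for illustration; assumes at least two observations.
    """
    n = len(counts)
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 in the denominator)
    observed_zero_frac = sum(1 for c in counts if c == 0) / n
    expected_zero_frac = math.exp(-m)  # P(X = 0) for a Poisson with mean m
    return {
        "mean": m,
        "variance": v,
        # A ratio well above 1 suggests overdispersion.
        "dispersion_index": v / m if m > 0 else float("nan"),
        # Observed clearly above expected suggests zero-inflation.
        "observed_zero_frac": observed_zero_frac,
        "expected_zero_frac": expected_zero_frac,
    }


# Hypothetical foci counts per cell (illustrative values, not from the study).
foci = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 8, 12, 0, 0, 4]
report = poisson_diagnostics(foci)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

These diagnostics only indicate that the Poisson assumption is doubtful; they do not by themselves decide which alternative distribution fits best.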
To support and simplify this process and make it more objective, we have developed the countfitteR software as an R package. We used a Bayesian approach for distribution model selection and the Shiny web application framework for interactive data analysis.
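To illustrate what an automated choice between count distributions can look like, the Python sketch below compares a Poisson and an NB fit using the Bayesian information criterion (BIC). This is an assumption made for illustration: the abstract states only that a Bayesian approach is used, and the code does not reproduce countfitteR's implementation or API. All function names are hypothetical, the NB dispersion parameter is found by a crude grid search, and the zero-inflated models are omitted for brevity.

```python
import math


def poisson_loglik(counts, lam):
    """Log-likelihood of counts under a Poisson with mean lam (lam > 0)."""
    return sum(c * math.log(lam) - lam - math.lgamma(c + 1) for c in counts)


def nb_loglik(counts, mu, size):
    """Log-likelihood under an NB with mean mu and dispersion parameter size.

    Parameterized so that the variance equals mu + mu**2 / size.
    """
    p = size / (size + mu)
    return sum(
        math.lgamma(c + size) - math.lgamma(size) - math.lgamma(c + 1)
        + size * math.log(p) + c * math.log(1 - p)
        for c in counts
    )


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    return n_params * math.log(n_obs) - 2 * loglik


def select_model(counts):
    """Return the name of the model with the lowest BIC and all BIC scores.

    Assumes at least one non-zero count, so that the mean is positive.
    """
    n = len(counts)
    mu = sum(counts) / n  # maximum-likelihood mean for both models
    # Crude grid over the NB size parameter, 1e-3 to 1e3 on a log scale.
    sizes = [10 ** (k / 20) for k in range(-60, 61)]
    best_size = max(sizes, key=lambda s: nb_loglik(counts, mu, s))
    scores = {
        "poisson": bic(poisson_loglik(counts, mu), 1, n),
        "nb": bic(nb_loglik(counts, mu, best_size), 2, n),
    }
    return min(scores, key=scores.get), scores


# Hypothetical example data (illustrative values, not from the study).
overdispersed = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 8, 12, 0, 0, 4]
poisson_like = [2, 3, 1, 2, 4, 3, 2, 1, 3, 2, 2, 3]

best_over, scores_over = select_model(overdispersed)
best_pois, scores_pois = select_model(poisson_like)
print(best_over, scores_over)
print(best_pois, scores_pois)
```

The BIC penalizes the extra NB parameter, so the simpler Poisson model is retained unless the data show enough extra variance to justify the additional parameter.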
We demonstrate the application of our software using examples of DNA double-strand break count data from phenotypic imaging by multiplex fluorescence microscopy. In analyzing numerous datasets of molecular pharmacological markers (phosphorylated histone H2AX and p53 binding protein), countfitteR demonstrated equal or superior statistical performance compared to the commonly employed two-step procedure, with an overall power of up to 98%. In addition, it still provided information in cases where the two-step procedure yielded no result at all. In our data sample, the NB distribution was the most frequent, followed by the Poisson distribution.
countfitteR performs automated distribution model selection, thereby supporting data analysis and yielding objective, statistically verifiable estimates. Originally designed for the analysis of foci in biomedical image data, countfitteR can be used in a variety of areas where non-Poisson-distributed count data are prevalent.