Wang Jia, Tian Lili, Yan Li
Department of Biostatistics, University at Buffalo, Buffalo, NY, United States of America.
Department of Biostatistics and Bioinformatics, Roswell Park Comprehensive Cancer Center, Buffalo, NY, United States of America.
PLoS One. 2024 Dec 13;19(12):e0314705. doi: 10.1371/journal.pone.0314705. eCollection 2024.
In genomic studies, log transformation is a common preprocessing step to adjust for skewness in the data. This standard approach often assumes that the log-transformed data are normally distributed, and the two-sample t-test (or one of its modifications) is used to detect differences between two experimental conditions. However, it was recently shown that the two-sample t-test can lead to exaggerated false positives, and the Wilcoxon-Mann-Whitney (WMW) test was proposed as an alternative for studies with larger sample sizes. In addition, studies have demonstrated that the specific distribution used to model genomic data has a profound impact on the interpretation and validity of results. The aim of this paper is three-fold: 1) to present the Exp-gamma distribution (exponential-gamma, i.e., the distribution of a log-transformed gamma variable) as a proper biological and statistical model for the analysis of log-transformed protein abundance data from single-cell experiments; 2) to demonstrate the inappropriateness of the two-sample t-test and the WMW test for analyzing log-transformed protein abundance data; 3) to propose and evaluate statistical inference methods for hypothesis testing and confidence interval estimation when comparing two independent samples under Exp-gamma distributions. The proposed methods are applied to protein abundance data from a single-cell dataset.
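To make the definition of the Exp-gamma distribution concrete, the following is a minimal sketch of why log-transformed gamma data are not normal. It assumes a shape/scale parameterization of the gamma distribution and uses SciPy's `loggamma` as the distribution of the logarithm of a gamma variable. The parameter values (`shape`, `scale`), the sample size, and the seed are hypothetical, chosen only for illustration, and are not taken from the paper. The sketch does not reproduce the authors' inference methods.

```python
import numpy as np
from scipy import stats
from scipy.special import digamma, polygamma

# Hypothetical parameters, chosen only for illustration (not from the paper).
shape, scale = 2.0, 3.0

rng = np.random.default_rng(0)
x = rng.gamma(shape, scale, size=200_000)  # raw abundance, gamma distributed
y = np.log(x)                              # log-transformed values

# scipy's loggamma(c) is the law of log(G) for G ~ Gamma(c, scale=1),
# so the gamma scale parameter becomes a location shift of log(scale).
exp_gamma = stats.loggamma(shape, loc=np.log(scale))

# Closed-form moments of log(X):
#   mean     = digamma(shape) + log(scale)
#   variance = trigamma(shape)
#   skewness = psi''(shape) / trigamma(shape)**1.5, negative for every shape
theory_mean = digamma(shape) + np.log(scale)
theory_var = polygamma(1, shape)
theory_skew = polygamma(2, shape) / polygamma(1, shape) ** 1.5

print("mean     (sample, theory):", y.mean(), theory_mean)
print("variance (sample, theory):", y.var(), theory_var)
print("skewness (sample, theory):", stats.skew(y), theory_skew)
```

Under these assumptions the log-transformed values remain left-skewed for any shape parameter, so the normality assumption behind the two-sample t-test does not hold exactly after the log transformation.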