Park Jiwon, Park Taesung
Interdisciplinary Program of Bioinformatics, Seoul National University, Seoul, Republic of Korea.
Department of Statistics, Seoul National University, Seoul, Republic of Korea.
Front Microbiol. 2025 Feb 25;16:1484183. doi: 10.3389/fmicb.2025.1484183. eCollection 2025.
Batch effects are variations in data that arise from non-biological factors such as experimental conditions, equipment, and other external influences. They are a significant problem in the analysis of biological data because they compromise data consistency and distort true biological differences, which can severely skew the results of downstream analyses.
In this study, we introduce a new approach that addresses two types of batch effects: "systematic batch effects", which are consistent across all samples in a batch, and "nonsystematic batch effects", which vary with the variability of operational taxonomic units (OTUs) within each sample of the same batch. To address systematic batch effects, we fit a negative binomial regression model and correct for consistent batch influences by excluding the fixed batch effects. To handle nonsystematic batch effects, we employ composite quantile regression. This step accounts for variability at the OTU level by adjusting the distribution of OTUs in each batch to resemble that of a reference batch, which is selected using the Kruskal-Wallis test.
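As a rough illustration of two ingredients named above, the following Python sketch computes a tie-corrected Kruskal-Wallis H statistic for one OTU across batches and then aligns a batch to a reference batch by empirical quantile mapping. The quantile mapping is a deliberately simplified stand-in for composite quantile regression, not the authors' procedure, and the abstract does not state the rule by which the Kruskal-Wallis test selects the reference batch. The function names `average_ranks`, `kruskal_wallis_h`, and `quantile_map` are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of per-batch value lists."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def quantile_map(values, reference):
    """Map each value onto the reference value at the same empirical quantile.

    Simplified stand-in for composite quantile regression (an assumption).
    """
    ref = sorted(reference)
    n, m = len(values), len(ref)
    mapped = []
    for r in average_ranks(values):
        q = (r - 1) / (n - 1) if n > 1 else 0.5  # quantile in [0, 1]
        pos = q * (m - 1)                         # fractional index into ref
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        mapped.append(ref[lo] * (1 - frac) + ref[hi] * frac)
    return mapped


# Toy abundances of one OTU in three batches (made-up numbers).
batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_stat = kruskal_wallis_h(batches)
aligned = quantile_map(batches[2], batches[0])
```

A large H suggests that the OTU's distribution differs between batches. In the toy call, `aligned` holds the third batch's values re-expressed on the scale of the first batch while preserving their rank order.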
We evaluate the model and compare it with existing methods using PERMANOVA R-squared values, principal coordinates analysis (PCoA) plots, and average silhouette coefficients calculated with a range of distance metrics. The model is applied to three real microbiome datasets: Metagenomic urine control data, Human Immunodeficiency Virus Re-analysis Consortium data, and Men and Women Offering Understanding of Throat HPV study data. The results demonstrate that the model effectively corrects for batch effects across all three datasets.
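To make two of these evaluation metrics concrete, the sketch below computes a PERMANOVA-style R-squared (between-group sum of squares divided by total sum of squares, both derived from pairwise distances) and the average silhouette coefficient with batch labels as the grouping. It assumes Bray-Curtis distance as one example metric, omits the permutation test that yields a p-value, and uses made-up counts. The function names are hypothetical and are not taken from the paper.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def distance_matrix(samples):
    """Full symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    return [[bray_curtis(a, b) for b in samples] for a in samples]


def permanova_r2(dist, labels):
    """Share of total distance-based variation explained by the labels."""
    n = len(labels)
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dist[i][j] ** 2
                         for a, i in enumerate(idx)
                         for j in idx[a + 1:]) / len(idx)
    return (ss_total - ss_within) / ss_total if ss_total else 0.0


def average_silhouette(dist, labels):
    """Mean silhouette coefficient; singleton groups contribute 0."""
    n = len(labels)
    scores = []
    for i in range(n):
        own = [dist[i][j] for j in range(n)
               if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own group
        others = []
        for g in set(labels) - {labels[i]}:
            d = [dist[i][j] for j in range(n) if labels[j] == g]
            others.append(sum(d) / len(d))
        b = min(others) if others else 0.0  # nearest other group
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


# Toy OTU counts: two samples per batch, batches fully separated.
samples = [[10, 0], [10, 0], [0, 10], [0, 10]]
dist = distance_matrix(samples)
r2_separated = permanova_r2(dist, ["A", "A", "B", "B"])
r2_mixed = permanova_r2(dist, ["A", "B", "A", "B"])
```

When batch is the grouping variable, a lower R-squared and a lower average silhouette after correction indicate that samples separate less by batch. In the toy data, `r2_separated` corresponds to a strong batch effect and `r2_mixed` to none.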