Yosef Arthur, Shnaider Eli, Schneider Moti, Gurevich Michael
Tel Aviv-Yaffo Academic College, Yaffo, Israel.
Netanya Academic College, Netanya, Israel.
Bioinform Biol Insights. 2023 Mar 31;17:11779322231160397. doi: 10.1177/11779322231160397. eCollection 2023.
In this study, we introduce an artificial intelligence method for addressing the batch effect in transcriptome data. The method has several clear advantages over the alternative methods presently in use. Batch effect refers to the discrepancy between series of gene expression data measured under different conditions. While data from the same batch (measurements performed under the same conditions) are compatible, combining several batches into one data set is problematic because the measurements are incompatible. Therefore, it is necessary to correct the combined data (normalization) before performing biological analysis. Numerous methods attempt to correct a data set for batch effect. These methods rely on various assumptions about the distribution of the measurements. Forcing the data elements into a presupposed distribution can severely distort biological signals, leading to incorrect results and conclusions. The wider the discrepancy between the assumed and the actual data distribution, the greater the biases introduced by such "correction methods." We introduce a heuristic method to reduce batch effect. The method does not rely on any assumptions about the distribution or behavior of the data elements. Hence, it introduces no new biases in the process of correcting the batch effect, and it strictly maintains the integrity of measurements within the original batches.
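The batch effect described above can be illustrated with a minimal simulated example: two batches measuring the same underlying biology, where one batch carries a purely technical additive offset. The per-batch median-centering shown at the end is a hypothetical, assumption-free adjustment used only for illustration; it is NOT the heuristic method proposed in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated expression values for one gene across 50 samples.
# The underlying biological signal is the same for all samples.
true_signal = rng.normal(loc=10.0, scale=1.0, size=50)

# Batch 2 carries an additive technical offset (the batch effect);
# within each batch, the measurements remain internally consistent.
batch1 = true_signal[:25]
batch2 = true_signal[25:] + 3.0

# Naively pooling the batches inflates the apparent variability,
# which can be mistaken for a biological difference.
pooled = np.concatenate([batch1, batch2])
print(np.std(pooled) > np.std(true_signal))  # the offset widens the spread

# A simple assumption-free adjustment (hypothetical illustration only):
# center each batch on its own median, preserving the relative order and
# spacing of measurements *within* each batch.
adjusted = np.concatenate([batch1 - np.median(batch1),
                           batch2 - np.median(batch2)])
print(np.median(adjusted[:25]), np.median(adjusted[25:]))
```

Because each batch is shifted by a single constant, all within-batch relationships between measurements are preserved exactly, which mirrors the abstract's requirement that the integrity of measurements within the original batches be maintained.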