Institute of Automatic Control, Silesian University of Technology, Gliwice, Poland.
Department of Internal Medicine, Yale School of Medicine, Yale University, New Haven, CT, USA.
Bioinformatics. 2019 Jun 1;35(11):1885-1892. doi: 10.1093/bioinformatics/bty900.
In contemporary biological experiments, bias that interferes with the measurements requires careful handling. Batch effects are an important source of bias in high-throughput biological experiments, and diverse methods for removing them have been established. These include various normalization techniques, yet many require knowledge of the number of batches and of the assignment of samples to batches. Only a few can identify a batch effect of unknown structure. For this reason, a novel batch identification algorithm based on dynamic programming is introduced for omics data that can be ordered on a timescale.
The BatchI algorithm partitions a series of high-throughput experiment samples into sub-series corresponding to estimated batches. Dynamic programming is used to split the data so that dispersion between batches is maximal while within-batch dispersion remains minimal. The procedure has been tested on a number of available datasets, with and without prior information about batch partitioning. Datasets with a priori identified batches were split in agreement with the known structure, as measured by the weighted average Dice index. Batch effect correction is justified by higher intra-group correlation. In datasets without prior batch information, the identified batch divisions lead to improved parameters and improved quality of biological information, as shown by a literature study and by Information Content. The outcome of the algorithm serves as a starting point for correction methods. It has been demonstrated that omitting the essential step of batch effect control may lead to the loss of potentially valuable discoveries.
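The partitioning step described above can be illustrated with a minimal sketch. It is not the BatchI implementation: it assumes that the samples are already in time order, that the number of batches `k` is given, and that dispersion is measured as a sum of squared deviations from the batch mean. Under that last assumption the total dispersion is fixed, so minimizing within-batch dispersion is equivalent to maximizing between-batch dispersion. The names `make_segment_cost`, `partition_batches`, and `min_size` are hypothetical and chosen only for illustration.

```python
def make_segment_cost(rows):
    """Return cost(i, j): within-segment sum of squares for samples i..j-1.

    `rows` is a time-ordered list of equal-length feature vectors.
    Prefix sums make each cost query O(number of features).
    """
    p = len(rows[0])
    prefix = [[0.0] * p]   # running per-feature sums
    prefix_sq = [0.0]      # running sum of squared values
    for r in rows:
        prefix.append([a + v for a, v in zip(prefix[-1], r)])
        prefix_sq.append(prefix_sq[-1] + sum(v * v for v in r))

    def cost(i, j):
        seg_sum = [b - a for a, b in zip(prefix[i], prefix[j])]
        # sum of squares minus (squared sum / count), summed over features
        return (prefix_sq[j] - prefix_sq[i]) - sum(s * s for s in seg_sum) / (j - i)

    return cost


def partition_batches(rows, k, min_size=1):
    """Split ordered samples into k contiguous batches (assumed criterion:
    minimal total within-batch sum of squares). Returns (labels, total_cost)."""
    n = len(rows)
    if k < 1 or min_size < 1 or n < k * min_size:
        raise ValueError("not enough samples for the requested partition")
    cost = make_segment_cost(rows)
    inf = float("inf")
    # best[b][j]: minimal cost of splitting the first j samples into b batches
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(b * min_size, n + 1):
            for i in range((b - 1) * min_size, j - min_size + 1):
                c = best[b - 1][i] + cost(i, j)
                if c < best[b][j]:
                    best[b][j] = c
                    back[b][j] = i
    # Walk the back-pointers from the last batch to the first.
    labels = [0] * n
    j = n
    for b in range(k, 0, -1):
        i = back[b][j]
        for t in range(i, j):
            labels[t] = b - 1
        j = i
    return labels, best[k][n]


if __name__ == "__main__":
    # Toy single-feature series with three visible level shifts.
    series = [[0.0], [0.1], [-0.1], [5.0], [5.1], [4.9], [10.0], [10.2]]
    labels, total = partition_batches(series, k=3)
    print(labels, total)
```

The sketch runs in time proportional to `k * n^2` times the number of features, which is adequate for series of a few hundred samples. Choosing the number of batches when it is unknown is outside the scope of this sketch.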
The implementation is available within the BatchI R package at http://zaed.aei.polsl.pl/index.php/pl/111-software.
Supplementary data are available at Bioinformatics online.