College of Computer and Control Engineering, Nankai University, Tianjin 300350, China.
Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, TX 77030, USA.
Bioinformatics. 2018 Apr 1;34(7):1141-1147. doi: 10.1093/bioinformatics/btx635.
Batch effects are one of the major sources of technical variation affecting measurements in high-throughput studies such as RNA sequencing. It is well established that batch effects can arise from differences in experimental platforms, laboratory conditions, sample sources and personnel. These differences can confound the outcomes of interest and lead to spurious results. Batch correction algorithms require knowledge of the batch factors as a critical input, yet in many cases these factors are unknown or inaccurately recorded. Hence, the primary motivation of our paper is to detect hidden batch factors that can then be supplied to standard techniques to accurately capture the relationship between gene expression and other modeled variables of interest.
We introduce a new algorithm based on data-adaptive shrinkage and semi-Non-negative Matrix Factorization for the detection of unknown batch effects. We test our algorithm on three datasets: (i) Sequencing Quality Control, (ii) Topotecan RNA-Seq and (iii) single-cell RNA sequencing (scRNA-Seq) of Glioblastoma Multiforme. In all three datasets, the algorithm identifies hidden batch effects better than existing batch detection algorithms. In the Topotecan study, we identified a new batch factor that was missed by the original study, an omission that led to under-representation of differentially expressed genes. For scRNA-Seq, we demonstrate the power of our method in detecting subtle batch effects.
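To make the factorization step named above more concrete, the following is a minimal sketch of generic semi-Non-negative Matrix Factorization, assuming the commonly used alternating scheme (a least-squares update for the unconstrained factor and a multiplicative update for the non-negative factor). It is not the DASC implementation and omits the data-adaptive shrinkage step entirely; the function name `semi_nmf`, its parameters and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def semi_nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (features x samples) as F @ G.T with G >= 0.

    F may contain mixed signs; only G is constrained to be non-negative.
    Generic illustration only, not the DASC package's code.
    """
    rng = np.random.default_rng(seed)
    n_samples = X.shape[1]
    G = rng.random((n_samples, k)) + 0.1  # strictly positive start

    def pos(A):
        # Element-wise positive part of A.
        return (np.abs(A) + A) / 2.0

    def neg(A):
        # Element-wise magnitude of the negative part of A.
        return (np.abs(A) - A) / 2.0

    for _ in range(n_iter):
        # F update: least-squares solution for fixed G.
        F = X @ G @ np.linalg.pinv(G.T @ G)
        XtF = X.T @ F
        FtF = F.T @ F
        # G update: multiplicative rule, which keeps G non-negative.
        numer = pos(XtF) + G @ neg(FtF)
        denom = neg(XtF) + G @ pos(FtF)
        G *= np.sqrt((numer + eps) / (denom + eps))

    # Final F consistent with the last G.
    F = X @ G @ np.linalg.pinv(G.T @ G)
    return F, G


# Hypothetical usage on synthetic data: 20 "genes", 12 "samples", rank 3.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3)) @ rng.random((3, 12))
F, G = semi_nmf(X, k=3)
```

In a batch-detection setting, each row of `G` could be read as a sample's non-negative loading on `k` latent components, so samples sharing a dominant component are candidates for the same hidden batch. That interpretation is an assumption of this sketch rather than a description of how DASC assigns batches.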
The DASC R package is available via Bioconductor or at https://github.com/zhanglabNKU/DASC.
Contact: zhanghan@nankai.edu.cn or zhandonl@bcm.edu.
Supplementary data are available at Bioinformatics online.