Detecting hidden batch factors through data-adaptive adjustment for biological effects.

Affiliations

College of Computer and Control Engineering, Nankai University, Tianjin 300350, China.

Graduate Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, TX 77030, USA.

Publication information

Bioinformatics. 2018 Apr 1;34(7):1141-1147. doi: 10.1093/bioinformatics/btx635.

DOI:10.1093/bioinformatics/btx635

PMID:29617963

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6454417/

Abstract

MOTIVATION

Batch effects are one of the major sources of technical variation affecting measurements in high-throughput studies such as RNA sequencing. It is well established that batch effects can arise from different experimental platforms, laboratory conditions, sample sources and personnel. These differences can confound the outcomes of interest and lead to spurious results. A critical input for batch correction algorithms is knowledge of the batch factors, which in many cases are unknown or inaccurate. Hence, the primary motivation of our paper is to detect hidden batch factors that can be used in standard techniques to accurately capture the relationship between gene expression and other modeled variables of interest.

RESULTS

We introduce a new algorithm based on data-adaptive shrinkage and semi-Non-negative Matrix Factorization for the detection of unknown batch effects. We test our algorithm on three different datasets: (i) Sequencing Quality Control, (ii) Topotecan RNA-Seq and (iii) single-cell RNA sequencing (scRNA-Seq) on Glioblastoma Multiforme. In all three datasets, our algorithm outperforms existing batch detection algorithms in identifying hidden batch effects. In the Topotecan study, we identified a new batch factor that was missed by the original study, an omission that led to under-representation of differentially expressed genes. For scRNA-Seq, we demonstrate the power of our method in detecting subtle batch effects.
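The abstract names semi-Non-negative Matrix Factorization without spelling out how it works. As a rough illustration only, and not the DASC implementation, the sketch below applies the commonly used multiplicative update rule for semi-NMF (as described by Ding, Li and Jordan) to a toy matrix. The function name `semi_nmf`, the simulated data, the number of factors and the iteration count are all assumptions made for this example; the data-adaptive shrinkage step of the paper is not shown.

```python
import numpy as np


def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate X (features x samples, mixed signs) by F @ G.T with G >= 0.

    Hypothetical helper for illustration; not part of the DASC package.
    """
    rng = np.random.default_rng(seed)
    n_samples = X.shape[1]
    G = rng.random((n_samples, k)) + 0.1  # strictly positive starting point
    errors = []
    for _ in range(n_iter):
        # F is unconstrained: least-squares solution for the current G.
        F = X @ G @ np.linalg.pinv(G.T @ G)
        errors.append(np.linalg.norm(X - F @ G.T))
        A = X.T @ F  # samples x k
        B = F.T @ F  # k x k
        # Split each matrix into its positive and negative parts.
        A_pos, A_neg = (np.abs(A) + A) / 2, (np.abs(A) - A) / 2
        B_pos, B_neg = (np.abs(B) + B) / 2, (np.abs(B) - B) / 2
        # Multiplicative update: every factor is non-negative, so G stays >= 0.
        G = G * np.sqrt((A_pos + G @ B_neg + eps) / (A_neg + G @ B_pos + eps))
    # Refit F for the final G and record the final reconstruction error.
    F = X @ G @ np.linalg.pinv(G.T @ G)
    errors.append(np.linalg.norm(X - F @ G.T))
    return F, G, errors


# Toy data (an assumption for this sketch): 50 features, 12 samples, where
# the last 6 samples carry a simulated shift on the first 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))
X[:20, 6:] += 3.0

F, G, errors = semi_nmf(X, k=2)
labels = G.argmax(axis=1)  # hard assignment of each sample to one factor
```

Here `G` holds one non-negative loading per sample and factor, so `labels` can be read as a candidate grouping of samples to inspect against known covariates. Whether such a grouping reflects a batch rather than biology is exactly what the paper's adjustment for biological effects is meant to address, and this sketch does not attempt that.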

AVAILABILITY AND IMPLEMENTATION

The DASC R package is available via Bioconductor or at https://github.com/zhanglabNKU/DASC.

CONTACT

zhanghan@nankai.edu.cn or zhandonl@bcm.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
