Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality.

Affiliation

Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Hans-Dieter-Hüsch-Weg 15, 55128, Mainz, Germany.

Publication information

BMC Bioinformatics. 2022 Jul 14;23(Suppl 6):279. doi: 10.1186/s12859-022-04775-y.

DOI:10.1186/s12859-022-04775-y

PMID:35836114

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9284682/

Abstract

BACKGROUND

The constant evolution and development of next-generation sequencing techniques produces high-throughput datasets that include large numbers of biological samples. Although large sample sets are usually processed experimentally in batches, scientific publications are often vague about this information, even though batch processing can greatly affect sample quality and confound subsequent statistical analyses. Because dedicated bioinformatics methods developed to detect unwanted sources of variance in the data can mistakenly flag real biological signals, such methods could benefit from a quality-aware approach.

RESULTS

We recently developed statistical guidelines and a machine learning tool to automatically evaluate the quality of a next-generation sequencing sample. We leveraged this quality assessment to detect and correct batch effects in 12 publicly available RNA-seq datasets with available batch information. We were able to distinguish batches by our quality score and used it to correct for some batch effects in sample clustering. Overall, the correction was evaluated as comparable to or better than the reference method that uses a priori knowledge of the batches (comparable in 10 and better in 1 of 12 datasets; 11 of 12 in total, 92%). When coupled with outlier removal, the correction was more often evaluated as better than the reference (comparable in 5 and better in 6 of 12 datasets; 11 of 12 in total, 92%).
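The abstract does not spell out how the quality score enters the detection or correction step, so the following is only a minimal sketch of one generic way to use a per-sample quality score for both purposes. It assumes that batch separation is summarized as the share of quality-score variance lying between batches (eta squared), and that correction is a simple per-gene linear regression on the quality score. The function names and the toy values are hypothetical and are not taken from the paper or its software.

```python
from statistics import fmean


def variance_explained_by_batch(quality, batches):
    """Eta squared: share of quality-score variance that lies between batches.

    Assumption: this is an illustrative effect size, not the paper's test.
    """
    grand = fmean(quality)
    total = sum((q - grand) ** 2 for q in quality)
    if total == 0:
        return 0.0  # all samples have the same score, nothing to explain
    groups = {}
    for q, b in zip(quality, batches):
        groups.setdefault(b, []).append(q)
    between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups.values())
    return between / total


def regress_out_quality(expression, quality):
    """Remove the linear effect of the quality score from one gene.

    Fits expression ~ a + b * quality by least squares and subtracts
    b * (quality - mean quality), so the gene's mean level is preserved.
    Assumption: generic covariate regression, not the authors' procedure.
    """
    mq = fmean(quality)
    me = fmean(expression)
    sxx = sum((q - mq) ** 2 for q in quality)
    if sxx == 0:
        return list(expression)  # constant score, no correction possible
    sxy = sum((q - mq) * (e - me) for q, e in zip(quality, expression))
    slope = sxy / sxx
    return [e - slope * (q - mq) for q, e in zip(quality, expression)]


# Hypothetical toy data: four samples in two batches.
quality = [0.1, 0.2, 0.8, 0.9]
batches = ["A", "A", "B", "B"]
gene = [1.0, 2.0, 8.0, 9.0]  # e.g. log-scale expression of one gene

eta_squared = variance_explained_by_batch(quality, batches)
corrected = regress_out_quality(gene, quality)
```

In this toy case the quality scores separate the two batches almost completely, and the gene's variation is fully explained by the score, so the corrected values collapse to the gene's mean. A real analysis would typically work on normalized, log-transformed expression and would keep biological covariates in the model so that genuine signal is not regressed away.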

CONCLUSIONS

In this work, we show that our software can detect batches in public RNA-seq datasets from differences in the predicted quality of their samples. We also use these insights to correct the batch effect and to examine the relation between sample quality and batch effect. These observations reinforce our expectation that, while batch effects do correlate with differences in quality, they also arise from other artifacts and are more suitably corrected statistically in well-designed experiments.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ee83/9284682/2465132fa28e/12859_2022_4775_Fig1_HTML.jpg
