Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality.

Affiliation

Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Hans-Dieter-Hüsch-Weg 15, 55128, Mainz, Germany.

Publication information

BMC Bioinformatics. 2022 Jul 14;23(Suppl 6):279. doi: 10.1186/s12859-022-04775-y.

Abstract

BACKGROUND

The continuous development of next-generation sequencing techniques has led to high-throughput datasets that include a large number of biological samples. Although large sample sets are usually processed experimentally in batches, scientific publications are often vague about this information, even though batch processing can greatly affect sample quality and confound downstream statistical analyses. Because dedicated bioinformatics methods developed to detect unwanted sources of variance in the data can mistake real biological signals for such variance, these methods could benefit from a quality-aware approach.

RESULTS

We recently developed statistical guidelines and a machine learning tool to automatically evaluate the quality of a next-generation sequencing sample. We leveraged this quality assessment to detect and correct batch effects in 12 publicly available RNA-seq datasets with available batch information. We were able to distinguish batches by our quality score and used it to correct for some batch effects in sample clustering. Overall, the correction was evaluated as comparable to or better than the reference method that uses a priori knowledge of the batches (comparable in 10 and better in 1 of 12 datasets; total = 92%). When coupled with outlier removal, the correction was more often evaluated as better than the reference (comparable in 5 and better in 6 of 12 datasets; total = 92%).
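The idea of using a per-sample quality score both to detect and to correct a batch effect can be illustrated with a minimal sketch. It is not the authors' software: the function names `batch_separation` and `regress_out_quality`, the use of eta squared as a separation measure, and the simple linear regression are all assumptions made for illustration. The sketch assumes a genes-by-samples matrix of log-scale expression values and one quality score per sample.

```python
import numpy as np

def batch_separation(quality, batch):
    """Share of the variance in quality scores explained by batch (eta squared).

    Hypothetical helper: values near 1 mean batches differ strongly in quality.
    """
    q = np.asarray(quality, dtype=float)
    batch = np.asarray(batch)
    grand_mean = q.mean()
    ss_total = ((q - grand_mean) ** 2).sum()
    ss_between = sum(
        (batch == b).sum() * (q[batch == b].mean() - grand_mean) ** 2
        for b in np.unique(batch)
    )
    return ss_between / ss_total if ss_total > 0 else 0.0

def regress_out_quality(expr, quality):
    """Remove the linear effect of the quality score from every gene.

    expr: (n_genes, n_samples) array of log-scale expression values.
    quality: (n_samples,) array of per-sample quality scores.
    Hypothetical helper: a plain least-squares adjustment, assumed here
    as a stand-in for a quality-aware correction.
    """
    expr = np.asarray(expr, dtype=float)
    q = np.asarray(quality, dtype=float)
    q_centered = q - q.mean()
    # Design matrix: intercept plus centered quality score.
    design = np.column_stack([np.ones_like(q_centered), q_centered])
    # Fit all genes at once; coef has shape (2, n_genes).
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    # Subtract only the quality term so each gene keeps its mean.
    return expr - np.outer(coef[1], q_centered)
```

Because the quality score is centered before fitting, each gene's mean expression is unchanged and only the component that varies linearly with quality is removed. A correction of this kind cannot remove batch effects that are unrelated to quality.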

CONCLUSIONS

In this work, we show that our software can detect batches in public RNA-seq datasets from differences in the predicted quality of their samples. We also use these insights to correct the batch effect and to examine the relationship between sample quality and batch effect. These observations reinforce our expectation that, while batch effects do correlate with differences in quality, they also arise from other artifacts and are more suitably corrected statistically in well-designed experiments.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ee83/9284682/2465132fa28e/12859_2022_4775_Fig1_HTML.jpg
