Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset.

Author information

Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.

Faculty of Medicine, Technion - Israel Institute of Technology, Haifa, Israel.

Publication information

BMC Bioinformatics. 2019 May 28;20(1):268. doi: 10.1186/s12859-019-2855-9.

Abstract

BACKGROUND

Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher.

RESULTS

We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach compares the co-expression of adjusted gene-gene pairs with a-priori knowledge of high-confidence gene-gene associations, drawn from an external reference based on thousands of unrelated experiments. Our framework includes three steps: (1) adjusting the data with the desired methods; (2) calculating gene-gene co-expression measurements for the adjusted datasets; and (3) evaluating the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data from six representative tissue datasets derived from the GTEx project.
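The three steps can be illustrated with a minimal Python sketch. This is not the B-CeF R package. The abstract does not specify the co-expression measure or the performance metric, so the absolute Pearson correlation, the AUC score, the per-batch mean centering used as a stand-in for step (1), and the toy data are all assumptions made for illustration.

```python
import numpy as np

def coexpression(expr):
    """Step 2: absolute Pearson correlation for every gene pair (i < j).

    expr is a samples x genes array.
    """
    corr = np.corrcoef(expr, rowvar=False)
    n_genes = corr.shape[0]
    return {(i, j): abs(corr[i, j])
            for i in range(n_genes) for j in range(i + 1, n_genes)}

def auc(scores, gold_pairs):
    """Step 3: probability that a gold-standard pair outscores another pair.

    Ties count as half a win.
    """
    pos = [s for pair, s in scores.items() if pair in gold_pairs]
    neg = [s for pair, s in scores.items() if pair not in gold_pairs]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def center_by_batch(expr, batch):
    """Stand-in for step 1 (assumption): subtract each batch's per-gene mean."""
    adjusted = expr.astype(float)  # astype returns a copy
    for b in np.unique(batch):
        rows = batch == b
        adjusted[rows] -= adjusted[rows].mean(axis=0)
    return adjusted

# Toy data (hypothetical): genes 0 and 1 share a signal; genes 2 and 3 do not.
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])
s = np.array([1, -1, 1, -1, 1, -1, 1, -1])
t = np.array([1, 1, -1, -1, 1, 1, -1, -1])
u = np.array([1, -1, -1, 1, 1, -1, -1, 1])
clean = np.column_stack([s, 2 * s, t, u]).astype(float)
# Batch shifts of opposite sign hide the true 0-1 association.
raw = clean + np.outer(batch, [5.0, -5.0, 5.0, 5.0])

gold_pairs = {(0, 1)}  # hypothetical a-priori association
auc_raw = auc(coexpression(raw), gold_pairs)
auc_adjusted = auc(coexpression(center_by_batch(raw, batch)), gold_pairs)
print(auc_raw, auc_adjusted)
```

On this toy data the gold-standard pair ranks last among all gene pairs before adjustment and first afterwards, so the score rises after correction. With real data, a score that falls after adjustment would suggest that the correction removed biological signal.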

CONCLUSIONS

Our framework makes it possible to evaluate batch correction methods by how well they preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
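Correcting for known confounders with multiple linear regression can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `regress_out`, the two covariates (a batch indicator and donor age), and the toy values are hypothetical. Each gene is fitted on the covariates by ordinary least squares, and the residuals are kept as the adjusted expression.

```python
import numpy as np

def regress_out(expr, covariates):
    """Residuals of each gene after a least-squares fit on known covariates.

    expr: samples x genes; covariates: samples x k (numeric or 0/1 coded).
    """
    # Design matrix with an intercept column followed by the covariates.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef

# Toy data (hypothetical covariates: a batch indicator and donor age).
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
age = np.array([1, 2, 3, 4, 1, 2, 3, 4], dtype=float)
# Signal chosen to be uncorrelated with both covariates.
signal = np.array([1, -1, -1, 1, 1, -1, -1, 1], dtype=float)

covariates = np.column_stack([batch, age])
expr = np.column_stack([
    10 + 3 * batch + 0.5 * age + signal,   # gene 0
    4 - 2 * batch + 1.5 * age - signal,    # gene 1
])
residuals = regress_out(expr, covariates)
```

Because the toy signal is uncorrelated with the covariates, the residuals recover it up to floating-point error. When a biological signal is correlated with a known confounder, part of that signal is removed together with the confounder.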

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8629/6537327/1a0aa265dd16/12859_2019_2855_Fig1_HTML.jpg
