POIBM: batch correction of heterogeneous RNA-seq datasets through latent sample matching.

Author information

Holmström Susanna, Hautaniemi Sampsa, Häkkinen Antti

Affiliation

Research Program in Systems Oncology, Research Programs Unit, Faculty of Medicine, University of Helsinki, FI-00014 Helsinki, Finland.

Publication information

Bioinformatics. 2022 Apr 28;38(9):2474-2480. doi: 10.1093/bioinformatics/btac124.

DOI:10.1093/bioinformatics/btac124

PMID:35199138

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9048693/

Abstract

MOTIVATION

RNA sequencing and other high-throughput technologies are essential in understanding complex diseases, such as cancers, but are susceptible to technical factors manifesting as patterns in the measurements. These batch patterns hinder the discovery of biologically relevant patterns. Unbiased batch effect correction in heterogeneous populations currently requires special experimental designs or phenotypic labels, which are not readily available for patient samples in existing datasets.

RESULTS

We present POIBM, an RNA-seq batch correction method, which learns virtual reference samples directly from the data. We use a breast cancer cell line dataset to show that POIBM exceeds or matches the performance of previous methods, while being blind to the phenotypes. Further, we analyze The Cancer Genome Atlas RNA-seq data to show that batch effects plague many cancer types; POIBM effectively discovers the true replicates in stomach adenocarcinoma; and integrating the corrected data in endometrial carcinoma improves cancer subtyping.

AVAILABILITY AND IMPLEMENTATION

https://bitbucket.org/anthakki/poibm/ (archived at https://doi.org/10.5281/zenodo.6122436).

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
