Overlooked poor-quality patient samples in sequencing data impair reproducibility of published clinically relevant datasets.

Affiliations

Faculty of Biology, Johannes Gutenberg-Universität Mainz, Biozentrum I, Hans-Dieter-Hüsch-Weg 15, Mainz, 55128, Germany.

Central Institute for Decision Support Systems in Crop Protection (ZEPP), Rüdesheimer Str. 60-68, Bad Kreuznach, 55545, Germany.

Publication information

Genome Biol. 2024 Aug 16;25(1):222. doi: 10.1186/s13059-024-03331-6.

DOI:10.1186/s13059-024-03331-6

PMID:39152483

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11328481/

Abstract

BACKGROUND

Reproducibility is a major concern in biomedical studies, and existing publication guidelines do not solve the problem. Batch effects and quality imbalances between groups of biological samples are major factors hampering reproducibility. Yet, the latter is rarely considered in the scientific literature.

RESULTS

Our analysis uses 40 clinically relevant RNA-seq datasets to quantify the impact of quality imbalance between groups of samples on the reproducibility of gene expression studies. A high degree of quality imbalance is frequent (14 datasets; 35%), and hundreds of quality markers are present in more than 50% of the datasets. Enrichment analysis suggests common stress-driven effects among the low-quality samples and highlights a complementary role of transcription factors and miRNAs in regulating the stress response. Preliminary ChIP-seq results show similar trends. Quality imbalance affects the number of differential genes derived by comparing control to disease samples (the higher the imbalance, the higher the number of genes), the proportion of quality markers among the top differential genes (the higher the imbalance, the higher the proportion; up to 22%), and the proportion of known disease genes among the top differential genes (the higher the imbalance, the lower the proportion). We show that removing outliers based on their quality score improves the downstream analysis.

CONCLUSIONS

Thanks to a stringent selection of well-designed datasets, we demonstrate that quality imbalance between groups of samples can significantly reduce the relevance of differential genes, consequently reducing reproducibility between studies. Appropriate experimental design and analysis methods can substantially reduce the problem.
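The abstract does not say how quality scores or the imbalance between groups are computed, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes each sample has a hypothetical score between 0 and 1 where higher means lower quality, uses the absolute correlation between that score and the group label as a stand-in imbalance measure, and uses Tukey's fence as a stand-in outlier rule. The function names and the threshold `k` are illustrative and are not taken from the paper.

```python
from statistics import mean, pstdev, quantiles


def quality_imbalance(scores, groups):
    """Illustrative imbalance measure (an assumption, not the paper's index).

    Returns the absolute Pearson correlation between per-sample quality
    scores and a 0/1 encoding of the two group labels: values near 0 mean
    quality is balanced across groups, values near 1 mean one group holds
    most of the low-quality samples.
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("expected exactly two groups, e.g. control and disease")
    y = [1.0 if g == labels[1] else 0.0 for g in groups]
    mean_x, mean_y = mean(scores), mean(y)
    sd_x, sd_y = pstdev(scores), pstdev(y)
    if sd_x == 0 or sd_y == 0:
        return 0.0  # no variation in scores, so no measurable imbalance
    cov = mean((a - mean_x) * (b - mean_y) for a, b in zip(scores, y))
    return abs(cov / (sd_x * sd_y))


def drop_quality_outliers(sample_ids, scores, k=1.5):
    """Keep samples whose score is at most Q3 + k * IQR (Tukey's fence).

    The fence and k = 1.5 are a generic outlier convention chosen for this
    sketch, not a threshold reported by the authors.
    """
    q1, _, q3 = quantiles(scores, n=4)
    fence = q3 + k * (q3 - q1)
    return [sid for sid, s in zip(sample_ids, scores) if s <= fence]


# Hypothetical toy data: one disease sample has a much worse quality score.
ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
scores = [0.10, 0.12, 0.11, 0.13, 0.10, 0.95]
groups = ["control", "control", "control", "disease", "disease", "disease"]

print("imbalance before filtering:", quality_imbalance(scores, groups))
kept = drop_quality_outliers(ids, scores)
print("samples kept:", kept)
```

In a real analysis the imbalance would be recomputed on the retained samples before running the differential expression step, so that the effect of filtering on group balance can be checked rather than assumed.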
