Wellcome-MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge CB2 0AW, UK.
Life Sciences-Transcriptomics and Functional Genomics Lab, Barcelona Supercomputing Center (BSC-CNS), 08034 Barcelona, Spain.
Genes (Basel). 2022 Dec 1;13(12):2265. doi: 10.3390/genes13122265.
Advances in high-throughput sequencing (HTS) have enabled the characterisation of biological processes at an unprecedented level of detail; most hypotheses in molecular biology rely on analyses of HTS data. However, achieving increased robustness and reproducibility of results remains a major challenge. Although variability in results may be introduced at various stages, e.g., alignment, summarisation or detection of differential expression, one source of variability has been systematically overlooked: the sequencing design, which propagates through analyses and may introduce an additional layer of technical variation. We illustrate qualitative and quantitative differences arising from splitting samples across lanes in bulk and single-cell sequencing. For bulk mRNAseq data, we focus on differential expression and enrichment analyses; for bulk ChIPseq data, we investigate the effect on peak calling and on the properties of the peaks. At the single-cell level, we concentrate on identifying cell subpopulations, relying on the markers used for assigning cell identities; both smartSeq and 10× data are presented. The observed reduction in the number of unique sequenced fragments limits the level of detail on which the different prediction approaches depend. Furthermore, sequencing stochasticity adds a weighting bias, compounded by variable sequencing depths and a (yet unexplained) sequencing bias. Consequently, we observe an overall reduction in sequencing complexity and a distortion of the biological signal across technologies, experimental contexts, organisms and tissues.