Department of Pathology, Johns Hopkins University SOM, Baltimore, MD, 21205, USA.
McKusick-Nathans Institute, Department of Genetic Medicine, Johns Hopkins University SOM, Baltimore, MD, 21205, USA.
Nat Commun. 2020 Apr 22;11(1):1933. doi: 10.1038/s41467-020-15821-9.
A challenge of next-generation sequencing is read contamination. We use Genotype-Tissue Expression (GTEx) datasets and technical metadata, along with RNA-seq datasets from other studies, to understand the factors that contribute to contamination. Here we report that, of 48 analyzed tissues in GTEx, 26 have variant co-expression clusters of four highly expressed, pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicate contamination. Sample contamination is strongly associated with a sample being sequenced on the same day as a tissue that natively expresses those genes. Discrepant SNPs across four contaminating genes validate the contamination. Low-level contamination affects ~40% of samples and leads to numerous eQTL assignments in inappropriate tissues among these 18 genes. This type of contamination occurs widely, impacting both bulk and single-cell RNA-seq (scRNA-seq) dataset analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets, impacting analyses.
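The same-day association described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the sample records, the `"Pancreas"` tissue label, the expression threshold of 10 (assumed to be in TPM), and the helper names are all hypothetical, and a one-sided Fisher's exact test stands in for whatever statistics the study actually used. The toy values are made up purely to exercise the code.

```python
from math import comb

# The four pancreas-enriched genes named in the abstract.
PANCREAS_GENES = ("PRSS1", "PNLIP", "CLPS", "CELA3A")


def is_flagged(expr, threshold=10.0):
    """True if any pancreas-enriched gene exceeds the (assumed) threshold."""
    return any(expr.get(gene, 0.0) > threshold for gene in PANCREAS_GENES)


def same_day_table(samples, threshold=10.0):
    """Build 2x2 counts (a, b, c, d) over non-pancreas samples.

    a = sequenced on a pancreas day and flagged
    b = sequenced on a pancreas day and not flagged
    c = sequenced on another day and flagged
    d = sequenced on another day and not flagged
    """
    pancreas_days = {s["date"] for s in samples if s["tissue"] == "Pancreas"}
    a = b = c = d = 0
    for s in samples:
        if s["tissue"] == "Pancreas":
            continue  # native expression, not contamination
        same_day = s["date"] in pancreas_days
        flagged = is_flagged(s["expr"], threshold)
        if same_day and flagged:
            a += 1
        elif same_day:
            b += 1
        elif flagged:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test: P(X >= a) under the hypergeometric null."""
    n = a + b + c + d
    row1 = a + b  # samples sequenced on a pancreas day
    col1 = a + c  # flagged samples
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Hypothetical toy records (illustrative values only, not GTEx data).
samples = [
    {"tissue": "Pancreas", "date": "day1", "expr": {"PRSS1": 9000.0}},
    {"tissue": "Lung", "date": "day1", "expr": {"PRSS1": 50.0}},
    {"tissue": "Heart", "date": "day1", "expr": {"PNLIP": 30.0}},
    {"tissue": "Liver", "date": "day1", "expr": {}},
    {"tissue": "Lung", "date": "day2", "expr": {}},
    {"tissue": "Heart", "date": "day2", "expr": {}},
    {"tissue": "Liver", "date": "day2", "expr": {"PRSS1": 1.0}},
    {"tissue": "Skin", "date": "day2", "expr": {"CLPS": 20.0}},
]

table = same_day_table(samples)
print(table, fisher_one_sided(*table))
```

With real data, the date field would come from the sequencing metadata, and the same table could be built per contaminating gene set rather than for the pancreas genes alone.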