Department of Biomedical Systems Informatics and Brain Korea 21 PLUS Project for Medical Science, Yonsei University College of Medicine, Seoul, 03722, South Korea.
Genome Biol. 2019 Nov 11;20(1):231. doi: 10.1186/s13059-019-1849-2.
Patient-derived xenograft and cell line models are widely used in clinical cancer research. However, the unavoidable presence of mouse genomic material in a patient-derived model remains a concern for analysis. Although multiple tools and filtering strategies have been developed to account for this, research has yet to demonstrate the exact impact of the mouse genome or the optimal use of these tools and filtering strategies in an analysis pipeline.
We construct a benchmark dataset of 5 liver tissues from 3 mouse strains using a human whole-exome sequencing kit. Next-generation sequencing reads from mouse tissues are mappable to 49% of the human genome and 409 cancer genes. In total, 1,207,556 mouse-specific alleles are aligned to the human genome reference, including 467,232 (38.7%) alleles with high sensitivity to contamination, which are pervasive causes of false cancer mutations in public databases and are signatures for predicting global contamination. Next, we assess the performance of 8 filtering methods in terms of mouse read filtration and reduction of mouse-specific alleles. All filtering tools generally perform well, although they differ in algorithm strictness and in the efficiency of mouse allele removal. Therefore, we develop a best practice pipeline that comprises estimation of the contamination level, mouse read filtration, and variant filtration.
The inclusion of mouse cells in patient-derived models hinders genomic analysis and should be addressed carefully. Our suggested guidelines improve the robustness and maximize the utility of genomic analysis of these models.
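As a rough illustration of two of the pipeline stages named above (contamination estimation and variant filtration), the following Python sketch works on toy data. It is not the authors' implementation: the abstract does not specify the estimator or the filter, so the median-based estimate, the function names, and the variant record layout are all assumptions made for illustration. Mouse read filtration is left out because it depends on external alignment tools.

```python
from statistics import median


def high_sensitivity_fraction(n_sensitive=467_232, n_total=1_207_556):
    """Fraction of mouse-specific alleles reported as highly sensitive to contamination.

    The default counts are the figures quoted in the abstract.
    """
    return n_sensitive / n_total


def estimate_contamination(vafs_at_mouse_sites):
    """Hypothetical estimator (an assumption, not the paper's method):
    the median variant allele frequency observed at known mouse-specific
    allele sites, used as a proxy for the global mouse contamination level.
    """
    if not vafs_at_mouse_sites:
        return 0.0
    return median(vafs_at_mouse_sites)


def filter_variants(variants, mouse_allele_keys):
    """Drop variant calls that match a known mouse-specific allele.

    `variants` is a list of dicts with hypothetical keys "chrom", "pos", "alt";
    `mouse_allele_keys` is a set of (chrom, pos, alt) tuples.
    """
    return [
        v for v in variants
        if (v["chrom"], v["pos"], v["alt"]) not in mouse_allele_keys
    ]


if __name__ == "__main__":
    # Toy inputs, invented purely for demonstration.
    toy_vafs = [0.10, 0.12, 0.08]
    toy_variants = [
        {"chrom": "chr1", "pos": 100, "alt": "T"},
        {"chrom": "chr2", "pos": 200, "alt": "G"},
    ]
    toy_mouse_alleles = {("chr1", 100, "T")}

    print(f"High-sensitivity share: {100 * high_sensitivity_fraction():.1f}%")
    print(f"Estimated contamination: {estimate_contamination(toy_vafs):.2f}")
    print(f"Variants kept: {filter_variants(toy_variants, toy_mouse_alleles)}")
```

The first function simply reproduces the 38.7% figure from the counts in the abstract. In a real analysis the list of mouse-specific alleles would come from a benchmark resource such as the one the study describes, and the estimator would need validation against samples with known contamination levels.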