Chang Matthew S, Martinez Katherine A, Lattimore Chayil C, Gobin Christina M, Newsom Kimberly J, Fredenburg Kristianna M
Department of Pathology, Immunology, and Laboratory Medicine, University of Florida, Gainesville, FL 32610, United States.
Biol Methods Protoc. 2025 Jun 2;10(1):bpaf043. doi: 10.1093/biomethods/bpaf043. eCollection 2025.
Cancer cell lines have provided invaluable preclinical mechanistic data for cancer health disparities research. Although several studies detail ancestry inference methods using microarray data, none document ancestry inference methods for sequencing data. Here, we describe our computational workflow for inferring genetic ancestry from either whole genome sequencing (WGS) or RNA-sequencing (RNA-seq) data from cancer cell lines. RNA-seq and WGS datasets were generated from four head and neck cancer cell lines with self-identified race/ethnicity (SIRE) of either White or Black. Our workflow included variant calling and genotype imputation via Illumina DRAGEN pipelines, merging of the genotyping datasets with the 1000 Genomes Project (1KGP), single nucleotide polymorphism (SNP) filtering via PLINK, and ancestry inference with ADMIXTURE. During workflow development, we encountered challenges with SNP filtering and with clustering of 1KGP superpopulations. Adjusting the stringency of the filtering parameters to a window size of 100 kb and a threshold of 0.8 left 312,821 SNPs in the RNA-seq dataset and 1,569,578 SNPs in the WGS dataset. Clustering with 1KGP improved when a panel of 291 ancestry informative markers was used. To estimate proportions of genetic ancestry, we used all filtered SNPs. For the WGS dataset, both clustering and genetic ancestry proportions for each cancer cell line were concordant with SIRE. In conclusion, our optimized workflow offers investigators a robust approach for inferring genetic ancestry from cancer cell line sequencing data, and our results suggest that WGS datasets are superior to RNA-seq datasets both in clustering superpopulations and in estimating genetic ancestry accurately.
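The final step of the workflow, turning ADMIXTURE output into per-cell-line ancestry proportions that can be compared with SIRE, can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes an unsupervised ADMIXTURE run whose `.Q` output (one row per sample, K ancestry fractions per row) has already been loaded into a list, and that 1KGP reference samples carry superpopulation labels while cell lines are unlabeled. The function names, the sample values, and the rule of naming each cluster after the reference superpopulation with the highest mean membership are all assumptions made for illustration; the numbers are made up and are not results from the study.

```python
from collections import defaultdict


def label_clusters(q_rows, superpops):
    """Name each ADMIXTURE cluster after the reference superpopulation
    with the highest mean membership in that cluster.

    q_rows    -- list of rows, each a list of K ancestry fractions
    superpops -- parallel list of labels; None marks an unlabeled sample
    """
    k = len(q_rows[0])
    sums = defaultdict(lambda: [0.0] * k)
    counts = defaultdict(int)
    for row, pop in zip(q_rows, superpops):
        if pop is None:  # study sample (e.g. a cell line): not used for labeling
            continue
        counts[pop] += 1
        for j, q in enumerate(row):
            sums[pop][j] += q
    return {
        j: max(counts, key=lambda pop: sums[pop][j] / counts[pop])
        for j in range(k)
    }


def ancestry_proportions(q_row, labels):
    """Sum one sample's cluster fractions by superpopulation label."""
    props = defaultdict(float)
    for j, q in enumerate(q_row):
        props[labels[j]] += q
    return dict(props)


# Hypothetical toy values for K = 3 (illustration only, not study data).
Q_ROWS = [
    [0.97, 0.02, 0.01],  # reference sample labeled AFR
    [0.95, 0.03, 0.02],  # reference sample labeled AFR
    [0.02, 0.96, 0.02],  # reference sample labeled EUR
    [0.01, 0.98, 0.01],  # reference sample labeled EUR
    [0.01, 0.02, 0.97],  # reference sample labeled EAS
    [0.10, 0.85, 0.05],  # hypothetical cell line, unlabeled
]
SUPERPOPS = ["AFR", "AFR", "EUR", "EUR", "EAS", None]

LABELS = label_clusters(Q_ROWS, SUPERPOPS)
CELL_LINE = ancestry_proportions(Q_ROWS[-1], LABELS)
MAJOR = max(CELL_LINE, key=CELL_LINE.get)

print(LABELS)
print(CELL_LINE, "major ancestry:", MAJOR)
```

The major component reported here is the quantity one would compare against SIRE. If two clusters receive the same superpopulation label, their fractions are added together, which may or may not be appropriate for a given analysis.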