Hong Sung Chul, Muyas Francesc, Cortés-Ciriano Isidro, Hormoz Sahand
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215 USA.
European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD UK.
BMC Methods. 2025;2(1):10. doi: 10.1186/s44330-025-00029-4. Epub 2025 May 19.
Collaborative efforts, such as the Human Cell Atlas, are rapidly accumulating large amounts of single-cell data. To ensure that single-cell atlases are representative of human genetic diversity, we need to determine the ancestry of the donors from whom single-cell data are generated. Self-reporting of race and ethnicity, although important, can be biased and is not always available for the datasets already collected.
Here, we introduce scAI-SNP, a tool to infer ancestry directly from single-cell genomics data. To train scAI-SNP, we identified 4.5 million ancestry-informative single-nucleotide polymorphisms (SNPs) in the 1000 Genomes Project dataset across 3201 individuals from 26 population groups. For a query single-cell dataset, scAI-SNP uses these ancestry-informative SNPs to compute the contribution of each of the 26 population groups to the ancestry of the donor from whom the cells were obtained.
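As an illustration of what "computing the contribution of each population group" can mean, the sketch below fits non-negative mixture weights that sum to one, so that a weighted average of per-population allele frequencies best matches the donor's observed alleles at the SNPs covered by the single-cell data. This is a minimal sketch under stated assumptions and not necessarily the procedure scAI-SNP implements: the abstract does not specify the estimator. The function names, the population labels, the toy frequencies, and the choice of projected gradient descent are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def project_to_simplex(v: Sequence[float]) -> List[float]:
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    cumulative = 0.0
    theta = 0.0
    for i, value in enumerate(sorted(v, reverse=True), start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / i
        if value - candidate > 0:
            theta = candidate  # keep the threshold from the last valid index
    return [max(x - theta, 0.0) for x in v]


def estimate_ancestry_weights(
    pop_freqs: Dict[str, Sequence[float]],
    query: Sequence[Optional[float]],
    n_iter: int = 2000,
) -> Dict[str, float]:
    """Fit mixture weights over populations by projected gradient descent.

    pop_freqs: hypothetical alternate-allele frequency per SNP for each
               population (values in [0, 1]).
    query:     observed alternate-allele fraction per SNP for the donor,
               or None where the sparse single-cell data has no coverage.
    """
    labels = list(pop_freqs)
    observed = [j for j, g in enumerate(query) if g is not None]
    if not observed:
        raise ValueError("no SNP is covered in the query")

    k = len(labels)
    weights = [1.0 / k] * k
    # Frequencies lie in [0, 1], so the gradient of the mean squared error
    # is at most k-Lipschitz and a step of 1/k is a safe choice.
    step = 1.0 / k
    for _ in range(n_iter):
        residuals = [
            sum(w * pop_freqs[p][j] for w, p in zip(weights, labels)) - query[j]
            for j in observed
        ]
        grad = [
            sum(r * pop_freqs[p][j] for r, j in zip(residuals, observed))
            / len(observed)
            for p in labels
        ]
        weights = project_to_simplex(
            [w - step * g for w, g in zip(weights, grad)]
        )
    return dict(zip(labels, weights))


# Toy data (made up for illustration): three populations, six SNPs.
toy_freqs = {
    "POP_A": [0.9, 0.1, 0.8, 0.2, 0.5, 0.1],
    "POP_B": [0.1, 0.9, 0.2, 0.7, 0.5, 0.3],
    "POP_C": [0.2, 0.2, 0.9, 0.9, 0.1, 0.8],
}
# Donor built as 70% POP_A + 30% POP_B; the fifth SNP is not covered.
toy_query = [0.66, 0.34, 0.62, 0.35, None, 0.16]

toy_weights = estimate_ancestry_weights(toy_freqs, toy_query)
print({label: round(w, 3) for label, w in toy_weights.items()})
```

On this toy input the fitted weights should come out close to 0.7 for POP_A, 0.3 for POP_B, and 0 for POP_C. Restricting the fit to covered SNPs is one simple way to handle sparsity. A realistic pipeline would work with millions of SNPs and noisy genotype calls rather than exact mixtures.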
Using diverse single-cell datasets with matched whole-genome sequencing data, we show that scAI-SNP is robust to the sparsity of single-cell data, can accurately and consistently infer ancestry from samples derived from diverse types of tissues and cancer cells, and can be applied to different modalities of single-cell profiling assays, such as single-cell RNA-seq and single-cell ATAC-seq.
Finally, we argue that ensuring single-cell atlases represent diverse ancestry, ideally alongside race and ethnicity, is important for achieving improved and equitable health outcomes that account for human diversity.
The online version contains supplementary material available at 10.1186/s44330-025-00029-4.