Zhong Huawen, Han Wenkai, Gomez-Cabrero David, Tegner Jesper, Gao Xin, Cui Guoxin, Aranda Manuel
BioEngineering Program, Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia.
Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia.
Nucleic Acids Res. 2025 Jan 7;53(1). doi: 10.1093/nar/gkae1316.
Cross-species single-cell RNA-seq data hold immense potential for unraveling cell type evolution and transferring knowledge between well-explored and less-studied species. However, challenges arise from interspecific genetic variation, batch effects stemming from experimental discrepancies, and inherent individual biological differences. Here, we benchmarked nine data-integration methods across 20 species, encompassing 4.7 million cells and spanning eight phyla and the entire animal taxonomic hierarchy. Our evaluation reveals notable differences between the methods in removing batch effects and preserving biological variance across taxonomic distances. Methods that effectively leverage gene sequence information capture underlying biological variance, while generative model-based approaches excel in batch effect removal. SATURN demonstrates robust performance across diverse taxonomic levels, from cross-genus to cross-phylum, emphasizing its versatility. SAMap excels in integrating species beyond the cross-family level, especially for atlas-level cross-species integration, while scGen performs best within or below the cross-class hierarchy. Our analysis thus offers recommendations and guidelines for selecting suitable integration methods, enhancing cross-species single-cell RNA-seq analyses and advancing algorithm development.