Wang Pengfei, Liu Wenhao, Wang Jiajia, Liu Yana, Li Pengjiang, Xu Ping, Cui Wentao, Zhang Ran, Long Qingqing, Hu Zhilong, Fang Chen, Dong Jingxi, Zhang Chunyang, Chen Yan, Wang Chengrui, Liu Guole, Xie Hanyu, Zhang Yiyang, Xiao Meng, Chen Shubai, Jiang Haiping, Chen Yiqiang, Yang Ge, Zhang Shihua, Meng Zhen, Wang Xuezhi, Feng Guihai, Li Xin, Zhou Yuanchun
Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100083, China.
University of Chinese Academy of Sciences, Beijing, 100864, China.
Adv Sci (Weinh). 2025 Jul;12(25):e2500870. doi: 10.1002/advs.202500870. Epub 2025 May 2.
Emerging single-cell sequencing technologies have generated large amounts of data, enabling analysis of cellular dynamics and gene regulation at single-cell resolution. Advances in artificial intelligence enhance life sciences research by delivering critical insights and optimizing data analysis. However, inconsistent data processing quality and standards remain a major challenge. Here we propose scCompass, a comprehensive resource for building a large-scale, multi-species, model-friendly single-cell data collection. By applying standardized data pre-processing, scCompass integrates and curates transcriptomic data from nearly 105 million single cells across 13 species. This extensive dataset makes it possible to identify stable expression genes (SEGs) and organ-specific expression genes (OSGs) in humans and mice. Datasets at different scales are provided that can be easily adapted for AI model training, together with pretrained checkpoints of state-of-the-art single-cell foundation models. In summary, scCompass is a highly efficient and scalable AI-ready database that, combined with user-friendly data sharing, visualization, and online analysis, greatly simplifies data access and exploitation for researchers in single-cell biology (http://www.bdbe.cn/kun).
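As an illustration of what identifying stable expression genes can involve, the sketch below flags genes whose normalized expression varies little across cells and that are detected in most cells. The abstract does not describe the scCompass procedure, so this is not the authors' method: the function name `stable_gene_mask`, the coefficient-of-variation criterion, and the thresholds `max_cv` and `min_detect_frac` are all assumptions chosen for the example.

```python
import numpy as np

def stable_gene_mask(counts, max_cv=0.5, min_detect_frac=0.9):
    """Return a boolean mask over genes that look stably expressed.

    counts: cells x genes matrix of non-negative expression counts.
    Hypothetical criterion (not taken from the scCompass paper): low
    coefficient of variation of log-normalized expression, plus detection
    in a large fraction of cells.
    """
    counts = np.asarray(counts, dtype=float)

    # Library-size normalization to 10,000 counts per cell, then log1p.
    lib_size = counts.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0  # avoid division by zero for empty cells
    norm = np.log1p(counts / lib_size * 1e4)

    # Per-gene coefficient of variation; genes with zero mean get infinity.
    mean = norm.mean(axis=0)
    std = norm.std(axis=0)
    cv = np.divide(std, mean, out=np.full_like(mean, np.inf), where=mean > 0)

    # Fraction of cells in which each gene is detected at all.
    detect_frac = (counts > 0).mean(axis=0)

    return (cv <= max_cv) & (detect_frac >= min_detect_frac)
```

On a toy matrix in which one gene is expressed in only a single cell, that gene fails the detection threshold while the uniformly expressed genes pass. A real analysis would apply such a criterion per tissue or per dataset and check agreement across them, a step this sketch leaves out.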