Yang Xiaodong, Liu Guole, Feng Guihai, Bu Dechao, Wang Pengfei, Jiang Jie, Chen Shubai, Yang Qinmeng, Miao Hefan, Zhang Yiyang, Man Zhenpeng, Liang Zhongming, Wang Zichen, Li Yaning, Li Zheng, Liu Yana, Tian Yao, Liu Wenhao, Li Cong, Li Ao, Dong Jingxi, Hu Zhilong, Fang Chen, Cui Lina, Deng Zixu, Jiang Haiping, Cui Wentao, Zhang Jiahao, Yang Zhaohui, Li Handong, He Xingjian, Zhong Liqun, Zhou Jiaheng, Wang Zijian, Long Qingqing, Xu Ping, Wang Hongmei, Meng Zhen, Wang Xuezhi, Wang Yangang, Wang Yong, Zhang Shihua, Guo Jingtao, Zhao Yi, Zhou Yuanchun, Li Fei, Liu Jing, Chen Yiqiang, Yang Ge, Li Xin
State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China.
Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.
Cell Res. 2024 Dec;34(12):830-845. doi: 10.1038/s41422-024-01034-y. Epub 2024 Oct 8.
Deciphering universal gene regulatory mechanisms in diverse organisms holds great potential for advancing our knowledge of fundamental life processes and facilitating clinical applications. However, the traditional research paradigm primarily focuses on individual model organisms and does not integrate various cell types across species. Recent breakthroughs in single-cell sequencing and deep learning techniques present an unprecedented opportunity to address this challenge. In this study, we built an extensive dataset of over 120 million human and mouse single-cell transcriptomes. After data preprocessing, we obtained 101,768,420 single-cell transcriptomes and developed a knowledge-informed cross-species foundation model, named GeneCompass. During pre-training, GeneCompass effectively integrated four types of prior biological knowledge in a self-supervised manner to enhance our understanding of gene regulatory mechanisms. By fine-tuning for multiple downstream tasks, GeneCompass outperformed state-of-the-art models in diverse applications for a single species and unlocked new realms of cross-species biological investigations. We also employed GeneCompass to search for key factors associated with cell fate transition and showed that the predicted candidate genes could successfully induce the differentiation of human embryonic stem cells into the gonadal fate. Overall, GeneCompass demonstrates the advantages of using artificial intelligence technology to decipher universal gene regulatory mechanisms and shows tremendous potential for accelerating the discovery of critical cell fate regulators and candidate drug targets.