Zhang Heming, Huang Di, Chen Emily, Cao Dekang, Xu Tim, Dizdar Ben, Li Guangfu, Chen Yixin, Payne Philip, Province Michael, Li Fuhai
Institute for Informatics, Data Science and Biostatistics (I2DB), Washington University School of Medicine.
Department of Pediatrics, Washington University School of Medicine, Washington University in St. Louis, St. Louis, MO, USA.
bioRxiv. 2024 Aug 6:2024.08.01.606222. doi: 10.1101/2024.08.01.606222.
Generative pretrained models represent a significant advance in natural language processing and computer vision: pretrained on large general datasets and fine-tuned for specific tasks, they can generate coherent and contextually relevant content. Building foundation models on large-scale omic data is a promising way to decode and understand the complex signaling language patterns within cells. In contrast to existing foundation models of omic data, we build a foundation model for multi-omic signaling (mos) graphs, in which the multi-omic data are integrated and interpreted using a multi-level signaling graph. The model was pretrained on multi-omic data of cancers from The Cancer Genome Atlas (TCGA) and fine-tuned on multi-omic data of Alzheimer's Disease (AD). The experimental evaluation showed that the model not only improves disease classification accuracy but is also interpretable, uncovering disease targets and signaling interactions. The model code is available on GitHub: https://github.com/mosGraph/mosGraphGPT.