Clarke Daniel J B, Marino Giacomo B, Ma'ayan Avi
Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029 USA.
bioRxiv. 2025 Jun 2:2025.05.30.657124. doi: 10.1101/2025.05.30.657124.
Trained on large datasets, foundation models can capture complex patterns in the data and produce embeddings that are useful for a variety of applications. Here we created a gene set foundation model (GSFM) trained on a massive collection of unlabeled gene sets from two databases: Rummagene and RummaGEO. Rummagene automatically extracts gene sets from the supplemental tables of publications, and RummaGEO contains gene sets automatically computed by comparing groups of samples from RNA-seq studies deposited in the Gene Expression Omnibus (GEO). Several foundation model architectures and training data sources were benchmarked on the task of predicting gene function. These predictions were also compared with those of other state-of-the-art gene function prediction methods and models. One of the GSFM architectures outperforms all other methods and models. This model was used to systematically predict gene functions for all human genes. These predictions are served on gene pages accessible from https://gsfm.maayanlab.cloud.