Novartis Biomedical Research, 10675 John Jay Hopkins Drive, San Diego, CA, 92121, USA.
Department of Computer Science and Engineering, University of California, Riverside, 900 University Avenue, Riverside, CA, 92521, USA.
Nat Commun. 2024 Feb 29;15(1):1853. doi: 10.1038/s41467-024-46089-y.
Many machine learning applications in bioinformatics currently rely on matching gene identities when analyzing input gene signatures, and so fail to take advantage of preexisting knowledge about gene functions. To further enable comparative analysis of OMICS datasets, including target deconvolution and mechanism of action studies, we develop an approach that represents gene signatures by projecting them onto their biological functions rather than their identities, analogous to the way the word2vec technique embeds words in natural language processing. We develop the Functional Representation of Gene Signatures (FRoGS) approach by training a deep learning model and demonstrate that applying it to the Broad Institute's L1000 datasets yields more effective compound-target predictions than models based on gene identities alone. By integrating additional pharmacological activity data sources, FRoGS significantly increases the number of high-quality compound-target predictions relative to existing approaches, and many of these predictions are supported by in silico and/or experimental evidence. These results underscore the general utility of FRoGS in machine learning-based bioinformatics applications. Prediction networks pre-equipped with knowledge of gene functions may help uncover new relationships among gene signatures acquired by large-scale OMICS studies on compounds, cell types, disease models, and patient cohorts.
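The contrast between identity-based and function-based comparison can be illustrated with a minimal sketch. It is not the FRoGS model: the three-dimensional embeddings are invented for illustration, the axis labels are assumptions, and averaging gene vectors into a signature vector is a simple stand-in for whatever aggregation the trained network performs. The helper names `jaccard`, `signature_vector`, and `cosine` are hypothetical.

```python
import numpy as np

# Hypothetical functional embeddings. The values are invented for illustration
# and are NOT taken from the FRoGS model.
# Assumed axes: [cell cycle, inflammation, lipid metabolism]
GENE_EMBEDDINGS = {
    "CDK1":  np.array([0.9, 0.1, 0.0]),
    "CCNB1": np.array([0.8, 0.0, 0.1]),
    "CDK2":  np.array([0.9, 0.0, 0.1]),
    "CCNA2": np.array([0.7, 0.2, 0.0]),
    "IL6":   np.array([0.0, 0.9, 0.1]),
    "TNF":   np.array([0.1, 0.9, 0.0]),
}


def jaccard(sig_a, sig_b):
    """Identity-based similarity: fraction of gene symbols shared."""
    a, b = set(sig_a), set(sig_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signature_vector(signature):
    """Function-based representation: mean of the member genes' embeddings.

    Mean pooling is an assumption made for this sketch.
    """
    vectors = [GENE_EMBEDDINGS[g] for g in signature if g in GENE_EMBEDDINGS]
    if not vectors:
        raise ValueError("no gene in the signature has an embedding")
    return np.mean(vectors, axis=0)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Two cell-cycle signatures with no genes in common, plus an inflammatory one.
sig_a = ["CDK1", "CCNB1"]
sig_b = ["CDK2", "CCNA2"]
sig_c = ["IL6", "TNF"]

identity_ab = jaccard(sig_a, sig_b)
functional_ab = cosine(signature_vector(sig_a), signature_vector(sig_b))
functional_ac = cosine(signature_vector(sig_a), signature_vector(sig_c))

print(f"identity overlap A vs B:      {identity_ab:.2f}")
print(f"functional similarity A vs B: {functional_ab:.2f}")
print(f"functional similarity A vs C: {functional_ac:.2f}")
```

In this toy setting, signatures A and B share no gene symbols, so an identity-based measure treats them as unrelated, while their averaged functional vectors point in nearly the same direction. Signature C, built from inflammatory genes, remains distant from A under both measures.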