Felix Brechtmann, Thibault Bechtler, Shubhankar Londhe, Christian Mertes, Julien Gagneur
TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
Munich Center for Machine Learning, Munich, Germany.
NAR Genom Bioinform. 2023 Nov 2;5(4):lqad095. doi: 10.1093/nargab/lqad095. eCollection 2023 Dec.
Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms to various data types, including quantitative omics measurements, protein-protein interaction networks and literature. However, downstream evaluations comparing the alternative data modalities used to construct functional gene embeddings have been lacking. Here we benchmarked functional gene embeddings obtained from various data modalities for predicting disease-gene lists, cancer drivers, phenotype-gene associations and scores from genome-wide association studies. Off-the-shelf predictors trained on precomputed embeddings matched or outperformed dedicated state-of-the-art predictors, demonstrating the high utility of the embeddings. Embeddings based on literature and on protein-protein interactions inferred from low-throughput experiments outperformed embeddings derived from genome-wide experimental data (transcriptomics, deletion screens and protein sequence) when predicting curated gene lists. In contrast, they did not perform better when predicting genome-wide association signals, and they were biased towards highly studied genes. These results indicate that embeddings derived from literature and low-throughput experiments appear favourable in many existing benchmarks because they are biased towards well-studied genes, and they should therefore be considered with caution. Altogether, our study and precomputed embeddings will facilitate the development of machine-learning models in genetics and related fields.
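The evaluation idea described in the abstract, training a generic predictor on precomputed gene embeddings and scoring it on a held-out gene list, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embeddings and labels are synthetic, the function names (`make_synthetic_embeddings`, `fit_centroid_scorer`, `evaluate_embedding`) are hypothetical, and a simple centroid-difference scorer stands in for whichever off-the-shelf predictors the study actually used. It is not the authors' benchmark.

```python
import random


def make_synthetic_embeddings(n_genes=200, dim=8, seed=0):
    """Return hypothetical gene embeddings and binary labels (1 = gene on a curated list)."""
    rng = random.Random(seed)
    embeddings, labels = [], []
    for i in range(n_genes):
        label = 1 if i % 4 == 0 else 0   # every fourth gene is a positive
        shift = 1.0 if label else 0.0    # positives are shifted along every dimension
        embeddings.append([rng.gauss(shift, 1.0) for _ in range(dim)])
        labels.append(label)
    return embeddings, labels


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def fit_centroid_scorer(X, y):
    """Fit a linear scoring direction: positive-class centroid minus negative-class centroid."""
    pos = centroid([x for x, label in zip(X, y) if label == 1])
    neg = centroid([x for x, label in zip(X, y) if label == 0])
    return [p - n for p, n in zip(pos, neg)]


def score(weights, x):
    """Project one embedding onto the fitted direction (higher = more list-like)."""
    return sum(w * xi for w, xi in zip(weights, x))


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_embedding(seed=0, n_train=140):
    """Train on the first n_train genes and report AUROC on the remaining genes."""
    X, y = make_synthetic_embeddings(seed=seed)
    weights = fit_centroid_scorer(X[:n_train], y[:n_train])
    test_scores = [score(weights, x) for x in X[n_train:]]
    return auroc(test_scores, y[n_train:])


if __name__ == "__main__":
    print(f"held-out AUROC on synthetic data: {evaluate_embedding():.3f}")
```

To compare data modalities in this style, one would swap the synthetic generator for embeddings loaded from each modality and keep the train/test split and metric fixed. Because the abstract reports a bias towards well-studied genes, a held-out evaluation like this one says little on its own about performance on poorly studied genes.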