Evaluation of input data modality choices on functional gene embeddings.

Authors

Brechtmann Felix, Bechtler Thibault, Londhe Shubhankar, Mertes Christian, Gagneur Julien

Affiliations

TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.

Munich Center for Machine Learning, Munich, Germany.

Publication information

NAR Genom Bioinform. 2023 Nov 2;5(4):lqad095. doi: 10.1093/nargab/lqad095. eCollection 2023 Dec.

DOI:10.1093/nargab/lqad095
PMID:37942285
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10629286/
Abstract

Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms on various data types including quantitative omics measurements, protein-protein interaction networks and literature. However, downstream evaluations comparing alternative data modalities used to construct functional gene embeddings have been lacking. Here we benchmarked functional gene embeddings obtained from various data modalities for predicting disease-gene lists, cancer drivers, phenotype-gene associations and scores from genome-wide association studies. Off-the-shelf predictors trained on precomputed embeddings matched or outperformed dedicated state-of-the-art predictors, demonstrating their high utility. Embeddings based on literature and protein-protein interactions inferred from low-throughput experiments outperformed embeddings derived from genome-wide experimental data (transcriptomics, deletion screens and protein sequence) when predicting curated gene lists. In contrast, they did not perform better when predicting genome-wide association signals and were biased towards highly-studied genes. These results indicate that embeddings derived from literature and low-throughput experiments appear favourable in many existing benchmarks because they are biased towards well-studied genes and should therefore be considered with caution. Altogether, our study and precomputed embeddings will facilitate the development of machine-learning models in genetics and related fields.
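The benchmarking idea in the abstract, training an off-the-shelf predictor on precomputed gene embeddings and scoring how well it recovers a curated gene list, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the embeddings and labels are synthetic, and the predictor (a plain logistic regression), the metric (AUROC) and the stratified 5-fold split are stand-ins, since the abstract does not state which models, metrics or splits the authors used. All function names are hypothetical.

```python
import math
import random


def make_toy_data(n_genes=200, dim=8, seed=0):
    """Synthetic stand-in for precomputed gene embeddings and a curated gene list."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_genes):
        label = 1 if i % 5 == 0 else 0  # 20% of genes are "on the list"
        # Listed genes are shifted in every dimension so there is signal to learn.
        X.append([rng.gauss(1.0 if label else 0.0, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


def fit_logistic(X, y, epochs=200, lr=0.1):
    """Logistic regression fitted by batch gradient descent (illustrative predictor)."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, X):
    """Return raw linear scores; their ranking is all that AUROC needs."""
    return [b + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auroc(X, y, k=5, seed=0):
    """Mean held-out AUROC over k stratified folds."""
    rng = random.Random(seed)
    pos = [i for i, l in enumerate(y) if l == 1]
    neg = [i for i, l in enumerate(y) if l == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Deal positives and negatives round-robin so every fold contains both classes.
    folds = [pos[f::k] + neg[f::k] for f in range(k)]
    aucs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores = predict(w, b, [X[i] for i in held_out])
        aucs.append(auroc(scores, [y[i] for i in held_out]))
    return sum(aucs) / k


if __name__ == "__main__":
    X, y = make_toy_data()
    print(f"mean held-out AUROC: {cross_validated_auroc(X, y):.3f}")
```

Comparing data modalities in this setup would amount to repeating the same evaluation with a different embedding matrix per modality while keeping the labels and folds fixed, so that differences in the score reflect the embeddings rather than the split.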

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1942/10629286/c0c9b4c5bf82/lqad095fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1942/10629286/e12edd37d6d8/lqad095fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1942/10629286/6d8d2c4038cc/lqad095fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1942/10629286/a95cb76221e9/lqad095fig4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1942/10629286/a360bc48e8bb/lqad095fig5.jpg