Predicting drug-gene relations via analogy tasks with word embeddings.

Authors

Yamagiwa Hiroaki, Hashimoto Ryoma, Arakane Kiwamu, Murakami Ken, Soeda Shou, Oyama Momose, Zhu Yihua, Okada Mariko, Shimodaira Hidetoshi

Affiliations

Kyoto University, Kyoto, Japan.

Recruit Co., Ltd., Tokyo, Japan.

Publication information

Sci Rep. 2025 May 18;15(1):17240. doi: 10.1038/s41598-025-01418-z.

DOI:10.1038/s41598-025-01418-z
PMID:40383732
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12086191/
Abstract

Natural language processing is utilized in a wide range of fields, where words in text are typically transformed into feature vectors called embeddings. BioConceptVec is a specific example of embeddings tailored for biology, trained on approximately 30 million PubMed abstracts using models such as skip-gram. Generally, word embeddings are known to solve analogy tasks through simple vector arithmetic. For example, subtracting the vector for man from that of king and then adding the vector for woman yields a point that lies closer to queen in the embedding space. In this study, we demonstrate that BioConceptVec embeddings, along with our own embeddings trained on PubMed abstracts, contain information about drug-gene relations and can predict target genes from a given drug through analogy computations. We also show that categorizing drugs and genes using biological pathways improves performance. Furthermore, we illustrate that vectors derived from known relations in the past can predict unknown future relations in datasets divided by year. Despite the simplicity of implementing analogy tasks as vector additions, our approach demonstrated performance comparable to that of large language models such as GPT-4 in predicting drug-gene relations.
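The analogy computation described in the abstract can be illustrated with a minimal sketch. It assumes that a drug-to-gene relation vector is estimated as the mean of (gene minus drug) over known pairs, and that candidate genes are ranked by cosine similarity to the drug vector plus that relation vector. The abstract only states that the task reduces to vector addition, so the mean-difference estimator, the cosine ranking, the function names, and the three-dimensional toy embeddings below are all illustrative assumptions rather than the authors' implementation or real BioConceptVec data.

```python
import numpy as np

# Toy embeddings (hypothetical values, NOT BioConceptVec vectors).
emb = {
    "drug_A": np.array([1.0, 0.0, 0.0]),
    "drug_B": np.array([0.0, 1.0, 0.0]),
    "drug_C": np.array([0.7, 0.7, 0.0]),
    "gene_A": np.array([1.0, 0.1, 0.9]),
    "gene_B": np.array([0.1, 1.0, 1.1]),
    "gene_C": np.array([0.6, 0.8, 1.0]),
}


def unit(v):
    """Scale a vector to unit length so dot products become cosines."""
    return v / np.linalg.norm(v)


def relation_vector(emb, known_pairs):
    """Assumed estimator: mean of (gene - drug) over known drug-gene pairs."""
    diffs = [emb[gene] - emb[drug] for drug, gene in known_pairs]
    return np.mean(diffs, axis=0)


def predict_genes(emb, drug, rel, candidates, top_k=1):
    """Rank candidate genes by cosine similarity to (drug + relation)."""
    query = unit(emb[drug] + rel)
    scores = {g: float(np.dot(query, unit(emb[g]))) for g in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# "Known" relations used to estimate the relation vector.
known_pairs = [("drug_A", "gene_A"), ("drug_B", "gene_B")]
rel = relation_vector(emb, known_pairs)

# Predict the target of a drug that was held out of the known pairs.
genes = ["gene_A", "gene_B", "gene_C"]
prediction = predict_genes(emb, "drug_C", rel, genes)
print(prediction)
```

In this toy setup the held-out drug_C is mapped to gene_C, mirroring the king - man + woman ≈ queen example: the shared offset between drugs and their target genes is learned from the known pairs and reused for an unseen drug.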


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2aba/12086191/cf531207777d/41598_2025_1418_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2aba/12086191/5505a1987525/41598_2025_1418_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2aba/12086191/216c038b9183/41598_2025_1418_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2aba/12086191/5abb485e034b/41598_2025_1418_Fig4_HTML.jpg

Similar articles

1
Predicting drug-gene relations via analogy tasks with word embeddings.
Sci Rep. 2025 May 18;15(1):17240. doi: 10.1038/s41598-025-01418-z.
2
BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale.
PLoS Comput Biol. 2020 Apr 23;16(4):e1007617. doi: 10.1371/journal.pcbi.1007617. eCollection 2020 Apr.
3
Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases.
BMC Med Inform Decis Mak. 2018 Jul 23;18(Suppl 2):65. doi: 10.1186/s12911-018-0630-x.
4
A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
5
Semantic Deep Learning: Prior Knowledge and a Type of Four-Term Embedding Analogy to Acquire Treatments for Well-Known Diseases.
JMIR Med Inform. 2020 Aug 6;8(8):e16948. doi: 10.2196/16948.
6
Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks.
PLoS One. 2021 Oct 15;16(10):e0258623. doi: 10.1371/journal.pone.0258623. eCollection 2021.
7
Improved biomedical word embeddings in the transformer era.
J Biomed Inform. 2021 Aug;120:103867. doi: 10.1016/j.jbi.2021.103867. Epub 2021 Jul 18.
8
Utility of General and Specific Word Embeddings for Classifying Translational Stages of Research.
AMIA Annu Symp Proc. 2018 Dec 5;2018:1405-1414. eCollection 2018.
9
Domain specific word embeddings for natural language processing in radiology.
J Biomed Inform. 2021 Jan;113:103665. doi: 10.1016/j.jbi.2020.103665. Epub 2020 Dec 15.
10
Fine-Tuning Word Embeddings for Hierarchical Representation of Data Using a Corpus and a Knowledge Base for Various Machine Learning Applications.
Comput Math Methods Med. 2021 Nov 16;2021:9761163. doi: 10.1155/2021/9761163. eCollection 2021.
