Yamagiwa Hiroaki, Hashimoto Ryoma, Arakane Kiwamu, Murakami Ken, Soeda Shou, Oyama Momose, Zhu Yihua, Okada Mariko, Shimodaira Hidetoshi
Kyoto University, Kyoto, Japan.
Recruit Co., Ltd., Tokyo, Japan.
Sci Rep. 2025 May 18;15(1):17240. doi: 10.1038/s41598-025-01418-z.
Natural language processing is used in a wide range of fields, and it typically transforms the words in a text into feature vectors called embeddings. BioConceptVec is a set of embeddings tailored to biology, trained on approximately 30 million PubMed abstracts using models such as skip-gram. Word embeddings are generally known to solve analogy tasks through simple vector arithmetic. For example, subtracting the vector for "man" from that of "king" and then adding the vector for "woman" yields a point that lies close to "queen" in the embedding space. In this study, we demonstrate that BioConceptVec embeddings, along with our own embeddings trained on PubMed abstracts, contain information about drug-gene relations and can predict the target genes of a given drug through analogy computations. We also show that categorizing drugs and genes by biological pathway improves performance. Furthermore, using datasets divided by year, we show that vectors derived from relations known in the past can predict relations that were unknown at that time. Although analogy tasks are implemented as simple vector additions, our approach achieved performance comparable to that of large language models such as GPT-4 in predicting drug-gene relations.
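The analogy computation described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy vectors, the placeholder names (drug_a, gene_a, and so on), and the helpers cosine and nearest are hypothetical and do not come from BioConceptVec or from the authors' code. The sketch also assumes that the drug-to-gene relation is represented by the mean of the gene-minus-drug differences over known pairs, which is one plausible reading of "vectors derived from known relations".

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(query, embeddings, exclude=()):
    """Return the key whose vector is most similar to `query`."""
    candidates = {k: v for k, v in embeddings.items() if k not in exclude}
    return max(candidates, key=lambda k: cosine(query, candidates[k]))

# Toy word vectors (hypothetical values, chosen so the analogy holds exactly).
words = {
    "man":    np.array([1.0, 0.0, 0.0]),
    "woman":  np.array([1.0, 1.0, 0.0]),
    "king":   np.array([1.0, 0.0, 1.0]),
    "queen":  np.array([1.0, 1.0, 1.0]),
    "prince": np.array([1.0, 0.0, 0.8]),
}

# king - man + woman, excluding the three query words from the search.
query = words["king"] - words["man"] + words["woman"]
analogy = nearest(query, words, exclude={"king", "man", "woman"})
print(analogy)

# Toy drug and gene vectors (placeholder names, not real entities).
drugs = {
    "drug_a": np.array([1.0, 0.0, 0.0, 0.0]),
    "drug_b": np.array([0.0, 1.0, 0.0, 0.0]),
    "drug_c": np.array([0.0, 0.0, 1.0, 0.0]),
}
genes = {
    "gene_a": np.array([1.0, 0.0, 0.0, 1.0]),
    "gene_b": np.array([0.0, 1.0, 0.0, 1.0]),
    "gene_c": np.array([0.0, 0.0, 1.0, 1.0]),
}

# Known drug-gene pairs used to estimate the relation vector.
known_pairs = [("drug_a", "gene_a"), ("drug_b", "gene_b")]

# Assumption: the relation vector is the mean of (gene - drug) over known pairs.
relation = np.mean([genes[g] - drugs[d] for d, g in known_pairs], axis=0)

# Predict the target gene of an unseen drug by adding the relation vector
# and searching for the nearest gene embedding.
predicted = nearest(drugs["drug_c"] + relation, genes)
print(predicted)
```

With real embeddings the same two steps would apply, but the search would run over the full gene vocabulary and the top few candidates, not only the single nearest one, would usually be inspected.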