Division of Population Health and Genomics, Ninewells Hospital, School of Medicine, University of Dundee, Dundee, DD1 9SY, UK.
Exscientia Ltd, Dundee One, River Court, 5 West Victoria Dock Road, Dundee, DD1 3JT, UK.
Sci Rep. 2023 May 24;13(1):8366. doi: 10.1038/s41598-023-35597-4.
Most biomedical knowledge is published as text, making it challenging to analyse using traditional statistical methods. In contrast, machine-interpretable data primarily comes from structured property databases, which represent only a fraction of the knowledge present in the biomedical literature. Crucial insights and inferences can be drawn from these publications by the scientific community. We trained language models on literature from different time periods to evaluate their ranking of prospective gene-disease associations and protein-protein interactions. Using 28 distinct historical text corpora of abstracts published between 1995 and 2022, we trained independent Word2Vec models to prioritise associations that were likely to be reported in future years. This study demonstrates that biomedical knowledge can be encoded as word embeddings without the need for human labelling or supervision. Language models effectively capture drug discovery concepts such as clinical tractability, disease associations, and biochemical pathways. Additionally, these models can prioritise hypotheses years before their initial reporting. Our findings underscore the potential for extracting yet-to-be-discovered relationships through data-driven approaches, leading to generalised biomedical literature mining for potential therapeutic drug targets. The Publication-Wide Association Study (PWAS) enables the prioritisation of under-explored targets and provides a scalable system for accelerating early-stage target ranking, irrespective of the specific disease of interest.
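The ranking step described above can be illustrated with a minimal sketch. It assumes that each term (disease or gene) already has an embedding vector from a Word2Vec model trained on abstracts up to some cutoff year, and that candidates are scored by cosine similarity to the disease vector. The abstract does not state the scoring function, so cosine similarity is an assumption here. The function names (`cosine`, `rank_candidates`) and the toy vectors are hypothetical and stand in for real model output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(embeddings, disease, candidates, known=()):
    """Rank candidate genes by similarity to a disease term.

    Genes already reported for the disease (`known`) are skipped, so the
    ranking contains only prospective associations.
    """
    disease_vec = embeddings[disease]
    scored = [
        (gene, cosine(embeddings[gene], disease_vec))
        for gene in candidates
        if gene in embeddings and gene not in known
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors (hypothetical); a real model would supply
# vectors with hundreds of dimensions learned from the text corpus.
toy_embeddings = {
    "disease_x": [1.0, 0.0, 0.0],
    "gene_a": [0.9, 0.1, 0.0],
    "gene_b": [0.0, 1.0, 0.0],
    "gene_c": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(
    toy_embeddings,
    disease="disease_x",
    candidates=["gene_a", "gene_b", "gene_c"],
    known={"gene_c"},  # already reported, so excluded from the ranking
)

for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

To mimic the prospective evaluation, one would repeat this ranking with embeddings from each historical corpus and check where associations first reported in later years fall in the list.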