Ding Juncheng, Dahal Shailesh, Adhikari Bijaya, Jha Kishlay
University of North Texas, Denton, TX, USA.
University of Iowa, Iowa City, IA, USA.
Proc (IEEE Int Conf Healthc Inform). 2024 Jun;2024:285-293. doi: 10.1109/ichi61247.2024.00044. Epub 2024 Aug 22.
Hypothesis generation (HG) is a fundamental problem in biomedical text mining that uncovers plausible implicit links (B terms) between two disjoint concepts of interest (A and C terms). Over the past decade, many HG approaches based on distributional statistics, graph-theoretic measures, and supervised machine learning methods have been proposed. Despite these significant advances, existing approaches have two major limitations. First, they mainly focus on enumerating hypotheses and often neglect to rank them in a semantically meaningful way. This wastes time and resources, as researchers may focus on hypotheses that are ultimately not supported by experimental evidence. Second, existing approaches are designed to rank hypotheses with only one intermediate or evidence term (referred to as simple hypotheses) and thus cannot handle hypotheses with multiple intermediate terms (referred to as complex hypotheses). This is limiting because recent research has shown that complex hypotheses could be of greater practical value than simple ones, especially in the early stages of scientific discovery. To address these issues, we propose a new HG ranking approach that leverages the expressive power of Graph Neural Networks (GNNs) coupled with a domain knowledge-guided Noise-Contrastive Estimation (NCE) strategy to effectively rank both simple and complex biomedical hypotheses. Specifically, the message-passing capabilities of GNNs allow our approach to capture the rich interactions between biomedical entities and succinctly handle complex hypotheses with a variable number of intermediate terms. Moreover, the proposed domain knowledge-guided NCE strategy enables the ranking of complex hypotheses based on their coherence with established biomedical knowledge.
Extensive experimental results on five recognized biomedical datasets show that the proposed approach consistently outperforms existing baselines and prioritizes hypotheses worthy of potential clinical trials.
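The abstract does not give the model's equations, so the following is only an illustrative sketch of the two ingredients it names, not the authors' implementation. It assumes a hypothesis is a path of node indices (two end concepts with any number of intermediate terms between them), uses a single round of mean-aggregation message passing along that path, and scores the pooled result with a linear layer. The loss is the negative-sampling form of NCE. All function names, the weight vector `w`, and the way noise hypotheses are supplied are hypothetical. In particular, the domain knowledge-guided selection of noise hypotheses is not modeled here.

```python
import numpy as np


def message_pass(node_feats, path):
    """One round of mean aggregation along a path graph (assumed architecture).

    Each node on the path averages its own features with those of its
    immediate neighbours on the path.
    """
    h = node_feats[path]                      # shape (n, d)
    out = np.empty_like(h)
    n = len(path)
    for i in range(n):
        lo, hi = max(0, i - 1), min(n, i + 2)
        out[i] = h[lo:hi].mean(axis=0)        # self + path neighbours
    return out


def score(node_feats, path, w):
    """Score a hypothesis of any length: mean-pool the path, then project."""
    pooled = message_pass(node_feats, path).mean(axis=0)
    return float(pooled @ w)


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)


def nce_loss(pos_score, neg_scores):
    """Negative-sampling form of NCE for one observed hypothesis
    and a set of noise hypotheses."""
    neg = np.asarray(neg_scores, dtype=float)
    return float(-log_sigmoid(pos_score) - log_sigmoid(-neg).sum())


def rank_hypotheses(node_feats, paths, w):
    """Return the candidate paths sorted from highest to lowest score."""
    return sorted(paths, key=lambda p: score(node_feats, p, w), reverse=True)
```

Because the pooling step averages over however many nodes the path contains, the same scorer handles simple hypotheses (one intermediate term) and complex ones (several intermediate terms) without any change in shape.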