Nicholson David N, Himmelstein Daniel S, Greene Casey S
Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
Department of Biomedical Informatics, University of Colorado School of Medicine and Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, USA.
BioData Min. 2022 Oct 18;15(1):26. doi: 10.1186/s13040-022-00311-z.
Knowledge graphs support biomedical research by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple node and edge types practically infeasible. Thus, we sought to accelerate label function creation by evaluating how label functions can be reused across multiple edge types.
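To make the idea of a label function concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The label values, the `KNOWN_BINDS` set, the candidate dictionary layout, and every function name are hypothetical and chosen only for illustration. It shows one database-backed label function and two text heuristics, each of which votes on a candidate sentence or abstains.

```python
import re

# Vote values follow a common data-programming convention (an assumption here).
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

# Hypothetical stand-in for a curated database of Compound-binds-Gene pairs.
KNOWN_BINDS = {("imatinib", "ABL1"), ("gefitinib", "EGFR")}

BIND_PATTERN = re.compile(r"\b(binds?|bound|inhibitor of|ligand for)\b", re.IGNORECASE)
NEGATION_PATTERN = re.compile(r"\b(does not|did not|failed to)\s+bind", re.IGNORECASE)


def lf_in_database(candidate):
    """Database label function: vote positive if the pair is already curated."""
    pair = (candidate["compound"].lower(), candidate["gene"].upper())
    return POSITIVE if pair in KNOWN_BINDS else ABSTAIN


def lf_binding_keyword(candidate):
    """Text heuristic: vote positive if a binding keyword appears in the sentence."""
    return POSITIVE if BIND_PATTERN.search(candidate["sentence"]) else ABSTAIN


def lf_negation(candidate):
    """Text heuristic: vote negative if binding is explicitly negated."""
    return NEGATIVE if NEGATION_PATTERN.search(candidate["sentence"]) else ABSTAIN


LABEL_FUNCTIONS = [lf_in_database, lf_binding_keyword, lf_negation]


def annotate(candidate):
    """Return one vote per label function for a single candidate sentence."""
    return [lf(candidate) for lf in LABEL_FUNCTIONS]


if __name__ == "__main__":
    example = {
        "compound": "Imatinib",
        "gene": "ABL1",
        "sentence": "Imatinib binds ABL1 with high affinity.",
    }
    print(annotate(example))
```

Label functions may disagree or abstain on the same sentence, which is why their votes are typically combined by a downstream model rather than used directly as labels.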
We obtained entity-tagged abstracts and restricted the tagged entities to compound, gene, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how much adding edge-specific or edge-mismatch label function combinations improved over this baseline. Next, we trained a discriminative model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, whereas the edge-specific label functions used as a control did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicate physical relationships and showed signs of transferability. Across the scenarios tested, discriminative model performance depended strongly on the generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1.
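The sampling comparison and the recall estimate described above can be sketched roughly as follows. This is an illustrative simplification under stated assumptions, not the authors' pipeline: the function names, the use of plain strings as placeholders for label functions, and the sample sizes are all hypothetical. Each sampled set would be used to train a model whose score is then compared against the database-only baseline, and recall is taken as the fraction of established edges that the predictions recover.

```python
import random


def sample_label_function_sets(baseline, pool, sizes, n_samples, seed=0):
    """Yield (size, label_functions): the baseline plus a random subset of the pool.

    `baseline` holds the database-only label functions; `pool` holds the
    edge-specific or edge-mismatch label functions being evaluated.
    """
    rng = random.Random(seed)
    for size in sizes:
        for _ in range(n_samples):
            extra = rng.sample(pool, size)  # sampled without replacement
            yield size, list(baseline) + extra


def edge_recall(predicted_edges, known_edges):
    """Fraction of established edges that appear among the predicted edges."""
    known = set(known_edges)
    if not known:
        return 0.0
    return len(known & set(predicted_edges)) / len(known)


if __name__ == "__main__":
    # Placeholder names; real label functions would be callables.
    baseline = ["db_lookup"]
    pool = ["lf_a", "lf_b", "lf_c", "lf_d"]
    for size, lfs in sample_label_function_sets(baseline, pool, [1, 2], n_samples=2):
        print(size, lfs)

    known = {("C1", "G1"), ("C2", "G2"), ("C3", "G3"), ("C4", "G4")}
    predicted = {("C1", "G1"), ("C3", "G3"), ("C9", "G9")}
    print(edge_recall(predicted, known))
```

Drawing several random subsets per size gives a spread of scores at each size, so an improvement over the baseline can be separated from the luck of one particular combination.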
Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed: only label functions describing very similar edge types improved performance when transferred. We expect that continued development of this strategy may provide essential building blocks for populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.