Department of Computer Science, University of California, Davis, Davis, CA, 95616, USA; Genome Center, University of California, Davis, Davis, CA, 95616, USA; USDA/NSF AI Institute for Next Generation Food Systems, Davis, CA, 95616, USA.
Genome Center, University of California, Davis, Davis, CA, 95616, USA; USDA/NSF AI Institute for Next Generation Food Systems, Davis, CA, 95616, USA.
Comput Biol Med. 2024 Oct;181:109072. doi: 10.1016/j.compbiomed.2024.109072. Epub 2024 Aug 30.
Automated generation of knowledge graphs that accurately capture published information can help with knowledge organization and access, which has the potential to accelerate discovery and innovation. Here, we present an integrated pipeline to construct a large-scale knowledge graph using large language models in an active learning setting. We apply our pipeline to the association of raw food, ingredients, and chemicals, a domain that lacks such knowledge resources. Trained through an iterative active learning approach over ten consecutive cycles on 4120 manually curated premise-hypothesis pairs, the entailment model extracted 230,848 food-chemical composition relationships from 155,260 scientific papers, of which 106,082 (46.0%) had never been reported in any published database. To augment the knowledge graph, we further incorporated information from 5 external databases and ontology sources. We then applied a link prediction model to identify putative food-chemical relationships that were not part of the constructed knowledge graph. Validation of the 443 hypotheses generated by the link prediction model yielded 355 new food-chemical relationships, and the model score correlated well (R = 0.70) with the probability of a novel finding. This work demonstrates how automated learning from literature at scale can accelerate discovery and support practical applications through reproducible, evidence-based capture of latent interactions among diverse entities, such as food and chemicals.
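The iterative active learning setting described above can be illustrated with a minimal sketch. The abstract does not state how pairs were chosen for annotation in each cycle, so the uncertainty-sampling rule used here (pick the pairs whose predicted entailment probability is closest to 0.5) is an assumption, not the authors' method. The names `select_most_uncertain`, `active_learning`, `train_and_score`, and `annotate` are hypothetical; `train_and_score` stands in for fine-tuning and scoring an entailment model, and `annotate` for manual curation.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (premise, hypothesis); hypothetical representation


def select_most_uncertain(probs: Sequence[float], batch_size: int) -> List[int]:
    """Return indices of the `batch_size` probabilities closest to 0.5.

    Assumption: uncertainty sampling; the source does not name a strategy.
    """
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:batch_size]


def active_learning(
    pool: Sequence[Pair],
    train_and_score: Callable[[List[Tuple[Pair, bool]], List[Pair]], List[float]],
    annotate: Callable[[Pair], bool],
    rounds: int,
    batch_size: int,
) -> List[Tuple[Pair, bool]]:
    """Run up to `rounds` cycles of score -> select -> annotate -> retrain."""
    labeled: List[Tuple[Pair, bool]] = []
    unlabeled: List[Pair] = list(pool)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Hypothetical stand-in: train on `labeled`, then score `unlabeled`.
        probs = train_and_score(labeled, unlabeled)
        picked = select_most_uncertain(probs, batch_size)
        picked_set = set(picked)
        # Send the selected pairs to a curator and add them to the training set.
        labeled.extend((unlabeled[i], annotate(unlabeled[i])) for i in picked)
        # Remove the newly labeled pairs from the unlabeled pool.
        unlabeled = [p for i, p in enumerate(unlabeled) if i not in picked_set]
    return labeled
```

In such a loop, the number of cycles and the batch size together determine the annotation budget; the figures reported above (ten cycles, 4120 curated pairs) would be passed in as parameters rather than fixed in the code.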