IEEE J Biomed Health Inform. 2024 Aug;28(8):5007-5019. doi: 10.1109/JBHI.2024.3383610. Epub 2024 Aug 6.
In biomedical literature, biological pathways are commonly described through a combination of images and text. These pathways contain valuable information, including genes and their relationships, which provides insight into biological mechanisms and precision medicine. Curating pathway information across the literature enables the integration of this information to build a comprehensive knowledge base. While some studies have extracted pathway information from images and text independently, they often overlook the correspondence between the two modalities. In this paper, we present a pathway figure curation system named pathCLIP for identifying genes and gene relations from pathway figures. Our key innovation is the use of an image-text contrastive learning model to learn coordinated embeddings of image snippets and text descriptions of genes and gene relations, thereby improving curation. Our validation results, using pathway figures from PubMed, showed that our multimodal model outperforms models using only a single modality. Additionally, our system effectively curates genes and gene relations from multiple literature sources. Two case studies on extracting pathway information from the literature on non-small cell lung cancer and Alzheimer's disease further demonstrate the usefulness of our curated pathway information in enhancing related pathways in the KEGG database.
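The abstract names image-text contrastive learning as the key idea but gives no formulation. The following is a minimal sketch of one common form, a CLIP-style symmetric cross-entropy loss over cosine similarities; it is not taken from the pathCLIP paper. The function names, the temperature value of 0.07, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss (illustrative only).

    Pair i of image_embs and text_embs is assumed to be a matching
    image snippet and text description; all other pairs are negatives.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Similarity matrix: rows are images, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # rows are texts
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy embeddings (hypothetical): matched pairs point in similar directions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

In this toy setup the loss is lower when each image embedding lies closest to its own text embedding, which is the property a contrastive objective of this kind rewards.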