The Systems Biology Institute, Tokyo, Japan.
SBX Corporation, Tokyo, Japan.
NPJ Syst Biol Appl. 2023 Dec 18;9(1):63. doi: 10.1038/s41540-023-00324-2.
Assessing the mutagenicity of chemicals is an essential task in the drug development process. Databases and other structured sources of Ames mutagenicity data exist, but they have been carefully and laboriously curated from scientific publications. As knowledge accumulates over time, keeping these databases up to date by hand is costly and impractical. In this paper, we introduce the problem of predicting the mutagenicity of chemicals from textual information in scientific publications. Put simply, given a chemical and natural-language evidence from publications that describe its mutagenicity, the goal of the model is to predict whether the chemical is potentially mutagenic. To this end, we first construct a gold standard data set and then propose MutaPredBERT, a prediction model obtained by fine-tuning BioLinkBERT on a question-answering formulation of the problem. By leveraging transfer learning with large transformer-based models, we achieve a Macro F1 score of >0.88 even with a relatively small amount of fine-tuning data. Our work establishes the utility of large language models for constructing structured knowledge bases directly from scientific publications.
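The question-answering formulation and the Macro F1 metric mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the question template, the function names `build_qa_input` and `macro_f1`, and the label encoding (1 for mutagenic, 0 for non-mutagenic) are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def build_qa_input(chemical: str, evidence: str) -> Tuple[str, str]:
    """Pair a question about a chemical with its textual evidence.

    The question wording is a hypothetical template, not taken from the paper.
    """
    question = f"Is {chemical} mutagenic?"
    return question, evidence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Hypothetical example pair; the evidence sentence is invented.
    question, context = build_qa_input(
        "compound X", "Compound X was positive in the Ames test."
    )
    print(question)
    print(context)

    # Assumed encoding: 1 = mutagenic, 0 = non-mutagenic.
    print(macro_f1([1, 1, 1, 0], [1, 1, 0, 0]))
```

Because Macro F1 averages the per-class scores without weighting by class frequency, the minority class counts as much as the majority class, which is why the metric is commonly reported for imbalanced binary labels.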