Zhou Xiaochi, Nurkowski Daniel, Mosbach Sebastian, Akroyd Jethro, Kraft Markus
Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
CMCL Innovations, Sheraton House, Castle Park, Castle Street, Cambridge CB3 0AX, U.K.
J Chem Inf Model. 2021 Aug 23;61(8):3868-3880. doi: 10.1021/acs.jcim.1c00275. Epub 2021 Aug 2.
This paper describes the implementation and evaluation of a proof-of-concept Question Answering (QA) system for accessing chemical data from knowledge graphs (KGs) that offer data ranging from chemical kinetics to the chemical and physical properties of species. We trained question classification and named entity recognition models that specialize in interpreting chemistry questions. The system has a novel design that applies a topic model to identify the question-to-ontology affiliation, which allows it to handle ontologies with different structures. The topic model also helps the system provide higher-quality answers. Moreover, a new method that automatically generates training questions from ontologies is implemented. The question set generated for training contains 432,989 questions of 11 types. This training set proved effective for training both the question classification model and the named entity recognition model. We evaluated the system using other KGQA systems as baselines. The system outperforms the chosen KGQA system in answering chemistry-related questions. The QA system is also compared to the Google search engine and the WolframAlpha engine. The comparison shows that the QA system can answer certain types of questions better than these engines.
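The abstract does not say how training questions are generated from the ontologies. As a minimal sketch of one common approach, assuming template filling, the Python example below combines hypothetical question templates with species and attribute labels of the kind that could be read from an ontology. The template strings, the question type name, the `generate_questions` function, and the sample labels are all illustrative assumptions and are not taken from the paper. Each generated record carries a question type label (usable for training a question classifier) and the entity strings it contains (usable for training a named entity recognizer).

```python
from itertools import product

# Hypothetical templates, keyed by an illustrative question type.
# The paper reports 11 question types; this one type is an assumption.
TEMPLATES = {
    "attribute_of_species": [
        "what is the {attribute} of {species}",
        "show me the {attribute} of {species}",
    ],
}


def generate_questions(species, attributes, templates=TEMPLATES):
    """Fill every template with every (species, attribute) pair.

    Returns a list of dicts holding the question text, its type label,
    and the entity strings that were inserted into the template.
    """
    questions = []
    for qtype, patterns in templates.items():
        for pattern, s, a in product(patterns, species, attributes):
            questions.append(
                {
                    "text": pattern.format(species=s, attribute=a),
                    "type": qtype,
                    "entities": {"species": s, "attribute": a},
                }
            )
    return questions


# Sample labels standing in for names extracted from an ontology.
questions = generate_questions(
    species=["benzene", "methane"],
    attributes=["molar mass", "boiling point"],
)
for q in questions[:2]:
    print(q["type"], "|", q["text"])
```

Because the number of questions grows as templates × species × attributes, a modest number of templates applied to a large ontology can plausibly yield a training set of hundreds of thousands of questions, which is the scale the abstract reports.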