Multi-attribute scientific documents retrieval and ranking model based on GBDT and LR.

Authors

Tian Xuedong, Wang Jiameng, Wen Yu, Ma Hongyan

Affiliations

School of Cyber Security and Computer, Hebei University, Baoding 071002, China.

Hebei Machine Vision Engineering Research Center, Hebei University, Baoding 071002, China.

Publication information

Math Biosci Eng. 2022 Feb 10;19(4):3748-3766. doi: 10.3934/mbe.2022172.

DOI:10.3934/mbe.2022172

PMID:35341272

Abstract

Scientific documents contain many mathematical expressions as well as text that carries mathematical semantics. Retrieving such documents using mathematical expressions alone or text alone rarely meets retrieval needs; the real difficulty is to integrate mathematical expressions effectively with the related textual features. This study therefore proposes a multi-attribute scientific document retrieval and ranking model based on GBDT (gradient boosting decision tree) and LR (logistic regression) that integrates the expressions and text contained in scientific documents. First, similarities are calculated for five attributes: mathematical expression symbols, mathematical expression sub-forms, mathematical expression context, scientific document keywords, and the frequency of mathematical expressions. Next, the GBDT model is used to discretize and reorganize the five attributes. Finally, the reorganized features are fed into the LR model to obtain the final retrieval and ranking results. Experiments were carried out on the NTCIR dataset. The average MAP@20 for scientific document recall was 81.92%, and the average nDCG@20 for scientific document ranking was 86.05%.
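The abstract does not spell out how the GBDT stage "discretizes and reorganizes" the five attributes. One common construction, assumed here and not confirmed as the authors' exact method, is to map each sample to the leaf it reaches in every boosted tree, one-hot encode those leaf indices, and train LR on the resulting sparse features. The sketch below uses scikit-learn and synthetic data; the feature values, labels, hyperparameters, and the helper names `leaf_indices` and `rank_documents` are illustrative assumptions only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): each row is one (query, document) pair
# described by five similarity scores in [0, 1], in the order symbols,
# sub-forms, context, keywords, expression frequency. Labels are made up.
X = rng.random((400, 5))
y = (X.mean(axis=1) + 0.1 * rng.standard_normal(400) > 0.5).astype(int)

# Fit the trees and the LR on disjoint halves to limit overfitting.
X_gbdt, y_gbdt = X[:200], y[:200]
X_lr, y_lr = X[200:], y[200:]

gbdt = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)


def leaf_indices(model, features):
    """Return, for each sample, the index of the leaf reached in every tree."""
    # apply() has shape (n_samples, n_estimators, 1) for binary problems.
    return model.apply(features)[:, :, 0]


# Discretization step: each tree contributes one categorical feature (its
# leaf), which is one-hot encoded. Leaves unseen during fit encode as zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(leaf_indices(gbdt, X_gbdt))

lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.transform(leaf_indices(gbdt, X_lr)), y_lr)


def rank_documents(features, top_k=20):
    """Score candidates with GBDT leaves + LR and return the top_k by score."""
    encoded = encoder.transform(leaf_indices(gbdt, features))
    scores = lr.predict_proba(encoded)[:, 1]
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Rank 50 hypothetical candidate documents for one query.
candidates = rng.random((50, 5))
order, scores = rank_documents(candidates)
```

Here `order` holds the indices of the 20 highest-scoring candidates and `scores` their LR probabilities in descending order. A cutoff of 20 mirrors the @20 metrics reported in the abstract.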
