Multi-attribute scientific documents retrieval and ranking model based on GBDT and LR.

Authors

Tian Xuedong, Wang Jiameng, Wen Yu, Ma Hongyan

Affiliations

School of Cyber Security and Computer, Hebei University, Baoding 071002, China.

Hebei Machine Vision Engineering Research Center, Hebei University, Baoding 071002, China.

Publication information

Math Biosci Eng. 2022 Feb 10;19(4):3748-3766. doi: 10.3934/mbe.2022172.

Abstract

Scientific documents contain many mathematical expressions, along with text that carries mathematical semantics. Retrieving such documents by mathematical expressions alone or by text alone rarely meets retrieval needs; the real difficulty is to integrate mathematical expressions effectively with their related textual features. This study therefore proposes a multi-attribute retrieval and ranking model for scientific documents, based on GBDT (gradient boosting decision tree) and LR (logistic regression), that integrates the expressions and text contained in the documents. First, similarities are calculated for five attributes: mathematical expression symbols, mathematical expression sub-forms, mathematical expression context, scientific document keywords, and the frequency of mathematical expressions. Next, the GBDT model discretizes and reorganizes the five attributes. Finally, the reorganized features are fed into the LR model, which produces the final retrieval and ranking results. Experiments were carried out on the NTCIR dataset. For scientific document recall, the average MAP@20 was 81.92%; for scientific document ranking, the average nDCG@20 was 86.05%.
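The GBDT-then-LR step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses synthetic similarity scores and hypothetical relevance labels in place of NTCIR data, and interprets "discretize and reorganize" as the common practice of one-hot encoding the leaf that each tree assigns to a sample. The tree count, depth, and the helper names `leaf_indices` and `rank_documents` are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): each row is one (query, document) pair
# described by five similarity scores in [0, 1], one per attribute.
n_pairs = 600
X = rng.random((n_pairs, 5))
# Hypothetical relevance labels, loosely tied to the mean similarity.
y = (X.mean(axis=1) + rng.normal(0.0, 0.1, n_pairs) > 0.5).astype(int)

# Fit the GBDT and the LR on disjoint halves so the LR does not simply
# reuse rows the trees have already fitted.
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(
    X, y, test_size=0.5, random_state=0
)

gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)


def leaf_indices(model, features):
    """Return, for each sample, the index of the leaf it reaches in each tree."""
    # apply() has shape (n_samples, n_estimators, 1) for binary problems.
    return model.apply(features)[:, :, 0]


# One-hot encode the leaf indices: this is the discretized, recombined
# feature representation handed to the linear model.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(leaf_indices(gbdt, X_gbdt))

lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.transform(leaf_indices(gbdt, X_lr)), y_lr)


def rank_documents(candidate_features):
    """Return candidate row indices sorted by predicted relevance, and the scores."""
    encoded = encoder.transform(leaf_indices(gbdt, candidate_features))
    scores = lr.predict_proba(encoded)[:, 1]
    order = np.argsort(-scores)
    return order, scores


# Rank ten hypothetical candidate documents for one query.
candidates = rng.random((10, 5))
order, scores = rank_documents(candidates)
print(order)
```

Each tree partitions the five continuous similarities into regions, so the one-hot leaf vector acts as a set of learned feature crosses that a linear model such as LR can weight individually.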
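For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute MAP@k and nDCG@k. The abstract does not state which variants were used, so two choices here are assumptions: average precision is normalized by the number of relevant documents found in the top k, and DCG uses the linear gain with a log2 discount. The function names and toy relevance lists are illustrative only.

```python
import math


def average_precision_at_k(relevance, k=20):
    """AP@k for one ranked list of binary relevance labels (1 = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Assumed normalization: relevant documents found within the top k.
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(relevance_lists, k=20):
    """MAP@k: mean of AP@k over all queries."""
    if not relevance_lists:
        return 0.0
    total = sum(average_precision_at_k(r, k) for r in relevance_lists)
    return total / len(relevance_lists)


def dcg_at_k(gains, k=20):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(gains, k=20):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Toy example: two queries with binary relevance, one with graded relevance.
print(mean_average_precision_at_k([[1, 0, 1, 0], [0, 1]]))
print(ndcg_at_k([3, 2, 1]))
```

Both metrics reward placing relevant documents near the top of the list; nDCG additionally accounts for graded relevance, which is why it is typically reported for ranking quality.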

