Impact of COVID-19 research: a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph.

Affiliations

L3S Research Center, Leibniz University Hannover, Hanover, Germany.

Department of Information and Knowledge Engineering, Prague University of Economics and Business, nám. Winstona Churchilla 1938/4, 120 00, Prague, Czech Republic.

Publication information

J Biomed Semantics. 2023 Nov 28;14(1):18. doi: 10.1186/s13326-023-00298-4.

Abstract

Multiple studies have investigated bibliometric features and uncategorized scholarly documents for the influential scholarly document prediction task. In this paper, we describe work that goes beyond bibliometric metadata to predict influential scholarly documents, and that also examines the prediction task over categorized scholarly documents. We introduce a new approach that enhances the document representation with a domain-independent knowledge graph in order to find influential scholarly documents using categorized scholarly content. As the input collection, we use the WHO corpus of scholarly documents on COVID-19. We examine several document representation methods for machine learning: TF-IDF, bag-of-words (BOW), and embedding-based language models (BERT). Of these, TF-IDF works best. Among the machine learning methods tested, logistic regression outperformed the others for scholarly document category classification. The random forest algorithm obtained the best results for influential scholarly document prediction over categorized content when the document representation was enhanced with a domain-independent knowledge graph, specifically DBpedia. For this enhancement, we combine state-of-the-art machine learning methods with the BOW document representation, extended with the direct type (RDF type) and unqualified relation from DBpedia. In this experiment, the enhanced document representation had no impact on scholarly document category classification, but it did have an effect on influential scholarly document prediction with categorical data.
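To make the idea of enhancing a BOW representation with knowledge-graph types more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small hard-coded lookup table (`TOY_ENTITY_TYPES`, hypothetical) in place of a real DBpedia query, and the function names and the `TYPE=` feature prefix are likewise illustrative assumptions. Each token that matches a known entity contributes extra features for its direct RDF types.

```python
import re
from collections import Counter

# ASSUMPTION: a toy stand-in for a DBpedia lookup. A real system would
# resolve entity mentions and fetch their direct rdf:type values from DBpedia.
TOY_ENTITY_TYPES = {
    "covid-19": ["dbo:Disease"],
    "remdesivir": ["dbo:Drug", "dbo:ChemicalSubstance"],
    "wuhan": ["dbo:City"],
}

TOKEN_RE = re.compile(r"[a-z0-9][a-z0-9\-]*")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def bow(text):
    """Plain bag-of-words: token -> raw count."""
    return Counter(tokenize(text))


def enhanced_bow(text, entity_types=TOY_ENTITY_TYPES):
    """Bag-of-words plus one extra feature per direct type of each known entity.

    The type feature is counted once per occurrence of the entity token.
    """
    counts = bow(text)
    # Iterate over a snapshot, since type features are added to `counts`.
    for token, n in list(counts.items()):
        for type_label in entity_types.get(token, []):
            counts["TYPE=" + type_label] += n
    return counts


doc = "Remdesivir trials for COVID-19 began after COVID-19 spread from Wuhan."
features = enhanced_bow(doc)
```

The resulting feature counts could then be vectorized and passed to a classifier such as a random forest. Prefixing type features keeps them from colliding with ordinary word tokens; this is one possible design choice rather than something the abstract specifies.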

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bb20/10683290/09acd7df7e70/13326_2023_298_Fig1_HTML.jpg
