

Why was this cited? Explainable machine learning applied to COVID-19 research literature.

Authors

Beranová Lucie, Joachimiak Marcin P, Kliegr Tomáš, Rabby Gollam, Sklenák Vilém

Affiliations

Department of Econometrics, Faculty of Informatics and Statistics, VSE Praha, W Churchill sq. 4, Prague, Czech Republic.

Environmental Genomics and Systems Biology Division at Lawrence Berkeley National Laboratory, Berkeley, USA.

Publication information

Scientometrics. 2022;127(5):2313-2349. doi: 10.1007/s11192-022-04314-9. Epub 2022 Apr 9.

Abstract

Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus, which contains research articles (mostly from biology and medicine) applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embeddings-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, Pubtator, and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by "black-box" machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered that referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus receive few citations. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus, both species linked to high citation counts harbor coronaviruses that are phylogenetically more similar to human SARS viruses. On the other hand, feline (FIPV, FCOV) and canine (CCOV) coronaviruses are in the alpha coronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include the detection of an apparent citation bias favouring authors with Western-sounding names. TF-IDF weights and a binary word incidence matrix performed equally well, with the latter resulting in better interpretability. The best predictive performance was obtained with a "black-box" method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target.
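
The abstract ranks rules by their lift measure. As a reminder of what that quantity captures, here is a minimal sketch of the standard definition, lift(A → C) = confidence(A → C) / support(C). The function name `rule_lift`, the toy article sets, and the `highly_cited` label are all invented for illustration; they are not data or code from the study.

```python
def rule_lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of item sets."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Count articles containing the antecedent, the consequent, and both.
    n_a = sum(antecedent <= t for t in transactions)
    n_c = sum(consequent <= t for t in transactions)
    n_ac = sum((antecedent | consequent) <= t for t in transactions)
    if n == 0 or n_a == 0 or n_c == 0:
        raise ValueError("lift is undefined if antecedent or consequent never occurs")
    confidence = n_ac / n_a      # P(consequent | antecedent)
    support_c = n_c / n          # P(consequent)
    return confidence / support_c


# Hypothetical toy corpus: each article is the set of entities detected in it,
# plus an assumed "highly_cited" label. Not taken from CORD-19.
articles = [
    {"DPP4", "camel", "highly_cited"},
    {"DPP4", "highly_cited"},
    {"feline"},
    {"canine"},
    {"bat", "highly_cited"},
    {"feline", "canine"},
]

print(rule_lift(articles, {"DPP4"}, {"highly_cited"}))    # 2.0
print(rule_lift(articles, {"feline"}, {"highly_cited"}))  # 0.0
```

A lift above 1 means the antecedent co-occurs with the consequent more often than would be expected if the two were independent; a lift below 1 means less often.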

