Why was this cited? Explainable machine learning applied to COVID-19 research literature.

Author information

Beranová Lucie, Joachimiak Marcin P, Kliegr Tomáš, Rabby Gollam, Sklenák Vilém

Affiliations

Department of Econometrics, Faculty of Informatics and Statistics, VSE Praha, W Churchill sq. 4, Prague, Czech Republic.

Environmental Genomics and Systems Biology Division at Lawrence Berkeley National Laboratory, Berkeley, USA.

Publication information

Scientometrics. 2022;127(5):2313-2349. doi: 10.1007/s11192-022-04314-9. Epub 2022 Apr 9.

DOI:10.1007/s11192-022-04314-9

PMID:35431364

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8993675/

Abstract

Multiple studies have investigated bibliometric factors predictive of the citation count a research article will receive. In this article, we go beyond bibliometric data by using a range of machine learning techniques to find patterns predictive of citation count using both article content and available metadata. As the input collection, we use the CORD-19 corpus, which contains research articles, mostly from biology and medicine, applicable to the COVID-19 crisis. Our study employs a combination of state-of-the-art machine learning techniques for text understanding, including the embedding-based language model BERT and several systems for detection and semantic expansion of entities: ConceptNet, Pubtator and ScispaCy. To interpret the resulting models, we use several explanation algorithms: random forest feature importance, LIME, and Shapley values. We compare the performance and comprehensibility of models obtained by "black-box" machine learning algorithms (neural networks and random forests) with models built with rule learning (CORELS, CBA), which are intrinsically explainable. Multiple rules were discovered that referred to biomedical entities of potential interest. Of the rules with the highest lift measure, several pointed to dipeptidyl peptidase 4 (DPP4), a known MERS-CoV receptor and a critical determinant of camel-to-human transmission of the camel coronavirus (MERS-CoV). Some other interesting patterns related to the type of animal investigated were found. Articles referring to bats and camels tend to draw citations, while articles referring to most other animal species related to coronavirus are cited less often. Bat coronavirus is the only other virus from a non-human species in the betaB clade along with the SARS-CoV and SARS-CoV-2 viruses. MERS-CoV is in a sister betaC clade, also close to human SARS coronaviruses. Thus both species linked to high citation counts harbor coronaviruses that are more phylogenetically similar to human SARS viruses. On the other hand, feline (FIPV, FCOV) and canine (CCOV) coronaviruses are in the alpha coronavirus clade and more distant from the betaB clade with human SARS viruses. Other results include detection of an apparent citation bias favouring authors with Western-sounding names. TF-IDF weights and a binary word incidence matrix performed equally well, with the latter resulting in better interpretability. The best predictive performance was obtained with a "black-box" method, a neural network. The rule-based models led to the most insights, especially when coupled with text representation using semantic entity detection methods. Follow-up work should focus on the analysis of citation patterns in the context of phylogenetic trees, as well as on patterns referring to DPP4, which is currently considered a SARS-CoV-2 therapeutic target.
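The abstract ranks rules by lift and reports that a binary word incidence matrix is easy to interpret. The following minimal sketch illustrates both ideas on invented toy data; it is not the authors' pipeline (which used CBA and CORELS on CORD-19). The helper names `incidence_matrix` and `rule_lift`, the example texts, and the labels are all hypothetical. Lift here is the standard ratio P(highly cited | term present) / P(highly cited), so values above 1 mean the term is associated with more citations than the base rate.

```python
from typing import Dict, List, Set


def incidence_matrix(docs: List[str], vocab: List[str]) -> List[Dict[str, int]]:
    """Binary word incidence: 1 if the term occurs in the document, else 0."""
    rows = []
    for text in docs:
        tokens: Set[str] = set(text.lower().split())
        rows.append({term: int(term in tokens) for term in vocab})
    return rows


def rule_lift(rows: List[Dict[str, int]], labels: List[int], term: str) -> float:
    """Lift of the rule `term present -> highly cited`.

    lift = P(high | term) / P(high)
    """
    covered = [lab for row, lab in zip(rows, labels) if row[term] == 1]
    if not covered or sum(labels) == 0:
        raise ValueError("rule covers no documents or no document is highly cited")
    confidence = sum(covered) / len(covered)   # P(high | term)
    base_rate = sum(labels) / len(labels)      # P(high)
    return confidence / base_rate


# Toy data, invented for illustration only (1 = highly cited, 0 = not).
docs = [
    "dpp4 receptor binding in camel cells",
    "bat coronavirus genome survey",
    "feline coronavirus case report",
    "canine coronavirus vaccine trial",
    "dpp4 expression and mers transmission",
    "feline dpp4 structure",
]
labels = [1, 1, 0, 0, 1, 0]
vocab = ["dpp4", "bat", "feline"]

rows = incidence_matrix(docs, vocab)
for term in vocab:
    print(term, round(rule_lift(rows, labels, term), 3))
```

Because each feature is a plain 0/1 flag, a rule such as "dpp4 present -> highly cited" can be read directly, whereas a TF-IDF weight would require interpreting a threshold on a continuous value.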
