Duke University, Durham, USA.
Department of Emergency Medicine, Massachusetts General Hospital, Boston, USA.
Sci Rep. 2024 Jan 29;14(1):2419. doi: 10.1038/s41598-024-52233-x.
Scientific research is driven by the allocation of funding to research projects, based in part on the predicted scientific impact of the work. Data-driven algorithms can inform decisions about allocating scarce funding by using bibliometrics to identify studies likely to become high-impact. Rather than relying on standardized citation-based metrics alone, we use a machine learning pipeline that analyzes high-dimensional relationships among a range of bibliometric features to improve the accuracy of predicting high-impact research. Random forest classification models were trained on 28 bibliometric features calculated from a dataset of 1,485,958 publications in medicine to retrospectively predict whether a publication would become high-impact. For each random forest model, the balanced accuracy score was above 0.95 and the area under the receiver operating characteristic curve was above 0.99. The high performance of the proposed models in predicting high-impact research shows that machine learning is a promising tool for supporting funding decisions in medical research.
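The two reported metrics, balanced accuracy and area under the ROC curve, are most informative when the positive class (here, high-impact publications) is rare. As an illustration only, the following minimal sketch computes both from their standard definitions on a small invented example. The labels, scores, and 0.5 decision threshold are hypothetical and are not taken from the study, and the code is not the authors' pipeline.

```python
# Illustrative sketch: standard definitions of balanced accuracy and ROC AUC.
# All data below are invented for demonstration; nothing here comes from the paper.

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 2 high-impact (1) and 6 other (0) publications,
# with made-up classifier scores in [0, 1].
labels = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.05]

# Assumed decision threshold of 0.5 for turning scores into class predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]

plain_accuracy = sum(1 for t, p in zip(labels, preds) if t == p) / len(labels)
print("plain accuracy:   ", round(plain_accuracy, 3))
print("balanced accuracy:", round(balanced_accuracy(labels, preds), 3))
print("ROC AUC:          ", round(roc_auc(labels, scores), 3))
```

On this toy data, plain accuracy (0.75) exceeds balanced accuracy (about 0.667) because the majority class dominates the plain average, while the AUC (about 0.917) summarizes ranking quality independently of any threshold.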