Investigating the impact of weakly supervised data on text mining models of publication transparency: a case study on randomized controlled trials.

Affiliation

School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, USA.

Publication information

AMIA Jt Summits Transl Sci Proc. 2022 May 23;2022:254-263. eCollection 2022.

PMID:35854729

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9285178/

Abstract

The lack of large quantities of annotated data is a major barrier to developing effective text mining models of biomedical literature. In this study, we explored weak supervision to improve the accuracy of text classification models for assessing the methodological transparency of randomized controlled trial (RCT) publications. Specifically, we used Snorkel, a framework for programmatically building training sets, and UMLS-EDA, a data augmentation method that leverages a small number of labeled examples to generate new training instances, and assessed their effect on a BioBERT-based text classification model proposed for the task in previous work. Performance improvements due to weak supervision were limited and were surpassed by gains from hyperparameter tuning. Our analysis suggests that refining the weak supervision strategies to better handle the multi-label case could be beneficial. Our code and data are available at https://github.com/kilicogluh/CONSORT-TM/tree/master/weakSupervision.
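To make the idea of programmatically building a training set concrete, the following is a minimal sketch in plain Python. It is not the authors' code and does not use the Snorkel API: the labeling functions, keyword patterns, and label names are hypothetical, and votes are combined by simple majority rather than by the label model that Snorkel fits.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical keyword heuristics; the label names are illustrative only.
def lf_randomization(sentence):
    text = sentence.lower()
    hit = "randomly assigned" in text or "randomized" in text
    return "randomization" if hit else ABSTAIN


def lf_blinding(sentence):
    text = sentence.lower()
    hit = "blinded" in text or "double-blind" in text
    return "blinding" if hit else ABSTAIN


def lf_sample_size(sentence):
    text = sentence.lower()
    hit = "sample size" in text or "power calculation" in text
    return "sample_size" if hit else ABSTAIN


LABELING_FUNCTIONS = [lf_randomization, lf_blinding, lf_sample_size]


def weak_label(sentence):
    """Combine labeling-function votes into one weak label by majority vote."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no heuristic fired, so the sentence stays unlabeled
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    for s in [
        "Patients were randomly assigned to two groups.",
        "Outcome assessors were blinded to allocation.",
        "The weather was nice.",
    ]:
        print(weak_label(s), "<-", s)
```

Because this sketch keeps a single label per sentence, a sentence that matches several heuristics at once loses all but one of them, which is one simple way to see why the multi-label case needs separate handling.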
