Attention-Based Models for Classifying Small Data Sets Using Community-Engaged Research Protocols: Classification System Development and Validation Pilot Study.

Author information

Ferrell Brian J, Raskin Sarah E, Zimmerman Emily B, Timberline David H, McInnes Bridget T, Krist Alex H

Affiliations

Center for Community Engagement and Impact, Virginia Commonwealth University, Richmond, VA, United States.

L Douglas Wilder School of Government and Public Affairs, Virginia Commonwealth University, Richmond, VA, United States.

Publication information

JMIR Form Res. 2022 Sep 6;6(9):e32460. doi: 10.2196/32460.

DOI:10.2196/32460

PMID:36066925

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC9490525/

Abstract

BACKGROUND

Community-engaged research (CEnR) is a research approach in which scholars partner with community organizations or individuals with whom they share an interest in the study topic, typically with the goal of supporting that community's well-being. CEnR is well-established in numerous disciplines including the clinical and social sciences. However, universities experience challenges reporting comprehensive CEnR metrics, limiting the development of appropriate CEnR infrastructure and the advancement of relationships with communities, funders, and stakeholders.

OBJECTIVE

We propose a novel approach to identifying and categorizing community-engaged studies by applying attention-based deep learning models to human participants protocols that have been submitted to the university's institutional review board (IRB).

METHODS

We manually classified a sample of 280 protocols submitted to the IRB using a 3- and 6-level CEnR heuristic. We then trained an attention-based bidirectional long short-term memory unit (Bi-LSTM) on the classified protocols and compared it to transformer models such as Bidirectional Encoder Representations From Transformers (BERT), Bio + Clinical BERT, and Cross-lingual Language Model-Robustly Optimized BERT Pre-training Approach (XLM-RoBERTa). We applied the best-performing models to the full sample of unlabeled IRB protocols submitted in the years 2013-2019 (n>6000).
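To make the "attention-based" part of the Bi-LSTM concrete, the following is a minimal sketch of attention pooling over a sequence of hidden states. It is a generic dot-product formulation written in plain Python, not the authors' architecture: the abstract does not specify the attention variant, and the names `attention_pool`, `hidden_states`, and `query` as well as the toy numbers are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Collapse T hidden vectors into one context vector.

    hidden_states: list of T vectors, each standing in for the concatenated
                   forward/backward state of a Bi-LSTM at one token (assumed).
    query:         a vector of the same size, standing in for a learned
                   attention parameter (assumed).
    """
    # One relevance score per token: dot product with the query vector.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    # Normalize the scores into weights that sum to 1.
    weights = softmax(scores)
    # Weighted sum of the hidden states, dimension by dimension.
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return context, weights

# Toy input: 3 tokens, 4-dimensional states (illustrative values only).
hidden_states = [
    [0.1, 0.0, 0.2, 0.1],
    [0.9, 0.8, 0.7, 0.9],
    [0.0, 0.1, 0.0, 0.2],
]
query = [1.0, 1.0, 1.0, 1.0]
context, weights = attention_pool(hidden_states, query)
```

In a classifier of this kind, the resulting `context` vector would typically be passed to a final layer that outputs one score per CEnR level; the weights indicate which tokens contributed most to the pooled representation.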

RESULTS

Transfer learning was superior: all transformer models implemented achieved an evaluation F1 score of 0.9952, compared with scores between 48% and 80% for the attention-based Bi-LSTM. However, there were key issues with overfitting. This finding is consistent across several methodological adjustments: an augmented data set with and without cross-validation, an unaugmented data set with and without cross-validation, a 6-class CEnR spectrum, and a 3-class one.
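As a reminder of how an F1 score for a multiclass problem can be computed, here is a minimal sketch in plain Python. It assumes macro averaging (the unweighted mean of per-class F1 scores); the abstract does not state which averaging scheme was used, and the function names and toy labels below are hypothetical.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is 0 (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)

# Toy 3-class example (labels are illustrative, not study data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
score = macro_f1(y_true, y_pred)
```

Computing the same score on the training data and on held-out data gives a simple check: a large gap between the two is a common sign of overfitting.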

CONCLUSIONS

Transfer learning is a more viable method than the attention-based Bi-LSTM for classifying small data sets characterized by the idiosyncrasies and variability of CEnR descriptions used by principal investigators in research protocols. Despite the overfitting issues, BERT and the other transformer models captured patterns in our data that the attention-based Bi-LSTM model did not, offering a more realistic path toward solving this real-world application.
