Annu Int Conf IEEE Eng Med Biol Soc. 2022 Jul;2022:4445-4448. doi: 10.1109/EMBC48229.2022.9871846.
Biomedical text classification requires training examples labeled by clinical specialists, which is a costly process. To address this problem, active learning incrementally selects a subset of the most informative unlabeled examples, which are then labeled and used to train a given classifier, with the aim of reducing the number of labeled samples required. However, active learning leaves the remaining unlabeled examples unused, and incorporating semi-supervised techniques that exploit them could improve the representativeness of the data and the discriminatory power of the classifiers. This work proposes a generic semi-supervised learning framework for improving active learning and reducing the number of labeled training examples in biomedical text classification. The proposed framework combines manually annotated training examples selected by active learning with pseudo-labels obtained from a trained classifier. To evaluate the framework, three biomedical datasets with textual information on obesity and smoking habits were used across different classification algorithms. The classification results show that the proposed framework can reduce the number of training examples manually labeled by clinical specialists by 10% without affecting the performance of the classifiers. This performance is attributable to the ability of the classifiers to correctly select and label the training examples. Clinical relevance: We demonstrate the effectiveness of the proposed semi-supervised learning framework in reducing the manual labeling effort required from clinical specialists when training classifiers on biomedical texts.
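The combination of active learning and pseudo-labeling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the classifier, the query strategy, or the pseudo-label acceptance rule, so several details here are assumptions: a tiny naive Bayes model stands in for "a given classifier", least-confidence sampling is used as the query strategy, and a fixed confidence `threshold` decides which predictions become pseudo-labels. The names `train_nb`, `predict_proba`, `run_framework`, and `oracle` are hypothetical, and the toy texts are invented for illustration only.

```python
import math
from collections import Counter


def train_nb(texts, labels):
    """Fit a tiny multinomial naive Bayes model (stand-in for any classifier)."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for c in classes for w in word_counts[c]}
    return classes, word_counts, doc_counts, vocab


def predict_proba(model, text):
    """Return a dict mapping each class to its posterior probability."""
    classes, word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = []
    for c in classes:
        total_words = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / total_docs)
        for w in text.lower().split():
            if w in vocab:  # skip words never seen during training
                # Laplace (add-one) smoothing
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores.append(score)
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    norm = sum(exps)
    return {c: e / norm for c, e in zip(classes, exps)}


def run_framework(seed, pool, oracle, rounds=3, batch=1, threshold=0.9):
    """Each round: query the least confident examples, pseudo-label confident ones."""
    texts = [t for t, _ in seed]
    labels = [y for _, y in seed]
    pool = list(pool)
    n_manual = n_pseudo = 0
    for _ in range(rounds):
        if not pool:
            break
        model = train_nb(texts, labels)
        scored = []
        for t in pool:
            probs = predict_proba(model, t)
            guess = max(probs, key=probs.get)
            scored.append((probs[guess], guess, t))
        scored.sort(key=lambda s: s[0])  # least confident first
        next_pool = []
        for rank, (conf, guess, t) in enumerate(scored):
            if rank < batch:
                # Active learning step: ask the specialist (oracle) for a label.
                texts.append(t)
                labels.append(oracle(t))
                n_manual += 1
            elif conf >= threshold:
                # Semi-supervised step (assumed rule): trust confident predictions.
                texts.append(t)
                labels.append(guess)
                n_pseudo += 1
            else:
                next_pool.append(t)
        pool = next_pool
    return train_nb(texts, labels), n_manual, n_pseudo


# Toy data, invented for illustration; `truth` plays the clinical specialist.
seed = [("patient is obese with high bmi", "obese"),
        ("normal weight and healthy bmi", "not_obese")]
truth = {"obese patient high bmi noted": "obese",
         "healthy normal weight": "not_obese",
         "bmi recorded today": "not_obese",
         "morbidly obese": "obese",
         "weight is normal": "not_obese"}
pool = list(truth)
model, n_manual, n_pseudo = run_framework(seed, pool, truth.__getitem__)
print(n_manual, "manual labels,", n_pseudo, "pseudo-labels")
```

In this sketch, accepted pseudo-labels are added to the training set permanently. That is one possible design choice. Re-estimating pseudo-labels in every round would be another, and the abstract does not say which the authors used.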