Limitations of Transformers on Clinical Text Classification.

Publication Information

IEEE J Biomed Health Inform. 2021 Sep;25(9):3596-3607. doi: 10.1109/JBHI.2021.3062322. Epub 2021 Sep 3.

Abstract

Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures - a word-level convolutional neural network and a hierarchical self-attention network - and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT - pretraining and WordPiece tokenization - may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases rather than understanding the contextual meaning of sequences of text.
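
The abstract does not spell out the four scaling methods, but the obstacle they all address is BERT's fixed input window of 512 WordPiece tokens (roughly 400 words). As a minimal sketch of one common workaround, the snippet below splits a long document into overlapping chunks, runs BERT over each chunk, and averages the chunk logits. It assumes the Hugging Face transformers API and a generic bert-base-uncased checkpoint; the helper name classify_long_document and the mean-pooling aggregation are illustrative choices, not necessarily any of the paper's four methods.

```python
# A minimal sketch of chunk-and-aggregate long-document classification.
# Assumptions (not from the paper): Hugging Face transformers, a generic
# bert-base-uncased checkpoint, mean-pooled logits over chunks.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-base-uncased"  # a clinical checkpoint could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

def classify_long_document(text: str, max_len: int = 512, stride: int = 128) -> torch.Tensor:
    """Return class probabilities for a document longer than BERT's input limit."""
    # Tokenize the whole document without truncation, then window the token ids.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    window = max_len - 2   # reserve room for the [CLS] and [SEP] special tokens
    step = window - stride  # consecutive chunks overlap by `stride` tokens
    chunks = [ids[i:i + window] for i in range(0, len(ids), step)] or [[]]
    logits = []
    with torch.no_grad():
        for chunk in chunks:
            input_ids = torch.tensor(
                [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
            )
            logits.append(model(input_ids=input_ids).logits)
    # Aggregate chunk-level predictions by averaging logits over all chunks.
    return torch.softmax(torch.cat(logits).mean(dim=0), dim=-1)
```

Mean-pooling treats every chunk as equally informative; for labels that hinge on a few key words or phrases, as this paper's analysis suggests is common in clinical classification, max-pooling over chunk logits can be the better aggregation.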

Similar Articles

1. Limitations of Transformers on Clinical Text Classification.
IEEE J Biomed Health Inform. 2021 Sep;25(9):3596-3607. doi: 10.1109/JBHI.2021.3062322. Epub 2021 Sep 3.

10. Korean clinical entity recognition from diagnosis text using BERT.
BMC Med Inform Decis Mak. 2020 Sep 30;20(Suppl 7):242. doi: 10.1186/s12911-020-01241-8.

Cited By

10. Microblog discourse analysis for parenting style assessment.
Front Public Health. 2025 Feb 11;13:1505825. doi: 10.3389/fpubh.2025.1505825. eCollection 2025.

