
Unsupervised SapBERT-based bi-encoders for medical concept annotation of clinical narratives with SNOMED CT.

Author Information

Abdulnazar Akhila, Roller Roland, Schulz Stefan, Kreuzthaler Markus

Affiliations

Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria.

CBmed GmbH - Center for Biomarker Research in Medicine, Graz, Austria.

Publication Information

Digit Health. 2024 Oct 21;10:20552076241288681. doi: 10.1177/20552076241288681. eCollection 2024 Jan-Dec.

Abstract

OBJECTIVE

Clinical narratives provide comprehensive patient information. Achieving interoperability requires mapping the relevant details to standardized medical vocabularies. Natural language processing typically divides this task into named entity recognition (NER) and medical concept normalization (MCN). State-of-the-art results require supervised setups with abundant training data; however, annotated data are scarce because of data sensitivity and the time needed for manual labeling. This study addressed the need for unsupervised medical concept annotation (MCA) to overcome these limitations and to support the creation of annotated datasets.

METHOD

We use an unsupervised SapBERT-based bi-encoder model to analyze n-grams from narrative text and measure their similarity to SNOMED CT concepts, followed by a syntactic re-ranker. For evaluation, we use the semantic tags of the SNOMED CT candidates to assess the NER phase and their concept IDs to assess the MCN phase. The approach is evaluated on both English and German narratives.
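The following is a minimal, self-contained sketch of the retrieval step described above, not the authors' released code: it embeds candidate word n-grams and a toy list of SNOMED CT concept labels with a publicly available SapBERT checkpoint and ranks spans against concepts by cosine similarity. The model name, the maximum n-gram length, and the two example concepts are assumptions for illustration; the syntactic re-ranking stage and the full SNOMED CT concept inventory are omitted.

```python
# Sketch only (assumptions flagged below), not the authors' implementation.
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed English SapBERT checkpoint; a multilingual checkpoint would be needed for German.
MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def embed(texts):
    """Return L2-normalised [CLS] embeddings, as commonly used with SapBERT."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = model(**batch).last_hidden_state[:, 0, :]
    return torch.nn.functional.normalize(cls, dim=-1)

def ngrams(tokens, max_n=5):
    """All word n-grams (candidate mention spans) up to length max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

# Toy SNOMED CT candidate space: (concept ID, preferred term) pairs, illustrative only.
concepts = [("22298006", "Myocardial infarction"), ("38341003", "Hypertensive disorder")]
concept_vecs = embed([term for _, term in concepts])

sentence = "Patient with acute myocardial infarction and known hypertension."
spans = ngrams(sentence.rstrip(".").split())
span_vecs = embed(spans)

# Cosine similarity = dot product of normalised vectors; keep the best concept per span.
sims = span_vecs @ concept_vecs.T
best = sims.max(dim=1)
for span, score, idx in sorted(zip(spans, best.values.tolist(), best.indices.tolist()),
                               key=lambda x: -x[1])[:5]:
    print(f"{score:.3f}  {span!r} -> {concepts[idx]}")
```

In a full pipeline one would typically pre-embed all SNOMED CT descriptions once, store them in a nearest-neighbour index, and pass the top-ranked candidates per span to the syntactic re-ranker; those engineering choices are not detailed in this abstract.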

RESULT

Without training data, our unsupervised approach achieves F1 scores of 0.765 in English and 0.557 in German for MCN. Evaluation at the semantic tag level shows that "disorder" has the highest F1 scores, 0.871 and 0.648 on the English and German datasets, respectively. Furthermore, for the semantic tag "disorder", the MCA approach achieves F1 scores of 0.839 (NER) and 0.696 (MCN) in English and 0.685 (NER) and 0.437 (MCN) in German.
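For reference, the F1 values reported here are the standard harmonic mean of precision and recall, computed per phase: a candidate counts as correct at the NER phase when its semantic tag matches the reference annotation, and at the MCN phase when its SNOMED CT concept ID matches; the exact span-matching criterion is not restated in this abstract.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```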

CONCLUSION

This unsupervised approach demonstrates potential for initial annotation (pre-labeling) in manual annotation tasks. While promising for certain semantic tags, challenges remain, including false positives, contextual errors, and variability of clinical language, requiring further fine-tuning.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f5d2/11531008/17fafd3054de/10.1177_20552076241288681-fig1.jpg
