Expansion of medical vocabularies using distributional semantics on Japanese patient blogs.

Authors

Ahltorp Magnus, Skeppstedt Maria, Kitajima Shiho, Henriksson Aron, Rzepka Rafal, Araki Kenji

Affiliations

, Stockholm, Sweden.

Department of Computer Science, Linnaeus University/Gavagai, Växjö/Stockholm, Sweden.

Publication information

J Biomed Semantics. 2016 Sep 26;7(1):58. doi: 10.1186/s13326-016-0093-x.

DOI:10.1186/s13326-016-0093-x
PMID:27671202
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5037651/

Abstract

BACKGROUND

Research on medical vocabulary expansion from large corpora has primarily been conducted using text written in English or similar languages, due to the limited availability of large biomedical corpora in most languages. Medical vocabularies are, however, also essential for text mining from corpora written in languages other than English and belonging to a variety of medical genres. The aim of this study was therefore to evaluate medical vocabulary expansion using a corpus very different from those previously used, in terms of grammar and orthography, as well as in terms of text genre. This was carried out by applying a method based on distributional semantics to the task of extracting medical vocabulary terms from a large corpus of Japanese patient blogs.

METHODS

Distributional properties of terms were modelled with random indexing, followed by agglomerative hierarchical clustering of 3 × 100 seed terms from existing vocabularies, belonging to three semantic categories: Medical Finding, Pharmaceutical Drug and Body Part. By automatically extracting unknown terms close to the centroids of the created clusters, candidates for new terms to include in the vocabulary were suggested. The method was evaluated for its ability to retrieve the remaining n terms in existing medical vocabularies.
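
The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the dimensionality, the number of non-zero entries per index vector, the centroid-linkage merge rule, the size cap, and the romanized toy corpus are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random
from collections import defaultdict

DIM = 200      # dimensionality of the semantic space (assumed value)
NONZERO = 8    # number of +1/-1 entries per index vector (assumed value)


def index_vector(word):
    """Sparse ternary random vector, seeded by the word so it is reproducible."""
    rng = random.Random(word)
    vec = [0.0] * DIM
    for k, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1.0 if k % 2 == 0 else -1.0
    return vec


def build_context_vectors(sentences, window=1, drop=frozenset()):
    """Random indexing: add the index vectors of all neighbours found within
    `window` tokens on each side of a target to that target's context vector."""
    context = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        tokens = [t for t in tokens if t not in drop]  # e.g. case particles
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec = context[target]
                    for d, value in enumerate(index_vector(tokens[j])):
                        vec[d] += value
    return dict(context)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_seeds(seeds, vectors, max_size=2):
    """Agglomerative clustering (centroid linkage, an assumed choice): keep
    merging the two most similar clusters whose joint size fits `max_size`."""
    clusters = [[seed] for seed in seeds]
    while True:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if len(clusters[a]) + len(clusters[b]) > max_size:
                    continue
                sim = cosine(centroid([vectors[w] for w in clusters[a]]),
                             centroid([vectors[w] for w in clusters[b]]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        if best is None:
            return clusters
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)


def rank_candidates(seeds, vectors, max_size=2):
    """Rank unknown terms by cosine similarity to their nearest cluster centroid."""
    clusters = cluster_seeds(seeds, vectors, max_size)
    centroids = [centroid([vectors[w] for w in c]) for c in clusters]
    known = set(seeds)
    scored = [(max(cosine(vec, c) for c in centroids), word)
              for word, vec in vectors.items() if word not in known]
    return [word for _, word in sorted(scored, reverse=True)]


# Invented toy corpus (romanized); "ga" and "wo" stand in for case particles.
toy_sentences = [
    ["kinou", "zutsuu", "ga", "hidoi"],
    ["kinou", "hakike", "ga", "hidoi"],
    ["kinou", "memai", "ga", "hidoi"],
    ["kawaii", "neko", "wo", "mita"],
]
toy_vectors = build_context_vectors(toy_sentences, window=1, drop={"ga", "wo"})
toy_ranking = rank_candidates(["zutsuu", "hakike"], toy_vectors, max_size=2)
```

In this toy setting the unknown term `memai` shares its contexts with the two seed terms, so it lands at the top of `toy_ranking`. The `window` and `drop` arguments correspond to the context window size and case-particle removal that the study varies.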

RESULTS

Removing case particles and using a context window size of 1+1 was a successful strategy for Medical Finding and Pharmaceutical Drug, while retaining case particles and using a window size of 8+8 was better for Body Part. For a 10n long candidate list, the use of different cluster sizes affected the result for Pharmaceutical Drug, while the effect was only marginal for the other two categories. For a list of top n candidates for Body Part, however, clusters with a size of up to two terms were slightly more useful than larger clusters. For Pharmaceutical Drug, the best settings resulted in a recall of 25 % for a candidate list of top n terms and a recall of 68 % for top 10n. For a candidate list of top 10n candidates, the second best results were obtained for Medical Finding: a recall of 58 %, compared to 46 % for Body Part. Only taking the top n candidates into account, however, resulted in a recall of 23 % for Body Part, compared to 16 % for Medical Finding.
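
The recall figures above are computed over candidate lists of length n and 10n, where n is the number of held-out vocabulary terms. A minimal sketch of that measure follows, assuming a ranked candidate list without duplicates; the function name `recall_at` and the example terms are hypothetical and not taken from the paper.

```python
def recall_at(ranked, held_out, multiple=1):
    """Share of held-out terms found among the top (multiple * n) candidates,
    where n is the number of held-out terms."""
    held_out = set(held_out)
    if not held_out:
        raise ValueError("held_out must contain at least one term")
    cutoff = multiple * len(held_out)
    hits = sum(1 for term in ranked[:cutoff] if term in held_out)
    return hits / len(held_out)


# Invented example: 40 ranked candidates and 4 held-out terms, one of which
# ("t50") never appears in the candidate list.
ranked_terms = ["t%d" % i for i in range(40)]
held_out_terms = ["t0", "t5", "t12", "t50"]

recall_top_n = recall_at(ranked_terms, held_out_terms, multiple=1)     # 0.25
recall_top_10n = recall_at(ranked_terms, held_out_terms, multiple=10)  # 0.75
```

Only `t0` falls inside the top 4 candidates, while the top 40 also contain `t5` and `t12`, which is why a longer candidate list gives higher recall at the cost of more candidates to review.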

CONCLUSIONS

Different settings for corpus pre-processing, window sizes and cluster sizes were suitable for different semantic categories and for different lengths of candidate lists, showing the need to adapt parameters, not only to the language and text genre used, but also to the semantic category for which the vocabulary is to be expanded. The results show, however, that the investigated choices for pre-processing and parameter settings were successful, and that a Japanese blog corpus, which in many ways differs from those used in previous studies, can be a useful resource for medical vocabulary expansion.
