Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition.

Author information

Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China.

Department of Statistics, University of Michigan, Ann Arbor, MI, USA.

Publication information

J Biomed Inform. 2020 Oct;110:103542. doi: 10.1016/j.jbi.2020.103542. Epub 2020 Aug 24.

DOI:10.1016/j.jbi.2020.103542

PMID:32853795

Abstract

OBJECTIVE

This study aims at realizing unsupervised term discovery in Chinese electronic health records (EHRs) by using the word segmentation technique. The existing supervised algorithms do not perform satisfactorily in the case of EHRs, as annotated medical data are scarce. We propose an unsupervised segmentation method (GTS) based on the graph partition principle, whose multi-granular segmentation capability can help realize efficient term discovery.

METHODS

A sentence is converted to an undirected graph, with the edge weights based on n-gram statistics, and ratio cut is used to split the sentence into words. The graph partition is solved efficiently via dynamic programming, and multi-granularity is realized by setting different partition numbers. A BERT-based discriminator is trained using generated samples to verify the correctness of the word boundaries. The words that are not recorded in existing dictionaries are retained as potential medical terms.
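The partition step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact edge-weight formula, so the sketch assumes that the weight between two characters is the raw corpus count of the n-gram spanning them, and that words are contiguous character spans. The function names (`ngram_counts`, `edge_weights`, `ratio_cut_segment`) and the `max_n` parameter are hypothetical. Under these assumptions the ratio-cut objective is a sum of per-segment terms, so a dynamic program over segment end points finds the optimum, and changing the partition number `k` yields coarser or finer segmentations.

```python
from collections import Counter


def ngram_counts(corpus, max_n=4):
    """Count every character n-gram (2 <= n <= max_n) in a list of sentences."""
    counts = Counter()
    for sent in corpus:
        for n in range(2, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return counts


def edge_weights(sentence, counts, max_n=4):
    """Symmetric weight matrix: w[a][b] is the corpus count of sentence[a..b].

    Assumption for illustration only; the paper's weighting is not given
    in the abstract.
    """
    n = len(sentence)
    w = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, min(n, a + max_n)):
            w[a][b] = w[b][a] = float(counts[sentence[a:b + 1]])
    return w


def ratio_cut_segment(sentence, w, k):
    """Split `sentence` into k contiguous segments minimizing the ratio cut,
    i.e. the sum over segments of cut(segment, rest) / len(segment)."""
    n = len(sentence)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(sentence)")

    def cost(i, j):
        # Total weight of edges leaving segment [i, j), divided by its size.
        cut = sum(w[a][b] for a in range(i, j)
                  for b in range(n) if b < i or b >= j)
        return cut / (j - i)

    inf = float("inf")
    # dp[m][j]: best objective for splitting the first j characters into m segments.
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                if dp[m - 1][i] == inf:
                    continue
                candidate = dp[m - 1][i] + cost(i, j)
                if candidate < dp[m][j]:
                    dp[m][j] = candidate
                    back[m][j] = i

    # Walk the back-pointers from the end of the sentence to recover the words.
    words, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        words.append(sentence[i:j])
        j = i
    return words[::-1]


if __name__ == "__main__":
    # Toy corpus, invented purely to exercise the code.
    corpus = ["患者无发热", "患者无咳嗽", "无发热无咳嗽"]
    counts = ngram_counts(corpus)
    sentence = "患者无发热"
    w = edge_weights(sentence, counts)
    for k in (2, 3):  # a smaller k gives a coarser granularity
        print(k, ratio_cut_segment(sentence, w, k))
```

The sketch recomputes each segment's cut from scratch for clarity; prefix sums over the weight matrix would make the cost lookup constant time if long sentences matter.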

RESULTS

We compared the GTS approach with mature segmentation systems on both word segmentation and term discovery. MD students manually segmented Chinese EHRs at fine and coarse granularity levels and reviewed the term discovery results. The proposed unsupervised method outperformed all the competing algorithms in the word segmentation task. In term discovery, GTS outperformed the best baseline by 17 percentage points in F1-score (a 47% relative increase).
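The two figures quoted for term discovery measure different things: the absolute gap in F1 and that gap relative to the baseline. The abstract does not report the underlying scores, so the values below are only what the two rounded figures imply when taken at face value, with $F_{\mathrm{base}}$ and $F_{\mathrm{GTS}}$ as assumed symbols for the best-baseline and GTS F1-scores.

```latex
F_{\mathrm{GTS}} - F_{\mathrm{base}} \approx 0.17, \qquad
\frac{F_{\mathrm{GTS}} - F_{\mathrm{base}}}{F_{\mathrm{base}}} \approx 0.47
\;\Longrightarrow\;
F_{\mathrm{base}} \approx \frac{0.17}{0.47} \approx 0.36, \qquad
F_{\mathrm{GTS}} \approx 0.36 + 0.17 = 0.53
```

Because both inputs are rounded, these are approximate values rather than reported results.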

CONCLUSION

In the absence of annotated training data, the graph partition technique can effectively use the corpus statistics and even expert knowledge to realize unsupervised word segmentation of EHRs. Multi-granular segmentation can be used to provide potential medical terms of various lengths with high accuracy.

Similar articles

Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition.

J Biomed Inform. 2020 Oct;110:103542. doi: 10.1016/j.jbi.2020.103542. Epub 2020 Aug 24.

A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text.

BMC Med Inform Decis Mak. 2019 Apr 9;19(Suppl 2):66. doi: 10.1186/s12911-019-0770-7.

Deep Phenotyping of Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives With a Sequence Motif Discovery Tool: Algorithm Development and Validation.

J Med Internet Res. 2022 Jun 3;24(6):e37213. doi: 10.2196/37213.

Named Entity Recognition in Chinese Clinical Text Using Deep Neural Network.

Stud Health Technol Inform. 2015;216:624-8.

Use of word and graph embedding to measure semantic relatedness between Unified Medical Language System concepts.

J Am Med Inform Assoc. 2020 Oct 1;27(10):1538-1546. doi: 10.1093/jamia/ocaa136.

Leveraging Multi-source knowledge for Chinese clinical named entity recognition via relational graph convolutional network.

J Biomed Inform. 2022 Apr;128:104035. doi: 10.1016/j.jbi.2022.104035. Epub 2022 Feb 23.

An unsupervised method for histological image segmentation based on tissue cluster level graph cut.

Comput Med Imaging Graph. 2021 Oct;93:101974. doi: 10.1016/j.compmedimag.2021.101974. Epub 2021 Aug 21.

Measuring the effect of different types of unsupervised word representations on Medical Named Entity Recognition.

Int J Med Inform. 2019 Sep;129:100-106. doi: 10.1016/j.ijmedinf.2019.05.022. Epub 2019 Jun 5.

On the unsupervised analysis of domain-specific Chinese texts.

Proc Natl Acad Sci U S A. 2016 May 31;113(22):6154-9. doi: 10.1073/pnas.1516510113. Epub 2016 May 16.

Chinese medical named entity recognition based on multi-granularity semantic dictionary and multimodal tree.

J Biomed Inform. 2020 Nov;111:103583. doi: 10.1016/j.jbi.2020.103583. Epub 2020 Sep 30.

Cited by

Hybrid deep learning models with multi-classification investor sentiment to forecast the prices of China's leading stocks.

PLoS One. 2023 Nov 27;18(11):e0294460. doi: 10.1371/journal.pone.0294460. eCollection 2023.

Multi-Task Joint Learning Model for Chinese Word Segmentation and Syndrome Differentiation in Traditional Chinese Medicine.

Int J Environ Res Public Health. 2022 May 5;19(9):5601. doi: 10.3390/ijerph19095601.
