Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China.
Department of Statistics, University of Michigan, Ann Arbor, MI, USA.
J Biomed Inform. 2020 Oct;110:103542. doi: 10.1016/j.jbi.2020.103542. Epub 2020 Aug 24.
This study aims to achieve unsupervised term discovery in Chinese electronic health records (EHRs) through word segmentation. Existing supervised algorithms perform poorly on EHRs because annotated medical data are scarce. We propose an unsupervised segmentation method (GTS) based on the graph partition principle, whose multi-granular segmentation capability supports efficient term discovery.
A sentence is converted to an undirected graph, with the edge weights based on n-gram statistics, and ratio cut is used to split the sentence into words. The graph partition is solved efficiently via dynamic programming, and multi-granularity is realized by setting different partition numbers. A BERT-based discriminator is trained using generated samples to verify the correctness of the word boundaries. The words that are not recorded in existing dictionaries are retained as potential medical terms.
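The graph-partition step can be illustrated with a minimal sketch. It is not the authors' GTS implementation: the edge weights here are toy bigram counts between neighbouring characters (an assumption; the abstract says only that weights are based on n-gram statistics), the function names `adjacency_weights` and `ratio_cut_segment` are hypothetical, and the BERT-based discriminator and dictionary filtering are omitted. The sketch shows how a ratio-cut objective over contiguous segments can be minimized by dynamic programming, and how changing the partition number `k` yields coarser or finer segmentations.

```python
from typing import Dict, List


def adjacency_weights(sentence: str, bigram_counts: Dict[str, int]) -> List[List[float]]:
    """Toy edge weights (assumption): link neighbouring characters by bigram count."""
    n = len(sentence)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        count = float(bigram_counts.get(sentence[i:i + 2], 0))
        w[i][i + 1] = w[i + 1][i] = count
    return w


def ratio_cut_segment(sentence: str, w: List[List[float]], k: int) -> List[str]:
    """Split `sentence` into k contiguous segments minimizing sum(cut(A) / |A|)."""
    n = len(sentence)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the sentence length")

    def cost(i: int, j: int) -> float:
        # Ratio-cut term of segment sentence[i:j]:
        # total edge weight leaving the segment, divided by its length.
        cut = sum(w[a][b] for a in range(i, j)
                  for b in range(n) if b < i or b >= j)
        return cut / (j - i)

    inf = float("inf")
    # best[p][j]: minimal cost of splitting the first j characters into p segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                candidate = best[p - 1][i] + cost(i, j)
                if candidate < best[p][j]:
                    best[p][j] = candidate
                    back[p][j] = i

    # Walk the back-pointers from the end of the sentence to recover the segments.
    words: List[str] = []
    j = n
    for p in range(k, 0, -1):
        i = back[p][j]
        words.append(sentence[i:j])
        j = i
    return words[::-1]


# Placeholder characters and made-up counts, purely for illustration.
toy_counts = {"ab": 10, "bc": 2, "cd": 10, "de": 1, "ef": 10}
weights = adjacency_weights("abcdef", toy_counts)
coarse = ratio_cut_segment("abcdef", weights, 2)
fine = ratio_cut_segment("abcdef", weights, 3)
print(coarse)  # ['abcd', 'ef']
print(fine)    # ['ab', 'cd', 'ef']
```

Because each segment's cost depends only on its own boundaries, the optimal k-way split of a prefix can be built from optimal (k-1)-way splits of shorter prefixes, which is what makes the dynamic program valid in this simplified setting.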
We compared the GTS approach with mature segmentation systems on both word segmentation and term discovery. MD students manually segmented Chinese EHRs at fine and coarse granularity levels and reviewed the term discovery results. The proposed unsupervised method outperformed all competing algorithms in the word segmentation task. In term discovery, GTS outperformed the best baseline by 17 percentage points in F1-score (a 47% relative improvement).
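The absolute and relative gains quoted above are linked by simple arithmetic. The short sketch below backs out the F1-scores they imply. The abstract does not state the underlying scores, and both reported figures are rounded, so the result is only a rough estimate; the helper name `implied_f1_scores` is hypothetical.

```python
from typing import Tuple


def implied_f1_scores(abs_gain_points: float, rel_gain: float) -> Tuple[float, float]:
    """Return (baseline, improved) F1 in percent implied by the two reported gains."""
    # relative gain = absolute gain / baseline  =>  baseline = absolute gain / relative gain
    baseline = abs_gain_points / rel_gain
    return baseline, baseline + abs_gain_points


baseline_f1, gts_f1 = implied_f1_scores(17.0, 0.47)
print(round(baseline_f1, 1), round(gts_f1, 1))  # 36.2 53.2
```

Under these rounded inputs, the best baseline would sit at roughly 36% F1 and GTS at roughly 53%.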
In the absence of annotated training data, the graph partition technique can effectively use the corpus statistics and even expert knowledge to realize unsupervised word segmentation of EHRs. Multi-granular segmentation can be used to provide potential medical terms of various lengths with high accuracy.