Jiang Min, Huang Yang, Fan Jung-wei, Tang Buzhou, Denny Josh, Xu Hua
BMC Med Inform Decis Mak. 2015;15 Suppl 1(Suppl 1):S2. doi: 10.1186/1472-6947-15-S1-S2. Epub 2015 May 20.
Parsing, which generates a syntactic structure of a sentence (a parse tree), is a critical component of natural language processing (NLP) research in any domain, including medicine. Although parsers developed in the general English domain, such as the Stanford parser, have been applied to clinical text, there have been no formal evaluations or comparisons of their performance in the medical domain.
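For illustration, the snippet below sketches how a general-English constituency parser can be applied to a clinical-style sentence. It is a minimal sketch assuming a locally running Stanford CoreNLP server accessed through NLTK's client; the sentence, URL, and setup are illustrative assumptions, not details from the study.

# Minimal sketch: constituency parsing of a clinical-style sentence with the
# Stanford parser, via NLTK's CoreNLP client. Assumes a CoreNLP server is
# already running locally (e.g., on port 9000); the example sentence is
# illustrative and not taken from the study's corpora.
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url="http://localhost:9000")

sentence = "The patient denies chest pain and shortness of breath."
tree = next(parser.raw_parse(sentence))  # returns an nltk.Tree (the parse tree)

tree.pretty_print()  # prints the bracketed constituency structure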
In this study, we investigated the performance of three state-of-the-art parsers: the Stanford parser, the Bikel parser, and the Charniak parser, using the following two datasets: (1) a Treebank containing 1,100 sentences that were randomly selected from progress notes used in the 2010 i2b2 NLP challenge and manually annotated according to a Penn Treebank-based guideline; and (2) the MiPACQ Treebank, which was developed from pathology notes and clinical notes and contains 13,091 sentences. We conducted three experiments on both datasets. First, we measured the performance of the three state-of-the-art parsers on the clinical Treebanks with their default settings. Then, we re-trained the parsers on the clinical Treebanks and evaluated their performance using 10-fold cross validation. Finally, we re-trained the parsers by combining the clinical Treebanks with the Penn Treebank.
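The following sketch illustrates the 10-fold cross-validation protocol described above, assuming the clinical Treebank is loaded as a list of (sentence, gold tree) pairs; train_parser and bracketing_f1 are hypothetical placeholders for the retraining and evaluation routines of whichever parser is being tested.

# Minimal sketch of 10-fold cross-validation over a clinical treebank,
# assuming it is loaded as a list of (sentence, gold_tree) pairs.
# train_parser() and bracketing_f1() are hypothetical placeholders for a
# given parser's retraining and evaluation routines.
from sklearn.model_selection import KFold

def cross_validate(treebank, n_splits=10, seed=42):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(treebank):
        train = [treebank[i] for i in train_idx]
        test = [treebank[i] for i in test_idx]
        model = train_parser(train)                 # hypothetical retraining step
        scores.append(bracketing_f1(model, test))   # hypothetical evaluation step
    return sum(scores) / len(scores)                # mean F-measure across folds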
Our results showed that the original parsers achieved lower performance on clinical text (Bracketing F-measure in the range of 66.6%-70.3%) than on general English text. After retraining on the clinical Treebanks, all parsers achieved better performance, with the Stanford parser performing best: it reached the highest Bracketing F-measure of 73.68% on progress notes and 83.72% on the MiPACQ corpus under 10-fold cross validation. When the clinical Treebanks were combined with the Penn Treebank, the Charniak parser achieved the highest Bracketing F-measure of 73.53% on progress notes and the Stanford parser reached the highest F-measure of 84.15% on the MiPACQ corpus.
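The Bracketing F-measure cited in these results is the PARSEVAL-style harmonic mean of bracketing precision and recall. The self-contained sketch below computes it under the assumption that gold and predicted parse trees have already been reduced to multisets of labeled constituent spans; the example spans are illustrative only.

# Sketch of the PARSEVAL-style Bracketing F-measure: each tree is reduced to a
# multiset of labeled constituent spans (label, start, end), and precision and
# recall are computed over matched brackets. Extraction of spans from actual
# parse trees is omitted for brevity.
from collections import Counter

def bracketing_f1(gold_brackets, predicted_brackets):
    gold = Counter(gold_brackets)
    pred = Counter(predicted_brackets)
    matched = sum((gold & pred).values())            # brackets present in both
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two of three predicted brackets match the three gold brackets
gold = [("NP", 0, 2), ("VP", 2, 6), ("S", 0, 6)]
pred = [("NP", 0, 2), ("VP", 3, 6), ("S", 0, 6)]
print(round(bracketing_f1(gold, pred), 3))  # 0.667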
Our study demonstrates that re-training on clinical Treebanks is critical for improving general English parsers' performance on clinical text, and that combining clinical and open-domain corpora may yield optimal performance for parsing clinical text.