Tang Buzhou, Chen Qingcai, Wang Xiaolong, Wu Yonghui, Zhang Yaoyun, Jiang Min, Wang Jingqi, Xu Hua
Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China; School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA.
Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China.
AMIA Annu Symp Proc. 2015 Nov 5;2015:1184-93. eCollection 2015.
Clinical concept recognition (CCR) is a fundamental task in the field of clinical natural language processing (NLP). Almost all current machine learning-based CCR systems can recognize only clinical concepts composed of consecutive words (consecutive clinical concepts, CCCs) and cannot handle clinical concepts composed of disjoint words (disjoint clinical concepts, DCCs), which are common in clinical text. In this paper, we proposed two novel types of representations for disjoint clinical concepts and applied two state-of-the-art machine learning methods to recognize both consecutive and disjoint concepts. Experiments on the 2013 ShARe/CLEF challenge corpus showed that our best system achieved a "strict" F-measure of 0.803 for CCCs, 0.477 for DCCs, and 0.783 for all clinical concepts, significantly higher than the baseline systems by 4.2% and 4.1%, respectively.
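As an illustration of the "strict" F-measure reported above, the following minimal sketch scores predicted concepts against gold concepts. It assumes that "strict" means an exact match on every span boundary, and that a disjoint concept counts as correct only when all of its spans match. The names `Span`, `Concept`, `concept`, and `strict_f_measure` are hypothetical and are not taken from the paper or from the challenge's evaluation script, which may differ in detail.

```python
from typing import FrozenSet, Set, Tuple

# A span is a (start, end) character offset pair. A concept is the set of
# its spans: one span for a consecutive concept, several for a disjoint one.
Span = Tuple[int, int]
Concept = FrozenSet[Span]


def concept(*spans: Span) -> Concept:
    """Build a concept from one or more spans (hypothetical helper)."""
    return frozenset(spans)


def strict_f_measure(
    gold: Set[Concept], predicted: Set[Concept]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) under exact matching.

    Assumption: a prediction is a true positive only if its full set of
    spans is identical to that of a gold concept.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: one consecutive concept and one disjoint concept in gold.
gold = {concept((0, 2)), concept((5, 6), (9, 10))}
# The system finds the consecutive concept but only one part of the
# disjoint concept, so the second prediction does not count as a match.
predicted = {concept((0, 2)), concept((5, 6))}

precision, recall, f_measure = strict_f_measure(gold, predicted)
print(precision, recall, f_measure)
```

Representing each concept as a set of spans lets consecutive and disjoint concepts be scored by the same function, which is one simple way to obtain separate scores for CCCs, DCCs, and all concepts by filtering the input sets.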