Human Language Technology Research Institute, The University of Texas at Dallas, Richardson, Texas, USA.
J Am Med Inform Assoc. 2013 Sep-Oct;20(5):867-75. doi: 10.1136/amiajnl-2013-001619. Epub 2013 May 18.
To provide a natural language processing method for the automatic recognition of events, temporal expressions, and temporal relations in clinical records.
A combination of supervised, unsupervised, and rule-based methods was used. Supervised methods include conditional random fields and support vector machines. A flexible automated feature selection technique was used to select the best subset of features for each supervised task. Unsupervised methods include Brown clustering on several corpora, which makes our overall method semisupervised.
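As an illustration of what automated feature selection can look like, the sketch below uses greedy forward selection. This is an assumption: the abstract does not say which selection algorithm was used. The feature names and the `toy_score` function are hypothetical stand-ins for a real cross-validated F1 score.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def greedy_forward_selection(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Add one feature at a time, keeping the feature that most improves the score.

    Stops when no remaining feature improves on the current best score.
    """
    selected: List[str] = []
    best = score(frozenset())
    while True:
        best_feature = None
        for feature in candidates:
            if feature in selected:
                continue
            trial = score(frozenset(selected + [feature]))
            if trial > best:
                best = trial
                best_feature = feature
        if best_feature is None:
            return selected, best
        selected.append(best_feature)


# Hypothetical per-feature contributions (integer points); in practice the
# score would come from training and evaluating a classifier on held-out data.
TOY_GAINS = {"word": 50, "pos": 20, "brown_cluster": 10, "noise": -5}


def toy_score(features: FrozenSet[str]) -> float:
    return sum(TOY_GAINS[f] for f in features)


if __name__ == "__main__":
    chosen, final_score = greedy_forward_selection(list(TOY_GAINS), toy_score)
    print(chosen, final_score)
```

With these made-up gains, the harmful `noise` feature is never selected, because adding it would lower the score.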
On the 2012 Informatics for Integrating Biology and the Bedside (i2b2) shared task data, we achieved an overall event F1-measure of 0.8045, an overall temporal expression F1-measure of 0.6154, an overall temporal link detection F1-measure of 0.5594, and an end-to-end temporal link detection F1-measure of 0.5258. Our most competitive component was the event recognition method, which ranked third out of the 14 participants in the event task.
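For readers unfamiliar with the metric, the F1-measure is the harmonic mean of precision and recall. The sketch below computes it from raw counts. The counts in the example are invented for illustration and are not taken from the i2b2 evaluation, whose span-matching rules are not described here.

```python
def f1_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1-measure (harmonic mean of precision and recall)."""
    predicted = true_positives + false_positives  # items the system output
    gold = true_positives + false_negatives       # items in the reference annotation
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented counts: 8 correct predictions, 2 spurious, 2 missed.
    print(round(f1_measure(8, 2, 2), 4))
```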
Analysis reveals the event recognition method has difficulty determining which modifiers to include/exclude in the event span. The temporal expression recognition method requires significantly more normalization rules, although many of these rules apply only to a small number of cases. Finally, the temporal relation recognition method requires more advanced medical knowledge and could be improved by separating the single discourse relation classifier into multiple, more targeted component classifiers.
Recognizing events and temporal expressions can be achieved accurately by combining supervised and unsupervised methods, even when only minimal medical knowledge is available. Temporal normalization and temporal relation recognition, however, are far more dependent on the modeling of medical knowledge.