Ramesh Balaji Polepalli, Yu Hong
University of Wisconsin Milwaukee, Milwaukee, WI.
AMIA Annu Symp Proc. 2010 Nov 13;2010:657-61.
Discourse connectives are words or phrases that connect or relate two coherent sentences or phrases and indicate the presence of discourse relations. Automatic recognition of discourse connectives may benefit many natural language processing applications. In this pilot study, we report the development of supervised machine-learning classifiers based on conditional random fields (CRFs) for automatically identifying discourse connectives in full-text biomedical articles. Our first classifier was trained on the open-domain, 1-million-token Penn Discourse Treebank (PDTB). We also performed cross-validation on biomedical articles (approximately 100K word tokens) that we annotated. The classifier trained on PDTB data attained an F1-score of 0.55 for identifying discourse connectives in biomedical text, while cross-validation on the biomedical text attained an F1-score of 0.69, a much better performance despite a much smaller training set. Our preliminary analysis suggests the existence of domain-specific features, and we speculate that domain-adaptation approaches may further improve performance.
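To make the reported F1-scores concrete, the following is a minimal sketch of how connective identification could be scored. The abstract does not state the labeling scheme or the matching criterion, so the BIO token labels, the exact-span matching rule, and the function names `bio_to_spans` and `f1_score` are all assumptions made for illustration, not the authors' evaluation code.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def bio_to_spans(labels: List[str]) -> Set[Span]:
    """Collect connective spans from a BIO label sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    for i, label in enumerate(labels):
        if label == "B" or (label == "I" and start is None):
            # A new connective begins; close any span that is still open.
            if start is not None:
                spans.add((start, i))
            start = i
        elif label == "O" and start is not None:
            spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(labels)))
    return spans


def f1_score(gold: List[str], pred: List[str]) -> float:
    """F1 over exact-match connective spans (assumed matching criterion)."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence: "However , levels fell as a result of treatment"
gold = ["B", "O", "O", "O", "B", "I", "I", "O", "O"]
pred = ["B", "O", "O", "O", "B", "I", "O", "O", "O"]  # truncates "as a result"
print(f1_score(gold, pred))
```

In this toy example the prediction recovers "However" but cuts the multi-word connective short, so one of two predicted spans and one of two gold spans match. A more lenient criterion, such as partial overlap, would give a different score.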