Medlock Ben
University of Cambridge, Computer Laboratory, William Gates Building, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK.
J Biomed Inform. 2008 Aug;41(4):636-54. doi: 10.1016/j.jbi.2008.01.001. Epub 2008 Jan 11.
We investigate automatic identification of speculative language, or 'hedging', in scientific literature from the biomedical domain. Our contributions include a precise description of the task including annotation guidelines, theoretical analysis and discussion. We show that good agreement can be achieved using our guidelines and present a publicly available benchmark dataset for the task. We argue for separation of the acquisition and classification phases in semi-supervised machine learning, and present a probabilistic acquisition model which is evaluated both theoretically and experimentally. We explore the impact of different sample representations on classification accuracy across the learning curve and demonstrate the effectiveness of using machine learning for the hedge identification task. Finally, we examine the errors made by our approach and point toward avenues for future research.