Computer Laboratory, University of Cambridge, Cambridge, UK.
Bioinformatics. 2013 Jun 1;29(11):1440-7. doi: 10.1093/bioinformatics/btt163. Epub 2013 Apr 5.
Techniques that can automatically analyze the information structure of scientific articles could be highly useful for improving information access to biomedical literature. However, most existing approaches rely on supervised machine learning (ML) and on substantial labeled data that are expensive to develop and to apply to different sub-fields of biomedicine. Recent research shows that minimal supervision is sufficient for fairly accurate information structure analysis of biomedical abstracts. But is this realistic for full articles, given their high linguistic and informational complexity? We introduce and release a novel corpus of 50 biomedical articles annotated according to the Argumentative Zoning (AZ) scheme, and investigate active learning with one of the most widely used ML models, Support Vector Machines (SVM), on this corpus. Additionally, we introduce two novel applications that use AZ to support real-life literature review in biomedicine via question answering and summarization.
We show that active learning with an SVM trained on 500 labeled sentences (6% of the corpus) performs surprisingly well, reaching an accuracy of 82%, just 2% lower than fully supervised learning. In our question answering task, biomedical researchers find relevant information significantly faster from AZ-annotated articles than from unannotated ones. In the summarization task, sentences extracted from particular zones are significantly more similar to gold standard summaries than those extracted from particular sections of full articles. These results demonstrate that active learning of full articles' information structure is indeed realistic and that the accuracy is high enough to support real-life literature review in biomedicine.
The annotated corpus, our AZ classifier and the two novel applications are available at http://www.cl.cam.ac.uk/yg244/12bioinfo.html
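The active learning setup described above can be illustrated with a minimal sketch. The abstract does not state the query strategy, the features, or how the multi-class AZ labels are handled, so everything below is an assumption made for illustration: synthetic two-dimensional points stand in for sentence features, the labels are binary, the SVM is a linear one trained with Pegasos-style subgradient descent, and the learner queries the pool item closest to the decision boundary (uncertainty sampling). The function names `make_toy_data`, `train_linear_svm`, `decision` and `active_learn` are hypothetical and are not taken from the released classifier.

```python
import random


def make_toy_data(n, seed=0):
    """Synthetic stand-in for sentence features (assumption, not AZ data)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        # Two Gaussian clusters; the trailing 1.0 acts as a bias feature.
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0), 1.0])
        y.append(label)
    return X, y


def decision(w, x):
    """Signed score of x under the linear model w."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(X, y, epochs=200, lam=0.01, seed=0):
    """Linear SVM via Pegasos-style subgradient descent on the hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * decision(w, X[i])
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # hinge loss is active for this example
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def active_learn(X, oracle, seed_idx, n_queries):
    """Uncertainty sampling: label the pool item nearest the hyperplane."""
    labeled = list(seed_idx)
    seen = set(labeled)
    pool = [i for i in range(len(X)) if i not in seen]
    for _ in range(n_queries):
        w = train_linear_svm([X[i] for i in labeled], [oracle[i] for i in labeled])
        query = min(pool, key=lambda i: abs(decision(w, X[i])))
        pool.remove(query)
        labeled.append(query)  # the oracle supplies this label on the next round
    w = train_linear_svm([X[i] for i in labeled], [oracle[i] for i in labeled])
    return w, labeled


X, y = make_toy_data(200)
w, labeled = active_learn(X, y, seed_idx=[0, 1], n_queries=10)
correct = sum((1 if decision(w, x) >= 0 else -1) == t for x, t in zip(X, y))
print(f"labeled {len(labeled)} of {len(X)} items, accuracy {correct / len(X):.2f}")
```

The sketch is meant to show why only a small labeled fraction may be needed: each round spends the annotation budget on the example the current model is least sure about, rather than on a random one. A real AZ classifier would need multi-class handling and sentence-level features, which are not shown here.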