School of Computer Science and Technology, Dalian University of Technology, Dalian, China.
Proteome Sci. 2013 Nov 7;11(Suppl 1):S17. doi: 10.1186/1477-5956-11-S1-S17.
Biomedical event extraction based on supervised machine learning is still held back by small labeled datasets, which are not large enough to saturate the learning method. As a result, many supervised learning algorithms for bio-event extraction suffer from data sparseness.
In this study, we present a semi-supervised method that combines labeled data with large-scale unlabeled data to improve the performance of biomedical event extraction. We propose a rich feature set that includes a variety of syntactic and semantic features, such as N-gram features, walk subsequence features, and predicate argument structure (PAS) features, together with new features derived from a strategy named Event Feature Coupling Generalization (EFCG). The EFCG algorithm creates useful event recognition features by exploiting the correlation between two sorts of original features drawn from the labeled data, where the correlation itself is computed from massive amounts of unlabeled data. The EFCG approach aims to solve the data sparseness problem caused by the limited annotated corpus, and it enables the new features to cover much more event-related information with better generalization properties.
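The coupling idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the two feature sorts or the correlation measure, so the sketch assumes sparse lexical features ("rare" features) and a small set of frequent indicator features ("anchor" features), and uses pointwise mutual information (PMI) over sentence-level co-occurrence as the correlation. All function and variable names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def cooccurrence_pmi(unlabeled_sentences, rare_feats, anchor_feats):
    """PMI of each (rare, anchor) feature pair, counted over unlabeled sentences.

    Assumption: PMI stands in for the unspecified correlation measure.
    """
    n = len(unlabeled_sentences)
    single = Counter()  # number of sentences containing each feature
    joint = Counter()   # number of sentences containing both features of a pair
    for tokens in unlabeled_sentences:
        present = set(tokens)
        rare_here = present & rare_feats
        anchor_here = present & anchor_feats
        single.update(rare_here | anchor_here)
        joint.update(product(rare_here, anchor_here))
    return {
        (r, a): math.log((c * n) / (single[r] * single[a]))
        for (r, a), c in joint.items()
    }


def coupled_features(tokens, pmi, anchor_feats):
    """Map one labeled example to dense features: for each anchor feature,
    the maximum PMI reached by any token of the example."""
    feats = {}
    for a in anchor_feats:
        scores = [pmi[(t, a)] for t in set(tokens) if (t, a) in pmi]
        if scores:
            feats[f"coupled:{a}"] = max(scores)
    return feats


# Toy data (invented for illustration only).
unlabeled = [
    ["phosphorylates", "phosphorylation"],
    ["phosphorylates", "phosphorylation"],
    ["binds", "binding"],
    ["cat", "dog"],
]
rare = {"phosphorylates", "binds"}
anchors = {"phosphorylation", "binding"}

pmi = cooccurrence_pmi(unlabeled, rare, anchors)
example = coupled_features(["X", "phosphorylates", "Y"], pmi, anchors)
```

In this toy setup, a trigger word that is rare in the labeled corpus still receives a dense feature value, because its association with the anchor feature is estimated from the unlabeled sentences rather than from the annotated ones.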
We evaluate the effectiveness of our event extraction system on datasets from the BioNLP Shared Task 2011 and PubMed. Experimental results show state-of-the-art performance on this fine-grained biomedical information extraction task.
With our EFCG approach, limited labeled data can be combined with unlabeled data to tackle the data sparseness problem, and the classification capability of the model is enhanced by building a rich feature set from both labeled and unlabeled datasets. This semi-supervised learning approach can therefore go far towards improving the performance of the event extraction system. To the best of our knowledge, this is the first attempt at combining labeled and unlabeled data for tasks related to biomedical event extraction.