Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA.
BMC Bioinformatics. 2011 Jun 9;12 Suppl 3(Suppl 3):S7. doi: 10.1186/1471-2105-12-S3-S7.
Automated extraction of bibliographic data, such as article titles, author names, abstracts, and references, is essential to the affordable creation of large citation databases. References, which typically appear at the end of journal articles, can also provide valuable information for extracting other bibliographic data. Parsing individual references to extract author, title, journal, year, and similar fields is therefore sometimes a necessary preprocessing step in building citation-indexing systems. The regular structure of references allows us to treat reference parsing as a sequence learning problem and to study the structural Support Vector Machine (structural SVM), a newly developed structured learning algorithm, for parsing references.
In this study, we implemented structural SVM and used two types of contextual features to compare structural SVM with conventional SVM. Both methods achieve above 98% token classification accuracy and above 95% overall chunk-level accuracy for reference parsing. We also compared SVM and structural SVM to Conditional Random Field (CRF). The experimental results show that structural SVM and CRF achieve similar accuracies at both the token and chunk levels.
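To make the distinction between the two reported metrics concrete, the following Python sketch computes token-level and chunk-level accuracy for one labeled reference. It rests on assumptions not stated in the abstract: a chunk is taken to be a maximal run of consecutive tokens sharing one label, and a gold chunk counts as correct only when the prediction contains a chunk with the same label and the same boundaries. The function names and the example labels are hypothetical.

```python
from itertools import groupby


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def chunks(labels):
    """Return (label, start, end) spans for maximal runs of equal labels."""
    spans = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        spans.append((label, start, start + length))
        start += length
    return spans


def chunk_accuracy(gold, pred):
    """Fraction of gold chunks reproduced exactly (assumed definition)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("label sequences must be non-empty and equal in length")
    gold_chunks = chunks(gold)
    pred_chunks = set(chunks(pred))
    return sum(c in pred_chunks for c in gold_chunks) / len(gold_chunks)


# Hypothetical labels for a seven-token reference.
gold = ["author", "author", "title", "title", "title", "journal", "year"]
pred = ["author", "author", "title", "title", "journal", "journal", "year"]

print(token_accuracy(gold, pred))
print(chunk_accuracy(gold, pred))
```

Under this definition a single mislabeled token damages two chunks (the title and the journal), which is one reason chunk-level accuracy tends to be lower than token-level accuracy.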
When only basic observation features are used for each token, structural SVM outperforms SVM because it uses contextual label features. However, when contextual observation features from neighboring tokens are added, SVM performance improves greatly and comes close to that of structural SVM once second-order contextual observation features are included. A comparison of these two methods with CRF using the same set of binary features shows that both structural SVM and CRF perform better than SVM, indicating their stronger sequence learning ability in reference parsing.
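The idea of augmenting a token's own features with those of its neighbors can be sketched as follows. This is an illustration under assumptions, not the feature set used in the study: the specific binary features are invented for the example, and "order" is interpreted here as the window radius, so order 1 copies features from offsets -1 and +1, and order 2 additionally from offsets -2 and +2.

```python
import re


def basic_features(token):
    """Binary observation features for a single token (illustrative only)."""
    feats = {"lower=" + token.lower()}
    if token[:1].isupper():
        feats.add("init_cap")
    if token.isdigit():
        feats.add("all_digits")
    if re.fullmatch(r"(19|20)\d{2}", token):
        feats.add("year_like")
    if any(ch in ".,;:()" for ch in token):
        feats.add("has_punct")
    return feats


def contextual_features(tokens, i, order):
    """Token i's basic features plus offset-tagged features of its neighbors.

    order=0 gives basic features only; order=k adds neighbors within k
    positions (an assumed reading of "k-th order contextual features").
    """
    feats = set(basic_features(tokens[i]))
    for offset in range(-order, order + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            for f in basic_features(tokens[j]):
                feats.add(f"{offset:+d}:{f}")
        else:
            # Mark positions that fall outside the reference string.
            feats.add(f"{offset:+d}:PAD")
    return feats


# Hypothetical tokenized fragment of a reference.
tokens = ["Smith", "J", ".", "2011"]
print(sorted(contextual_features(tokens, 0, 1)))
print(sorted(contextual_features(tokens, 0, 2)))
```

A classifier that labels each token independently, such as a conventional SVM, sees no neighboring labels, so widening this observation window is its only route to context. That is consistent with the reported narrowing of the gap to structural SVM once second-order features are added.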