Bethard Steven, Lu Zhiyong, Martin James H, Hunter Lawrence
Computer Science Department, University of Colorado at Boulder, Boulder, CO, USA.
BMC Bioinformatics. 2008 Jun 11;9:277. doi: 10.1186/1471-2105-9-277.
Automatic semantic role labeling (SRL) is a natural language processing (NLP) technique that maps sentences to semantic representations. It has been widely studied in recent years, but mostly on newswire data. Here, we report on an SRL model for identifying the semantic roles of biomedical predicates that describe protein transport in GeneRIFs, which are manually curated sentences focusing on gene functions. To avoid the computational cost of syntactic parsing, and because the boundaries of our protein transport roles often did not match syntactic phrase boundaries, we approached the problem with a word-chunking paradigm and trained support vector machine classifiers to classify each word as being at the beginning, inside or outside of a protein transport role.
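The beginning/inside/outside word-chunking scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `bio_encode`, the example sentence, the protein name `ProteinX` and the role spans are all invented for the example, and the role assignments are illustrative only.

```python
from typing import List, Tuple

# A role span is (start index, end index exclusive, role name).
# Hypothetical structure chosen for this sketch.
RoleSpan = Tuple[int, int, str]


def bio_encode(tokens: List[str], roles: List[RoleSpan]) -> List[str]:
    """Give every token a label: B-ROLE (beginning), I-ROLE (inside) or O (outside)."""
    labels = ["O"] * len(tokens)
    for start, end, role in roles:
        labels[start] = f"B-{role}"
        for i in range(start + 1, end):
            labels[i] = f"I-{role}"
    return labels


# Invented example sentence; "ProteinX" is a placeholder, not a real annotation.
tokens = "ProteinX translocates from the cytosol to the nucleus".split()
roles = [(0, 1, "PATIENT"), (3, 5, "ORIGIN"), (6, 8, "DESTINATION")]
labels = bio_encode(tokens, roles)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Under this encoding a classifier predicts one label per word, so role boundaries do not have to coincide with the phrase boundaries of a syntactic parse.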
We collected a set of 837 GeneRIFs describing movements of proteins between cellular components, whose predicates were annotated for the semantic roles AGENT, PATIENT, ORIGIN and DESTINATION. We trained our models with the features of previous word-chunking models, features adapted from phrase-chunking models, and features derived from an analysis of our data. The models labeled protein transport semantic roles with 87.6% precision and 79.0% recall when using manually annotated protein boundaries, and with 87.0% precision and 74.5% recall when using automatically identified ones.
We successfully adapted the word-chunking classification paradigm to semantic role labeling, applying it to a new domain whose predicates were completely absent from previous studies. By combining traditional word and phrasal role labeling features with biomedical features such as protein boundaries and MEDPOST part-of-speech tags, we were able to address the challenges posed by the new domain data and build robust models that achieved F-measures as high as 83.1. This system for extracting protein transport information from GeneRIFs performs well even with proteins identified automatically, and is therefore more robust than the rule-based methods previously used to extract protein transport roles.
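As a quick consistency check on the reported figures, the sketch below computes the F-measure from precision and recall. It assumes the balanced F-measure (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the helper name `f_measure` is introduced here for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (assumed variant)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f_manual = f_measure(87.6, 79.0)  # manually annotated protein boundaries
f_auto = f_measure(87.0, 74.5)    # automatically identified protein boundaries

print(round(f_manual, 1))  # 83.1
print(round(f_auto, 1))    # 80.3
```

Under that assumption, the manually annotated setting reproduces the reported 83.1, and the automatic setting works out to roughly 80.3, a value derived here rather than quoted from the abstract.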