Gobeill Julien, Tbahriti Imad, Ehrler Frédéric, Mottaz Anaïs, Veuthey Anne-Lise, Ruch Patrick
University and Hospitals of Geneva, Geneva, Switzerland.
BMC Bioinformatics. 2008 Apr 11;9 Suppl 3(Suppl 3):S9. doi: 10.1186/1471-2105-9-S3-S9.
This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference into Function), as defined in ENTREZ-Gene, from a MEDLINE record. The inputs for this task are a gene and a pointer to a MEDLINE reference. The proposed approach merges two independent sentence extraction strategies. The first strategy (LASt) uses argumentative features inspired by discourse-analysis models. The second (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in each sentence, thus providing a full ranking of all candidate GeneRiFs. The combination of the two approaches also aims to reduce the size of the selected segment by filtering out non-content-bearing rhetorical phrases.
On the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy alone is already competitive (52.78%). The combined approach clearly improves extraction, achieving a Dice score of over 57% (+10%).
Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.
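The results above are reported as a Dice score between the extracted segment and the reference GeneRiF. As a rough illustration only, the sketch below computes the classic set-based Dice coefficient, 2|A ∩ B| / (|A| + |B|), over word tokens. The tokenization rule and the function names `tokenize` and `dice` are assumptions made for this example; the abstract does not specify which Dice variant or preprocessing was used in the evaluation.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    # Assumed preprocessing: lowercase, keep runs of letters and digits.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice(candidate: str, reference: str) -> float:
    # Classic Dice coefficient on token sets: 2|A ∩ B| / (|A| + |B|).
    a, b = tokenize(candidate), tokenize(reference)
    if not a and not b:
        # Two empty strings are treated as identical (a convention chosen here).
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    # Hypothetical candidate sentence and reference annotation.
    print(dice("p53 regulates apoptosis", "p53 induces apoptosis"))
```

Under this definition, a candidate that shares two of its three tokens with a three-token reference scores 2·2/(3+3), about 0.67, which gives a sense of what a corpus-level score in the 50 to 60% range means.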