Rekapalli Hari Krishna, Cohen Aaron M, Hersh William R
Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, USA.
AMIA Annu Symp Proc. 2007 Oct 11;2007:620-4.
To identify the set of features that best explained the variation in Mean Average Passage Precision (MAPP), the performance measure of the TREC 2006 Genomics information extraction task.
A multivariate regression model of MAPP was built by backward elimination, using as predictors certain generalized features that were common to all the algorithms used by TREC 2006 Genomics track participants.
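As a rough illustration of backward elimination, the sketch below starts from a model containing every candidate feature and repeatedly drops the feature whose removal most improves the model, stopping when no removal helps. It is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not say which elimination criterion was used, so AIC is assumed here; the feature names and MAPP-like values are synthetic toy data, not TREC results; and `ols_rss`, `aic`, and `backward_eliminate` are hypothetical helper names.

```python
import math

def ols_rss(rows, y, cols):
    """Fit y ~ intercept + the selected columns by least squares; return the residual sum of squares."""
    X = [[1.0] + [float(r[c]) for c in cols] for r in rows]
    k = len(X[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting (assumes no collinear columns).
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        for r in range(k):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * xi for b, xi in zip(beta, x))) ** 2
               for x, yi in zip(X, y))

def aic(rss, n, n_params):
    """Gaussian AIC up to an additive constant (assumed criterion; requires rss > 0)."""
    return n * math.log(rss / n) + 2 * n_params

def backward_eliminate(rows, y, names):
    """Drop one feature at a time while doing so lowers AIC; return the surviving names."""
    selected = list(range(len(names)))
    n = len(y)
    best = aic(ols_rss(rows, y, selected), n, len(selected) + 1)
    while selected:
        # Score every model that has exactly one fewer feature.
        candidates = []
        for c in selected:
            reduced = [s for s in selected if s != c]
            candidates.append((aic(ols_rss(rows, y, reduced), n, len(reduced) + 1), c))
        score, drop = min(candidates)
        if score >= best:
            break  # no single removal improves the model
        best = score
        selected.remove(drop)
    return [names[c] for c in selected]

# Synthetic toy data: one row per hypothetical run, binary feature flags.
names = ["normalized_query", "used_thesaurus", "irrelevant_flag"]
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
# The response depends on the first two flags only, plus a small deterministic perturbation.
y = [0.02 + 0.03 * a + 0.02 * b + 0.001 * (-1) ** (a + b + c) for a, b, c in rows]

kept = backward_eliminate(rows, y, names)
print(kept)
```

In this toy setup the third flag has no relationship to the response, so the procedure should discard it and retain the other two. With real data a statistics package would normally be used, and the stopping rule (AIC, p-value threshold, adjusted R²) is a design choice that can change which features survive.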
Our regression analysis found that the following four factors were collectively associated with variation in MAPP: (1) normalization of keywords in the query; (2) use of the Entrez Gene thesaurus for synonym look-up; (3) the unit of text retrieved by the respective IR algorithms; and (4) the way a passage was defined.
These plausible hypotheses, generated by exploratory data analysis, are informative for understanding the results of the TREC 2006 Genomics passage extraction task. This approach has general value for analyzing the results of similar common challenge tasks.