Department of Biomedical Informatics, University of Utah, HSEB 5775, Salt Lake City, UT, USA.
BMC Med Inform Decis Mak. 2011 Feb 1;11:6. doi: 10.1186/1472-6947-11-6.
Traditional information retrieval techniques typically return excessive output when directed at large bibliographic databases. Natural Language Processing applications strive to extract salient content from the excessive data. Semantic MEDLINE, a National Library of Medicine (NLM) natural language processing application, highlights relevant information in PubMed data. However, Semantic MEDLINE implements manually coded schemas, accommodating few information needs. Currently, there are only five such schemas, while many more would be needed to realistically accommodate all potential users. The aim of this project was to develop and evaluate a statistical algorithm that automatically identifies relevant bibliographic data; the new algorithm could be incorporated into a dynamic schema to accommodate various information needs in Semantic MEDLINE, and eliminate the need for multiple schemas.
We developed a flexible algorithm named Combo that combines three statistical metrics, the Kullback-Leibler Divergence (KLD), Riloff's RlogF metric (RlogF), and a new metric called PredScal, to automatically identify salient data in bibliographic text. We downloaded citations from a PubMed search query addressing the genetic etiology of bladder cancer. The citations were processed with SemRep, an NLM rule-based application that produces semantic predications. The SemRep output was then summarized four ways: by Combo, by the standard Semantic MEDLINE genetics schema, and by the KLD and RlogF metrics each applied on its own. We evaluated each summarization method using an existing reference standard within the task-based context of genetic database curation.
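For readers unfamiliar with the two established metrics named above, the following is a minimal sketch of their textbook forms only. The abstract does not state how Combo weights or combines them, nor how PredScal is defined, so neither is attempted here. The function names `kld_term` and `rlogf` and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import math


def kld_term(p: float, q: float) -> float:
    """Contribution of one item to the Kullback-Leibler divergence D(P || Q).

    p: relative frequency of the item in the query-specific data (assumed input)
    q: relative frequency of the same item in a background corpus (assumed input)
    """
    if p == 0:
        return 0.0  # by convention, 0 * log(0 / q) is taken as 0
    return p * math.log2(p / q)


def rlogf(relevant_freq: int, total_freq: int) -> float:
    """Riloff's RlogF score: log2(F) * (F / N).

    relevant_freq (F): occurrences of the item in the relevant data
    total_freq (N):    occurrences of the item overall
    """
    if relevant_freq == 0:
        return 0.0
    return math.log2(relevant_freq) * (relevant_freq / total_freq)


# Toy values chosen for illustration, not drawn from the study.
print(kld_term(0.5, 0.25))
print(rlogf(8, 16))
```

Both scores grow when an item is disproportionately frequent in the data of interest, which is the property a saliency filter needs.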
Combo asserted 74 genetic entities implicated in bladder cancer development, whereas the traditional schema asserted 10 genetic entities; the KLD and RlogF metrics individually asserted 77 and 69 genetic entities, respectively. Combo achieved 61% recall and 81% precision, with an F-score of 0.69. The traditional schema achieved 23% recall and 100% precision, with an F-score of 0.37. The KLD metric achieved 61% recall and 70% precision, with an F-score of 0.65. The RlogF metric achieved 61% recall and 72% precision, with an F-score of 0.66.
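The reported F-scores are consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition (the abstract does not state which F-measure was used) and recomputes it from the rounded percentages above. Because the inputs are rounded, a recomputed value can differ from the published one in the last digit. The helper name `f_score` is an illustrative assumption.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract, as fractions.
reported = {
    "Combo": (0.81, 0.61),
    "Traditional schema": (1.00, 0.23),
    "KLD": (0.70, 0.61),
    "RlogF": (0.72, 0.61),
}

for method, (precision, recall) in reported.items():
    print(f"{method}: F = {f_score(precision, recall):.3f}")
```

The calculation shows why the traditional schema scores lowest despite perfect precision: the harmonic mean is pulled down by its 23% recall.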
Semantic MEDLINE summarization using the new Combo algorithm outperformed a conventional summarization schema in a genetic database curation task. It could potentially streamline information acquisition for other needs without requiring multiple hand-built saliency schemas.