Finding relevant references to genes and proteins in Medline using a Bayesian approach.

Author information

Leonard Julie E, Colombe Jeffrey B, Levy Joshua L

Affiliation

Incellico Inc, 2327 Englert Dr, Durham, NC 27713, USA.

Publication information

Bioinformatics. 2002 Nov;18(11):1515-22. doi: 10.1093/bioinformatics/18.11.1515.

PMID:12424124

Abstract

MOTIVATION

Mining the biomedical literature for references to genes and proteins always involves a tradeoff between high precision with false negatives, and high recall with false positives. Having a reliable method for assessing the relevance of literature mining results is crucial to finding ways to balance precision and recall, and for subsequently building automated systems to analyze these results. We hypothesize that abstracts and titles that discuss the same gene or protein use similar words. To validate this hypothesis, we built a dictionary- and rule-based system to mine Medline for references to genes and proteins, and used a Bayesian metric for scoring the relevance of each reference assignment.
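The abstract does not give the formula behind the Bayesian metric, so the following is only a minimal sketch of one generic way to score an assignment from word frequencies: a naive Bayes posterior with add-one smoothing. It should not be read as the authors' actual method. The function names `train` and `posterior`, the whitespace tokenization, and the two toy training texts are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(assignments):
    """Count words per symbol from (symbol, text) pairs of known assignments."""
    word_counts = defaultdict(Counter)  # symbol -> word -> count
    doc_counts = Counter()              # symbol -> number of training texts
    vocab = set()
    for symbol, text in assignments:
        words = text.lower().split()    # assumption: naive whitespace tokens
        word_counts[symbol].update(words)
        doc_counts[symbol] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def posterior(symbol, text, model):
    """Naive Bayes posterior P(symbol | words), with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Words never seen in training carry no information here, so skip them.
    words = [w for w in text.lower().split() if w in vocab]
    log_scores = {}
    for s in doc_counts:
        total = sum(word_counts[s].values())
        lp = math.log(doc_counts[s] / total_docs)  # prior from training counts
        for w in words:
            lp += math.log((word_counts[s][w] + 1) / (total + len(vocab)))
        log_scores[s] = lp
    # Normalize in log space to avoid underflow on long texts.
    m = max(log_scores.values())
    z = sum(math.exp(v - m) for v in log_scores.values())
    return math.exp(log_scores[symbol] - m) / z


# Hypothetical toy training data, not taken from the paper.
model = train([
    ("TP53", "tumor suppressor apoptosis"),
    ("INS", "insulin glucose diabetes"),
])
score_tp53 = posterior("TP53", "apoptosis in tumor cells", model)
score_ins = posterior("INS", "apoptosis in tumor cells", model)
```

In this toy case the text shares two words with the TP53 training text and none with the INS one, so the TP53 assignment scores higher. A real system would need proper tokenization, stop-word handling, and a far larger training set.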

RESULTS

We analyzed the entire set of Medline records from 1966 to late 2001, and scored each gene and protein reference using a Bayesian estimated probability (EP) based on word frequency in a training set of 137837 known assignments from 30594 articles to 36197 gene and protein symbols. Two test sets of 148 and 150 randomly chosen assignments, respectively, were hand-validated and categorized as either good or bad. The distributions of EP values, when plotted on a log-scale histogram, differ markedly between good and bad assignments. Using EP values, recall was 100% at 61% precision (EP = 2 × 10⁻⁵), 63% at 88% precision (EP = 0.008), and 10% at 100% precision (EP = 0.1). These results show that Medline entries discussing the same gene or protein have similar word usage, and that assessing this similarity with EP values is valid. The method enables an EP cutoff value to be determined that accurately and reproducibly balances precision and recall, allowing automated analysis of literature mining results.
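The cutoff tradeoff reported above can be illustrated with a short sketch that computes precision and recall for a hand-validated set of scored assignments at a chosen EP cutoff. The function name `precision_recall`, the convention of keeping assignments with EP at or above the cutoff, and the six sample scores are assumptions for illustration only; the sample values are invented and are not the paper's data.

```python
def precision_recall(scored, cutoff):
    """Return (precision, recall) when keeping assignments with EP >= cutoff.

    scored: list of (ep, is_good) pairs, where is_good is the hand-validated
    label for that assignment.
    """
    kept = [good for ep, good in scored if ep >= cutoff]
    total_good = sum(1 for _, good in scored if good)
    true_pos = sum(kept)  # True counts as 1
    # Convention (an assumption): precision is 1.0 when nothing is kept.
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / total_good if total_good else 0.0
    return precision, recall


# Invented sample of (EP, hand-validated label) pairs, not real results.
sample = [
    (0.2, True),
    (0.05, True),
    (0.009, True),
    (0.009, False),
    (0.0001, True),
    (0.00003, False),
]

strict = precision_recall(sample, 0.1)     # few kept, all good
middle = precision_recall(sample, 0.008)   # more kept, one bad
lenient = precision_recall(sample, 2e-5)   # everything kept
```

Raising the cutoff in this toy sample trades recall for precision, which mirrors the direction of the tradeoff reported in the results, though the numbers themselves carry no meaning.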
