McDonald Ryan, Scott Winters R, Ankuda Claire K, Murphy Joan A, Rogers Amy E, Pereira Fernando, Greenblatt Marc S, White Peter S
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA.
Hum Mutat. 2006 Sep;27(9):957-64. doi: 10.1002/humu.20363.
The proliferation of biomedical literature makes it increasingly difficult for researchers to find and manage relevant information. In particular, identifying research articles that contain mutation data, a requisite first step in integrating large and complex mutation data sets, is currently tedious, time-consuming, and imprecise. More effective mechanisms for identifying such articles would benefit both the curation of mutation databases and individual researchers. We developed an automated method that uses information extraction, classification, and relevance ranking techniques to estimate the likelihood that a MEDLINE abstract contains genomic variation data suitable for inclusion in mutation databases. As a measure of feasibility, we targeted the CDKN2A (p16) gene and the document identification procedure currently used by CDKN2A Database curators. A set of abstracts was manually identified from a MEDLINE search as potentially containing specific CDKN2A mutation events. A subset of these abstracts was used to train a maximum entropy classifier to identify text features distinguishing "relevant" from "not relevant" abstracts. Each document was represented as a set of indicative word, word pair, and entity tagger-derived genomic variation features. When applied to a test set of 200 candidate abstracts, the classifier predicted 88 articles as relevant; these included 29 of the 32 manuscripts in which manual curation found CDKN2A sequence variants. Thus, the set of potentially useful articles that a manual curator would have to review was reduced by 56%, while maintaining 91% recall (sensitivity) and more than doubling precision (positive predictive value). Subsequent expansion of the training set to 494 articles yielded similar precision and recall rates, and comparison of the original and expanded trials demonstrated that average precision improved with the larger data set.
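The reported percentages can be checked directly from the counts given above. The short Python sketch below does this arithmetic; the function and variable names are illustrative, and the baseline precision assumes that, without the classifier, a curator would review all 200 candidates, of which 32 are relevant.

```python
def review_metrics(candidates, predicted_relevant, truly_relevant, true_positives):
    """Compute screening metrics from raw counts (names are illustrative)."""
    return {
        # Fraction of candidate abstracts the curator no longer has to read.
        "workload_reduction": 1 - predicted_relevant / candidates,
        # Recall (sensitivity): relevant articles that were retrieved.
        "recall": true_positives / truly_relevant,
        # Precision (positive predictive value) among predicted-relevant articles.
        "precision": true_positives / predicted_relevant,
        # Assumed baseline: precision when every candidate is reviewed.
        "baseline_precision": truly_relevant / candidates,
    }


# Counts taken from the abstract: 200 candidates, 88 predicted relevant,
# 32 truly relevant, 29 of those 32 retrieved.
m = review_metrics(candidates=200, predicted_relevant=88,
                   truly_relevant=32, true_positives=29)

for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

Under that assumption, precision rises from 32/200 to 29/88, which is consistent with the "more than doubling" statement.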
Our results show that automated systems can effectively identify article subsets relevant to a given task and may prove to be powerful tools for the broader research community. This procedure can be readily adapted to any or all genes, organisms, or sets of documents.
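To make the classification and ranking steps concrete, here is a minimal Python sketch, not the authors' implementation. It relies on the fact that a two-class maximum entropy classifier is equivalent to logistic regression. The regular expression is a hypothetical stand-in for the entity tagger, the training procedure (stochastic gradient ascent with L2 regularization) is an assumption, and all example abstracts are invented for illustration only.

```python
import math
import re
from collections import defaultdict

# Hypothetical stand-in for the entity tagger: matches mentions such as
# "G101W" or "c.301G>T". The real tagger is not described in the abstract.
MUTATION = re.compile(r"\b(?:[A-Z]\d+[A-Z]|c\.\d+[ACGT]>[ACGT])\b")


def features(text):
    """Binary features: words, adjacent word pairs, and a mutation-mention flag."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    feats = {f"w={w}" for w in words}
    feats |= {f"p={a}_{b}" for a, b in zip(words, words[1:])}
    if MUTATION.search(text):
        feats.add("has_mutation_mention")
    return feats


def probability(weights, feats):
    """P(relevant | features) under a two-class maximum entropy model."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))


def train(docs, labels, epochs=100, rate=0.1, l2=0.01):
    """Fit feature weights by stochastic gradient ascent (assumed procedure)."""
    weights = defaultdict(float)
    data = [(features(d), y) for d, y in zip(docs, labels)]
    for _ in range(epochs):
        for feats, y in data:
            error = y - probability(weights, feats)
            for f in feats:
                weights[f] += rate * (error - l2 * weights[f])
    return weights


# Invented toy abstracts; 1 = relevant, 0 = not relevant.
train_docs = [
    "Germline CDKN2A mutation G101W identified in melanoma families",
    "Novel missense mutation c.301G>T in CDKN2A detected by sequencing",
    "Review of cell cycle regulation by p16 and cyclin dependent kinases",
    "CDKN2A promoter methylation and expression in lung tumors",
]
train_labels = [1, 1, 0, 0]
weights = train(train_docs, train_labels)

# Relevance ranking: order unseen abstracts by predicted probability.
doc_a = "Sequencing of CDKN2A revealed the missense mutation R24P in melanoma kindreds"
doc_b = "Expression of p16 protein during cell cycle arrest"
ranked = sorted([doc_b, doc_a],
                key=lambda d: probability(weights, features(d)),
                reverse=True)
for doc in ranked:
    print(f"{probability(weights, features(doc)):.2f}  {doc}")
```

Adapting such a sketch to another gene or document collection would mainly mean supplying different labeled abstracts, which fits the claim that the procedure is readily adaptable.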