Poulter Graham L, Rubin Daniel L, Altman Russ B, Seoighe Cathal
UCT NBN Node, Department of Molecular and Cell Biology, University of Cape Town, Cape Town, South Africa.
BMC Bioinformatics. 2008 Feb 19;9:108. doi: 10.1186/1471-2105-9-108.
Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains.
MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92.
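The ranking and evaluation described above can be illustrated with a minimal sketch. It is not MScanner's actual code: it assumes a simplified Bayesian scorer that sums smoothed log-likelihood ratios over the features present in a record, and it computes ROC area by direct pairwise comparison. The function names, the feature strings (such as "MeSH:Pharmacogenetics"), the PubMed IDs, and the pseudocount of 1.0 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_feature_weights(relevant_docs, background_docs, pseudocount=1.0):
    """Return a log-likelihood-ratio weight for each feature.

    Each document is a set of feature strings (MeSH terms and journal).
    The smoothing scheme is an assumption, not taken from the paper.
    """
    rel_counts = Counter(f for doc in relevant_docs for f in set(doc))
    bg_counts = Counter(f for doc in background_docs for f in set(doc))
    n_rel, n_bg = len(relevant_docs), len(background_docs)
    weights = {}
    for feature in set(rel_counts) | set(bg_counts):
        p_rel = (rel_counts[feature] + pseudocount) / (n_rel + 2 * pseudocount)
        p_bg = (bg_counts[feature] + pseudocount) / (n_bg + 2 * pseudocount)
        weights[feature] = math.log(p_rel / p_bg)
    return weights


def score(doc, weights):
    """Sum the weights of the features present; unseen features add 0."""
    return sum(weights.get(feature, 0.0) for feature in set(doc))


def rank(docs_by_pmid, weights):
    """Return PubMed IDs ordered by decreasing score."""
    return sorted(
        docs_by_pmid,
        key=lambda pmid: score(docs_by_pmid[pmid], weights),
        reverse=True,
    )


def roc_area(relevant_scores, irrelevant_scores):
    """Probability that a relevant record outscores an irrelevant one.

    Ties count as half a win. Quadratic time, fine for a small sketch.
    """
    wins = 0.0
    for r in relevant_scores:
        for i in irrelevant_scores:
            if r > i:
                wins += 1.0
            elif r == i:
                wins += 0.5
    return wins / (len(relevant_scores) * len(irrelevant_scores))


# Hypothetical toy data: relevant training examples and a background sample.
relevant = [
    {"MeSH:Pharmacogenetics", "Journal:Pharmacogenomics"},
    {"MeSH:Pharmacogenetics", "MeSH:Humans"},
]
background = [
    {"MeSH:Humans", "Journal:Nature"},
    {"MeSH:Mice"},
    {"MeSH:Humans"},
]
candidates = {
    "111": {"MeSH:Pharmacogenetics", "MeSH:Humans"},
    "222": {"MeSH:Mice"},
    "333": {"MeSH:Humans"},
}

weights = train_feature_weights(relevant, background)
ranking = rank(candidates, weights)
print(ranking)
```

Because each record is reduced to a small set of categorical features, scoring is a lookup and a sum per record, which is consistent with the abstract's point that a concise representation favours speed over a full-text model.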
MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu.