Miotto Olivo, Tan Tin Wee, Brusic Vladimir
Institute of Systems Science, National University of Singapore, 25 Heng Mui Keng Terrace, Singapore 119615.
Genome Inform. 2005;16(2):32-44.
Curators of biological databases transfer knowledge from scientific publications, a laborious and expensive manual process. Machine learning algorithms can reduce the workload of curators by filtering relevant biomedical literature, though their widespread adoption will depend on the availability of intuitive tools that can be configured for a variety of tasks. We propose a new method for supporting curators by means of document categorization, and describe the architecture of a curator-oriented tool implementing this method using techniques that require no computational linguistic or programming expertise. To demonstrate the feasibility of this approach, we prototyped an application of this method to support a real curation task: identifying PubMed abstracts that contain allergen cross-reactivity information. We tested the performance of two different classifier algorithms (CART and ANN), applied to both composite and single-word features, using several feature scoring functions. Both classifiers exceeded our performance targets, the ANN classifier yielding the best results. These results show that the method we propose can deliver the level of performance needed to assist database curation.
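The abstract mentions scoring single-word features with several feature scoring functions but does not name them. As a rough illustration only, the sketch below uses the chi-square statistic, a common choice in document categorization, which is an assumption here and not necessarily one of the functions the authors tested. The four example abstracts, their labels, and the function names `tokenize`, `chi_square_scores`, and `top_features` are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and return its set of single-word features."""
    return set(re.findall(r"[a-z]+", text.lower()))


def chi_square_scores(docs, labels):
    """Score each word with the chi-square statistic of its 2x2 table
    (word present/absent vs. document relevant/irrelevant)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for text, label in zip(docs, labels):
        (pos_df if label else neg_df).update(tokenize(text))
    scores = {}
    for word in set(pos_df) | set(neg_df):
        a = pos_df[word]   # relevant documents containing the word
        b = neg_df[word]   # irrelevant documents containing the word
        c = n_pos - a      # relevant documents without the word
        d = n_neg - b      # irrelevant documents without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[word] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_features(docs, labels, k):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical toy corpus: 1 = mentions allergen cross-reactivity, 0 = does not.
docs = [
    "cross-reactivity between birch pollen and apple allergens",
    "IgE cross-reactivity of latex with banana allergens",
    "crystal structure of a bacterial enzyme",
    "gene expression in yeast under heat stress",
]
labels = [1, 0 + 1, 0, 0]

print(top_features(docs, labels, 3))  # ['allergens', 'cross', 'reactivity']
```

In a pipeline of the kind the abstract describes, the top-scoring words would become the input features of a classifier such as a decision tree (CART) or a neural network (ANN); that step is omitted here.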