Craven M, Kumlien J
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213-3891, USA.
Proc Int Conf Intell Syst Mol Biol. 1999:77-86.
Considerable recent effort has gone into making molecular biology databases more accessible and interoperable. However, information in text form, such as MEDLINE records, remains a greatly underutilized source of biological information. We have begun a research effort aimed at automatically mapping information from text sources into structured representations, such as knowledge bases. Our approach is to use machine-learning methods to induce routines for extracting facts from text. We describe two learning methods that we have applied to this task, a statistical text classification method and a relational learning method, and report our initial experiments in learning such information-extraction routines. We also present an approach to reducing the cost of learning information-extraction routines by learning from "weakly" labeled training data.
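To make the idea of learning an extraction routine from "weakly" labeled data more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that the statistical classifier is a bag-of-words Naive Bayes model, that the target relation links a protein to a subcellular location, and that a sentence is weakly labeled positive when it mentions both members of a pair already present in a knowledge base. The pairs, sentences, and all names (`weak_label`, `NaiveBayes`, `known_pairs`) are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weak_label(sentence, known_pairs):
    """Weak label (assumption): positive if the sentence mentions both
    members of any pair already recorded in the knowledge base."""
    tokens = set(tokenize(sentence))
    return any(a.lower() in tokens and b.lower() in tokens
               for a, b in known_pairs)


class NaiveBayes:
    """Bag-of-words Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, sentences, labels):
        for sentence, label in zip(sentences, labels):
            tokens = tokenize(sentence)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_score(self, sentence, label):
        total_docs = sum(self.doc_counts.values())
        # Smoothed log prior for the class.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(self.word_counts[label].values())
        for token in tokenize(sentence):
            if token in self.vocab:  # words never seen in training are skipped
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, sentence):
        return self.log_score(sentence, True) > self.log_score(sentence, False)


# Hypothetical knowledge-base entries: (protein, subcellular location).
known_pairs = [("actin", "cytoplasm"), ("histone", "nucleus")]

# Hypothetical sentences standing in for text from MEDLINE records.
sentences = [
    "Actin is localized in the cytoplasm of muscle cells.",
    "Histone proteins are localized in the nucleus.",
    "The assay was repeated three times with similar results.",
    "Samples were stored at low temperature before analysis.",
]

# No hand annotation: the labels come from the knowledge base alone.
labels = [weak_label(s, known_pairs) for s in sentences]
model = NaiveBayes().fit(sentences, labels)

# Classify a sentence about a protein that is not in the knowledge base.
print(model.predict("Tubulin is localized in the cytoplasm."))
```

The point of the sketch is that no sentence is annotated by hand: the knowledge base supplies the labels, and the classifier then generalizes to sentences about entities it has not seen. Such labels are noisy, since a sentence can mention both members of a pair without asserting the relation.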