Wilbur W John, Rzhetsky Andrey, Shatkay Hagit
National Center for Biotechnology Information, NLM, NIH, Bethesda, MD, USA.
BMC Bioinformatics. 2006 Jul 25;7:356. doi: 10.1186/1471-2105-7-356.
While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.
We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70-80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.
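The agreement figure above can be illustrated with a small sketch. The abstract does not say how agreement was computed, so the measure used here (mean pairwise percent agreement, computed separately for each dimension) is an assumption, as are the function name, the annotator IDs, and the label values in the toy data. Only the five dimension names come from the text.

```python
from itertools import combinations

# The five dimension names are taken from the abstract.
DIMENSIONS = ("focus", "polarity", "certainty", "evidence", "directionality")


def pairwise_agreement(annotations, dimension):
    """Mean pairwise percent agreement on one dimension (assumed measure).

    annotations maps an annotator ID to a list of per-sentence dicts,
    each mapping a dimension name to that annotator's label.
    """
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    pairs = list(combinations(sorted(annotations), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    total = 0.0
    for a, b in pairs:
        labels_a, labels_b = annotations[a], annotations[b]
        if not labels_a or len(labels_a) != len(labels_b):
            raise ValueError("annotators must label the same non-empty sentence set")
        # Count the sentences on which this pair chose the same label.
        matches = sum(x[dimension] == y[dimension]
                      for x, y in zip(labels_a, labels_b))
        total += matches / len(labels_a)
    return total / len(pairs)


# Hypothetical toy data: 3 annotators, 2 sentences, 2 of the 5 dimensions.
# The label values ("P", "N", "1", ...) are placeholders, not the guidelines' labels.
toy = {
    "ann1": [{"polarity": "P", "certainty": "3"}, {"polarity": "N", "certainty": "1"}],
    "ann2": [{"polarity": "P", "certainty": "3"}, {"polarity": "P", "certainty": "1"}],
    "ann3": [{"polarity": "P", "certainty": "2"}, {"polarity": "N", "certainty": "1"}],
}

for dim in ("polarity", "certainty"):
    print(dim, round(pairwise_agreement(toy, dim), 3))
```

For the study described above, the same function would be called with twelve annotators and 101 sentences per annotator. Raw percent agreement does not correct for chance agreement; a chance-corrected statistic such as Cohen's or Fleiss' kappa could be substituted if that is what is needed.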
We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.