IBM China Research Lab, Beijing 100193, China.
Adv Exp Med Biol. 2010;680:57-64. doi: 10.1007/978-1-4419-5913-3_7.
The ability to automatically extract chemical and biological entities and relations from text documents is of great value to biochemical research and development. The growing maturity of text mining and artificial intelligence technologies shows promise for enabling such automatic chemical entity extraction (called "Chemical Annotation" in this paper). Many techniques have been reported in the literature, ranging from dictionary- and rule-based techniques to machine learning approaches. In practice, we found that no single technique works well in all cases; a combinatorial approach that lets one quickly compose different annotation techniques for a given situation is most effective. In this paper, we describe the key challenges we face in real-world chemical annotation scenarios. We then present a solution called ChemBrowser, which provides a flexible framework for chemical annotation. ChemBrowser includes a suite of customizable processing units that can be used in a chemical annotator, a high-level language that describes how processing units are composed to form a chemical annotator, and an execution engine that translates the composition language into an actual annotator that generates annotation results for a given set of documents. We demonstrate the impact of this approach by tailoring an annotator for extracting chemical names from patent documents and show how this annotator can be easily modified through simple configuration alone.
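The three-part design described in the abstract (processing units, a composition description, and an execution engine) can be illustrated with a minimal sketch. This is not ChemBrowser's actual language or API, which the abstract does not specify. Every name here (`dictionary_unit`, `suffix_rule_unit`, `UNIT_REGISTRY`, `build_annotator`, `PATENT_SPEC`) is hypothetical, and a plain Python list of dictionaries stands in for the high-level composition language.

```python
import re
from typing import Callable, Dict, List, Tuple

Annotation = Tuple[int, int, str]          # (start offset, end offset, matched text)
Unit = Callable[[str], List[Annotation]]   # a processing unit maps text to annotations


def dictionary_unit(terms: List[str]) -> Unit:
    """Hypothetical unit: case-insensitive whole-word match against a term list."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    return lambda text: [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


def suffix_rule_unit(suffixes: List[str]) -> Unit:
    """Hypothetical unit: flag words that end in one of the given suffixes."""
    pattern = re.compile(
        r"\b\w+(?:" + "|".join(re.escape(s) for s in suffixes) + r")\b"
    )
    return lambda text: [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


# Assumed registry mapping unit names used in a spec to unit factories.
UNIT_REGISTRY: Dict[str, Callable[..., Unit]] = {
    "dictionary": dictionary_unit,
    "suffix_rule": suffix_rule_unit,
}


def build_annotator(spec: List[dict]) -> Unit:
    """Stand-in for the execution engine: turn a declarative spec into an annotator."""
    units = [UNIT_REGISTRY[step["unit"]](**step["args"]) for step in spec]

    def annotate(text: str) -> List[Annotation]:
        # Union the annotations of all units, dropping exact duplicates.
        found = {ann for unit in units for ann in unit(text)}
        return sorted(found)

    return annotate


# Stand-in for the composition language: changing this data changes the annotator.
PATENT_SPEC = [
    {"unit": "dictionary", "args": {"terms": ["aspirin", "benzene"]}},
    {"unit": "suffix_rule", "args": {"suffixes": ["ol", "ane"]}},
]

annotate = build_annotator(PATENT_SPEC)
result = annotate("The composition contains aspirin, ethanol and hexane.")
for start, end, name in result:
    print(start, end, name)
```

In this sketch, modifying the annotator means editing `PATENT_SPEC` only, for example by removing the `suffix_rule` step or supplying a different term list, which mirrors the abstract's claim that an annotator can be changed through configuration alone.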