Unilever Centre for Molecular Science Informatics, Department of Chemistry, Lensfield Road, Cambridge, CB2 1EW, UK.
J Cheminform. 2011 May 16;3(1):17. doi: 10.1186/1758-2946-3-17.
The primary method of scientific communication is the published article or thesis, written in natural language combined with domain-specific terminology. As such, these documents contain free-flowing, unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches.
We have developed the ChemicalTagger parser as a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments. Tagging is based on a modular architecture and uses a combination of OSCAR, domain-specific regex and English taggers to identify parts-of-speech. An ANTLR grammar is then used to structure the tagged tokens into tree-based phrases. Using a metric that allows for overlapping annotations, we achieved machine-annotator agreements of 88.9% for phrase recognition and 91.9% for phrase-type identification (Action names).
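The tag-then-group idea can be illustrated with a minimal Python sketch. It is not ChemicalTagger's implementation: the tag names, the regex rules, the `CHEMICAL_NAMES` lookup (a stand-in for a chemical entity recogniser such as OSCAR) and the functions `tag_tokens` and `group_action_phrases` are all hypothetical, and the flat grouping step is a much simpler substitute for a formal grammar that builds phrase trees.

```python
import re

# Hypothetical regex rules, tried in order. Illustrative only; these are
# not the rules or tag names used by ChemicalTagger.
REGEX_RULES = [
    ("CD", re.compile(r"^-?\d+(\.\d+)?$")),                    # cardinal number
    ("NN-VOL", re.compile(r"^(mL|ml|L)$")),                    # volume unit
    ("NN-TIME", re.compile(r"^(min|h|s)$")),                   # time unit
    ("VB-ADD", re.compile(r"^(add|added|adding)$", re.I)),     # "add" action verb
    ("VB-STIR", re.compile(r"^(stir|stirred|stirring)$", re.I)),  # "stir" action verb
]

# Assumption: a tiny lookup table standing in for a chemical name recogniser.
CHEMICAL_NAMES = {"ethanol", "water", "toluene"}


def tag_tokens(tokens):
    """Tag each token: chemical lookup first, then regex rules, else UNK."""
    tagged = []
    for token in tokens:
        tag = "UNK"
        if token.lower() in CHEMICAL_NAMES:
            tag = "CM"
        else:
            for name, pattern in REGEX_RULES:
                if pattern.match(token):
                    tag = name
                    break
        tagged.append((token, tag))
    return tagged


def group_action_phrases(tagged):
    """Start a new phrase at every action verb and attach the tokens that
    follow it. Tokens before the first action verb are ignored."""
    phrases = []
    current = None
    for token, tag in tagged:
        if tag.startswith("VB-"):
            current = {"action": tag[3:], "tokens": [token]}
            phrases.append(current)
        elif current is not None:
            current["tokens"].append(token)
    return phrases


sentence = "Ethanol ( 5 mL ) was added and the mixture was stirred for 30 min"
tagged = tag_tokens(sentence.split())
phrases = group_action_phrases(tagged)
```

Here `phrases` holds two entries, one labelled `ADD` and one labelled `STIR`. Trying the taggers in a fixed priority order is one simple way to combine several modular taggers; a real system would also need a general English part-of-speech tagger for the tokens this sketch leaves as `UNK`.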
It is possible to parse chemical experimental text using rule-based techniques in conjunction with a formal grammar parser. ChemicalTagger has been deployed on over 10,000 patents and has identified solvents from their linguistic context with >99.5% precision.