Unilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Rd., Cambridge CB2 1EW, UK.
J Chem Inf Model. 2010 Feb 22;50(2):251-61. doi: 10.1021/ci9003688.
The SPECTRa-T project has developed text-mining tools to extract named chemical entities (NCEs), such as chemical names and terms, and chemical objects (COs), e.g., experimental spectral assignments and physical chemistry properties, from electronic theses (e-theses). Although NCEs were readily identified within the two major document formats studied, only the use of structured documents enabled identification of chemical objects and their association with the relevant chemical entity (e.g., systematic chemical name). A corpus of theses was analyzed, and the analysis shows that a high degree of semantic information can be extracted from structured documents. This integrated information has been deposited in a persistent Resource Description Framework (RDF) triple-store that allows users to conduct semantic searches. The strengths and weaknesses of several document formats are reviewed.
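The association between a chemical entity and its chemical objects, and the kind of semantic search a triple-store makes possible, can be illustrated with a minimal sketch. It is not the SPECTRa-T implementation: the in-memory `TripleStore` class, the `ex:` predicate names (`ex:systematicName`, `ex:hasChemicalObject`), the object types, and the two example compounds are all assumptions chosen for illustration, and a real deployment would use a persistent RDF store and a query language such as SPARQL.

```python
from typing import Optional

Triple = tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Toy in-memory triple store (hypothetical; stands in for a persistent RDF store)."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> list[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


def build_example_store() -> TripleStore:
    """Populate the store with made-up illustrative data (not from the paper)."""
    store = TripleStore()
    # Named chemical entities, each with a systematic name.
    store.add("ex:compound1", "ex:systematicName", "benzoic acid")
    store.add("ex:compound2", "ex:systematicName", "ethyl acetate")
    # Chemical objects linked to the entity they describe.
    store.add("ex:compound1", "ex:hasChemicalObject", "ex:nmr1")
    store.add("ex:compound1", "ex:hasChemicalObject", "ex:mp1")
    store.add("ex:compound2", "ex:hasChemicalObject", "ex:nmr2")
    store.add("ex:nmr1", "rdf:type", "ex:NMRSpectrum")
    store.add("ex:nmr2", "rdf:type", "ex:NMRSpectrum")
    store.add("ex:mp1", "rdf:type", "ex:MeltingPoint")
    return store


def compounds_with_object(store: TripleStore, object_type: str) -> list[str]:
    """Semantic search: names of entities that have a chemical object of the given type."""
    names: set[str] = set()
    for chem_obj, _, _ in store.match(p="rdf:type", o=object_type):
        for entity, _, _ in store.match(p="ex:hasChemicalObject", o=chem_obj):
            for _, _, name in store.match(s=entity, p="ex:systematicName"):
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    store = build_example_store()
    print(compounds_with_object(store, "ex:NMRSpectrum"))
    print(compounds_with_object(store, "ex:MeltingPoint"))
```

The search follows links from object type to chemical object to entity to name, which is why the abstract stresses association: an identified name alone, without the link to its spectral or physical data, could not answer such a query.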