Townsend Joe A, Adams Sam E, Waudby Christopher A, de Souza Vanessa K, Goodman Jonathan M, Murray-Rust Peter
Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK.
Org Biomol Chem. 2004 Nov 21;2(22):3294-300. doi: 10.1039/b411033a. Epub 2004 Oct 20.
Automatically extracting chemical information from documents is a challenging task, but an essential one for dealing with the vast quantity of data that is available. The task is least difficult for structured documents, such as chemistry department web pages or the output of computational chemistry programs, but requires increasingly sophisticated approaches for less structured documents, such as chemical papers. The identification of key units of information, such as chemical names, makes the extraction of useful information from unstructured documents possible.