Filippov Igor V, Nicklaus Marc C
Laboratory of Medicinal Chemistry, SAIC-Frederick, Inc., NCI-Frederick, Frederick, Maryland 21702, USA.
J Chem Inf Model. 2009 Mar;49(3):740-3. doi: 10.1021/ci800067r.
Until recently, most scientific and patent documents dealing with chemistry have described molecular structures either with systematic names or with graphical images of Kekulé structures. The latter method poses inherent problems for the automated processing that is needed when the number of documents runs into the hundreds of thousands or even millions, since graphical representations cannot be directly interpreted by a computer. To recover this structural information, which is otherwise all but lost, we have built OSRA, an optical structure recognition application based on modern advances in image processing as implemented in open source tools. OSRA can read documents in over 90 graphical formats, including GIF, JPEG, PNG, TIFF, PDF, and PS; it automatically recognizes and extracts the graphical information representing chemical structures in such documents and generates a SMILES or SD representation of each molecular structure image it encounters.
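The workflow described in the abstract (a document image goes in, SMILES strings come out) might be driven from a script roughly as follows. This is a minimal sketch and not the authors' code. It assumes an executable named `osra` is on the PATH and that calling it with a single file path writes one recognized structure per line to standard output, with the SMILES string as the first token. The helper names `parse_smiles_lines` and `recognize_structures` are hypothetical and introduced only for illustration.

```python
import shutil
import subprocess
from typing import List, Optional


def parse_smiles_lines(stdout_text: str) -> List[str]:
    """Return the first whitespace-separated token of each non-empty line.

    Assumption: each output line starts with a SMILES string, possibly
    followed by extra fields that this sketch ignores.
    """
    smiles = []
    for line in stdout_text.splitlines():
        tokens = line.split()
        if tokens:
            smiles.append(tokens[0])
    return smiles


def recognize_structures(image_path: str, executable: str = "osra") -> Optional[List[str]]:
    """Run the recognizer on one document and collect its SMILES output.

    Returns None when the executable cannot be found, so callers can
    tell "tool not installed" apart from "no structures recognized".
    """
    if shutil.which(executable) is None:
        return None
    # Assumption: invoking the tool with only a file path prints SMILES to stdout.
    result = subprocess.run(
        [executable, image_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_smiles_lines(result.stdout)
```

Keeping the output parsing separate from the subprocess call means the parsing step can be checked on sample text without the tool installed, which helps when batch-processing the large document collections the abstract has in mind.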