Molecular Structure Extraction from Documents Using Deep Learning.

Author information

Schrödinger, Inc., 101 SW Main Street, Portland, Oregon 97204, United States.

Schrödinger, Inc., 120 West 45th Street, New York, New York 10036, United States.

Publication information

J Chem Inf Model. 2019 Mar 25;59(3):1017-1029. doi: 10.1021/acs.jcim.8b00669. Epub 2019 Feb 27.

DOI:10.1021/acs.jcim.8b00669

PMID:30758950

Abstract

Chemical structure extraction from documents remains a hard problem because of both false positive identification of structures during segmentation and errors in the predicted structures. Current approaches rely on handcrafted rules and subroutines that perform reasonably well generally but still routinely encounter situations where recognition rates are not yet satisfactory and systematic improvement is challenging. Complications impacting the performance of current approaches include the diversity in visual styles used by various software to render structures, the frequent use of ad hoc annotations, and other challenges related to image quality, including resolution and noise. We present end-to-end deep learning solutions for both segmenting molecular structures from documents and predicting chemical structures from the segmented images. This deep-learning-based approach does not require any handcrafted features, is learned directly from data, and is robust against variations in image quality and style. Using the deep learning approach described herein, we show that it is possible to perform well on both segmentation and prediction of low-resolution images containing moderately sized molecules found in journal articles and patents.
