TW2Informatics Ltd, Göteborg, 42166, Sweden.
J Cheminform. 2013 Apr 23;5(1):20. doi: 10.1186/1758-2946-5-20.
Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents), the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions about chemical structures in documents and where these overlap with database records. These aspects are illustrated using the common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.
Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions with chemicalize.org as the submitting source. A DPPIV medicinal chemistry paper was completely extracted, and its structures were aligned to the activity results table as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. Each text occurrence of a drug structure could be stepped through, and the extracted set included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and merged extractions.
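The set-intersection step can be sketched in a few lines of Python. This is only an illustration of the idea, not the workflow used in the study: it assumes each extraction has already been reduced to a set of canonical structure identifiers (for example InChIKeys), and the document names, the `KEY-...` identifiers and the helper `compounds_in_common` are hypothetical placeholders.

```python
from itertools import combinations


def compounds_in_common(extractions):
    """Return, for every pair of documents, the identifiers they share.

    `extractions` maps a document label to a set of structure identifiers
    (assumed here to be canonical, e.g. InChIKeys).
    """
    shared = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(extractions.items(), 2):
        shared[(name_a, name_b)] = set(ids_a) & set(ids_b)
    return shared


# Placeholder identifiers standing in for real extracted structures.
extractions = {
    "patent": {"KEY-0001", "KEY-0002", "KEY-0003"},
    "paper": {"KEY-0002", "KEY-0003", "KEY-0004"},
    "abstracts": {"KEY-0003", "KEY-0005"},
}

# Pairwise overlaps between documents.
overlaps = compounds_in_common(extractions)

# Compounds present in every extraction.
in_all = set.intersection(*extractions.values())

# Merged extraction: the union of all documents.
merged = set.union(*extractions.values())
```

Comparing identifiers rather than raw names or SMILES strings matters here, because the same compound can be written in many ways in text; the intersection is only meaningful once each structure has a single canonical key.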
This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.