开启文档、结构与生物活性之间的连通性。

Opening up connectivity between documents, structures and bioactivity.

作者信息

Southan Christopher

机构信息

Deanery of Biomedical Sciences, University of Edinburgh, Edinburgh, EH8 9XD, UK.

TW2Informatics Ltd, Västra Frölunda, Gothenburg, 42166, Sweden.

出版信息

Beilstein J Org Chem. 2020 Apr 2;16:596-606. doi: 10.3762/bjoc.16.54. eCollection 2020.

DOI:10.3762/bjoc.16.54

PMID:32280387

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC7136548/

Abstract

Bioscientists reading papers or patents strive to discern the key relationships reported within a document "D" where a bioactivity "A" with a quantitative result "R" (e.g., an IC) is reported for chemical structure "C" that modulates (e.g., inhibits) a protein target "P". A useful shorthand for this connectivity thus becomes DARCP. The problem at the core of this article is that the community has spent millions effectively burying these relationships in PDFs over many decades but must now spend millions more trying to get them back out. The key imperative for this is to increase the flow into structured open databases. The positive impacts will include expanded data mining opportunities for drug discovery and chemical biology. Over the last decade commercial sources have manually extracted DARCP from ≈300,000 documents encompassing ≈7 million compounds interacting with ≈10,000 targets. Over a similar time, the Guide to Pharmacology, BindingDB and ChEMBL have carried out analogues DARCP extractions. Although their expert-curated numbers are lower (i.e., ≈2 million compounds against ≈3700 human proteins), these open sources have the great advantage of being merged within PubChem. Parallel efforts have focused on the extraction of document-to-compound (D-C-only) connectivity. In the absence of molecular mechanism of action (mmoa) annotation, this is of less value but can be automatically extracted. This has been significantly accomplished for patents, (e.g., by IBM, SureChEMBL and WIPO) for over 30 million compounds in PubChem. These have recently been joined by 1.4 million D-C submissions from three major chemistry publishers. In addition, both the European and US PubMed Central portals now add chemistry look-ups from abstracts and full-text papers. However, the fully automated extraction of DARCLP has not yet been achieved. This stands in contrast to the ability of biocurators to discern these relationships in minutes. Unfortunately, no journals have yet instigated a flow of author-specified DARCP directly into open databases. Progress may come from trends such as open science, open access (OA), findable, accessible, interoperable and reusable (FAIR), resource description framework (RDF) and WikiData. However, we will need to await the technical applicability in respect to DARCP capture to see if this opens up connectivity.

摘要

阅读论文或专利的生物科学家们努力在一份文献“D”中识别所报告的关键关系，在该文献中，针对调节（如抑制）蛋白质靶点“P”的化学结构“C”，报告了具有定量结果“R”（如IC）的生物活性“A”。因此，这种关联关系的一种有用的简写形式就变成了DARCP。本文核心的问题是，几十年来，科学界实际上已经花费了数百万资金将这些关系埋没在PDF文件中，但现在又必须花费数百万资金试图将它们重新挖掘出来。对此的关键要求是增加流入结构化开放数据库的数据量。积极影响将包括为药物发现和化学生物学带来更多的数据挖掘机会。在过去十年中，商业机构已从约30万份文献中人工提取了DARCP，这些文献涵盖了约700万种与约10000个靶点相互作用的化合物。在类似的时间段内，《药理学指南》《BindingDB》和《ChEMBL》也进行了类似的DARCP提取。尽管它们经专家整理的数据量较少（即约200万种化合物针对约3700种人类蛋白质），但这些开源数据具有可在PubChem中合并的巨大优势。并行的努力集中在文献与化合物（仅D-C）关联关系的提取上。在缺乏作用分子机制（mmoa）注释的情况下，这一价值较小，但可以自动提取。对于专利，这一点已经有了显著成果（例如由IBM、SureChEMBL和世界知识产权组织完成），涉及PubChem中的3000多万种化合物。最近，来自三大化学出版商的140万条D-C提交数据也加入进来。此外，欧洲和美国的PubMed Central门户现在也增加了从摘要和全文论文中进行化学查询的功能。然而，DARCLP的全自动提取尚未实现。这与生物编目人员在几分钟内就能识别这些关系的能力形成了对比。不幸的是，目前还没有期刊促使作者指定的DARCP直接流入开放数据库。进展可能来自开放科学、开放获取（OA）、可查找、可访问、可互操作和可重用（FAIR）、资源描述框架（RDF）和维基数据等趋势。然而，我们需要等待关于DARCP捕获的技术适用性，看看这是否能开启关联关系。