Department of Electrical Engineering, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 70101, Taiwan.
Bioinformatics. 2012 Aug 15;28(16):2162-8. doi: 10.1093/bioinformatics/bts367. Epub 2012 Jul 2.
Determining the binding affinity of a protein-ligand complex is important for quantifying whether a particular small molecule will bind to a target protein. In addition, comprehensive datasets of protein-ligand complexes and their binding affinities are crucial for developing accurate scoring functions that predict the binding affinities of previously unknown complexes. Over the past decades, several databases of protein-ligand binding affinities have been built by having curators read the literature and extract the values by hand. Such approaches are time-consuming, and most of these databases are updated only a few times per year. Hence, there is a pressing need for a high-precision automatic extraction method for collecting binding affinities.
We have created a new database of protein-ligand binding affinity data, AutoBind, based on automatic information retrieval. We first compiled a collection of 1586 articles in which the binding affinities had been marked manually. Based on this annotated collection, we designed four sentence patterns for scanning full-text articles, together with a scoring function that ranks the sentences matching those patterns. The proposed sentence patterns can effectively identify binding affinities in full-text articles. Our assessment shows that AutoBind achieved 84.22% precision and 79.07% recall on the testing corpus. Currently, 13,616 protein-ligand complexes and their corresponding binding affinities, drawn from 17,221 articles, have been deposited in AutoBind.
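The abstract does not list the four sentence patterns or define the scoring function, so the following is only a minimal sketch of the general idea, not AutoBind's actual method. The regular expression, the cue-word list, and the function names `extract_affinities`, `score_sentence`, and `precision_recall` are all assumptions made for illustration. The last function shows how precision and recall of the kind reported above are computed from a set of predicted items and a manually annotated gold set.

```python
import re

# Hypothetical pattern (not one of AutoBind's four): an affinity type,
# an optional "value(s)", a linking word or "=", a number, and a molar unit.
AFFINITY = re.compile(
    r"\b(?P<kind>K[di]|IC50|EC50)\b"
    r"(?:\s+values?)?\s*(?:of|=|was|is)\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[pnum\u00b5\u03bc]M)\b"
)

# Hypothetical cue words used to rank matching sentences.
CUE_WORDS = ("bind", "affinity", "inhibit", "dissociation")


def extract_affinities(sentence):
    """Return (kind, value, unit) tuples for every pattern match in a sentence."""
    return [
        (m["kind"], float(m["value"]), m["unit"])
        for m in AFFINITY.finditer(sentence)
    ]


def score_sentence(sentence):
    """Toy score: number of matches plus number of cue words present.

    Sentences without any pattern match score 0.
    """
    hits = extract_affinities(sentence)
    if not hits:
        return 0
    lowered = sentence.lower()
    return len(hits) + sum(cue in lowered for cue in CUE_WORDS)


def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of extracted items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    text = "The inhibitor bound the enzyme with a Kd of 4.2 nM."
    print(extract_affinities(text))
    print(score_sentence(text))
    print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
```

In a sketch like this, candidate sentences from a full-text article would be ranked by `score_sentence`, with the top-ranked matches passed on for manual review.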
AutoBind is updated automatically every month and is freely available at http://autobind.csie.ncku.edu.tw/ and http://autobind.mc.ntu.edu.tw/. All deposited binding affinities are manually refined and approved before release.