George Papadatos, Anna Gaulton, Anne Hersey, John P. Overington
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.
J Comput Aided Mol Des. 2015 Sep;29(9):885-96. doi: 10.1007/s10822-015-9860-5. Epub 2015 Jul 23.
The emergence of a number of publicly available bioactivity databases, such as ChEMBL, PubChem BioAssay and BindingDB, has raised awareness of data curation, quality and integrity. Here we provide an overview and discussion of the current and future approaches to activity, assay and target data curation in the ChEMBL database. This curation process involves several manual and automated steps and aims to: (1) maximise data accessibility and comparability; (2) improve data integrity and flag outliers, ambiguities and potential errors; and (3) add further curated annotations and mappings, thus increasing the usefulness and accuracy of the ChEMBL data for all users, and for modellers in particular. Issues related to activity, assay and target data curation and integrity are discussed along with their potential impact on users of the data, alongside robust selection and filter strategies to avoid or minimise these issues, depending on the desired application.
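The abstract refers to selection and filter strategies without listing them. As an illustration only, the following minimal Python sketch shows what one conservative filter over exported activity records might look like. The field names (`data_validity_comment`, `potential_duplicate`, `standard_relation`, `pchembl_value`, `confidence_score`) are assumed to follow the ChEMBL schema; the function `select_reliable`, the threshold of 9 and the sample records are hypothetical and are not taken from the paper.

```python
def select_reliable(records, min_confidence=9):
    """Keep activity records that pass a few conservative checks.

    Assumption: each record is a dict whose keys mirror ChEMBL column names.
    The checks and the default threshold are illustrative, not prescribed.
    """
    kept = []
    for rec in records:
        if rec.get("data_validity_comment"):
            continue  # flagged by curation as an outlier or possible error
        if rec.get("potential_duplicate"):
            continue  # likely a repeated citation of an earlier measurement
        if rec.get("standard_relation") != "=":
            continue  # censored values (">", "<") are not directly comparable
        if rec.get("pchembl_value") is None:
            continue  # no comparable negative-log activity value available
        if (rec.get("confidence_score") or 0) < min_confidence:
            continue  # target assignment is less certain than required
        kept.append(rec)
    return kept


# Made-up records, for illustration only.
records = [
    {"activity_id": 1, "standard_relation": "=", "pchembl_value": 7.2,
     "data_validity_comment": None, "potential_duplicate": 0, "confidence_score": 9},
    {"activity_id": 2, "standard_relation": "=", "pchembl_value": 5.1,
     "data_validity_comment": "Outside typical range", "potential_duplicate": 0,
     "confidence_score": 9},
    {"activity_id": 3, "standard_relation": ">", "pchembl_value": None,
     "data_validity_comment": None, "potential_duplicate": 0, "confidence_score": 9},
    {"activity_id": 4, "standard_relation": "=", "pchembl_value": 6.0,
     "data_validity_comment": None, "potential_duplicate": 1, "confidence_score": 9},
    {"activity_id": 5, "standard_relation": "=", "pchembl_value": 6.5,
     "data_validity_comment": None, "potential_duplicate": 0, "confidence_score": 4},
]

kept_ids = [rec["activity_id"] for rec in select_reliable(records)]
print(kept_ids)
```

How strict such a filter should be depends on the application, as the abstract notes: a modeller building a single-target model might keep the threshold high, while a broader survey could lower `min_confidence` and accept more records.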