Sun Jiangming, Jeliazkova Nina, Chupakin Vladimir, Golib-Dzib Jose-Felipe, Engkvist Ola, Carlsson Lars, Wegner Jörg, Ceulemans Hugo, Georgiev Ivan, Jeliazkov Vedrin, Kochev Nikolay, Ashby Thomas J, Chen Hongming
Discovery Sciences, Innovative Medicines and Early Development Biotech Unit, AstraZeneca R&D Gothenburg, 43183 Mölndal, Sweden.
Ideaconsult Ltd., 4. Angel Kanchev Str., 1000 Sofia, Bulgaria.
J Cheminform. 2017 Mar 7;9:17. doi: 10.1186/s13321-017-0203-5. eCollection 2017.
Chemogenomics data generally refers to the activity data of chemical compounds on an array of protein targets and represents an important source of information for building target prediction models. The increasing volume of chemogenomics data offers exciting opportunities to build models based on Big Data. Preparing a high-quality dataset is a vital step in realizing this goal, and this work aims to compile such a comprehensive chemogenomics dataset. The dataset comprises over 70 million structure-activity relationship (SAR) data points from publicly available databases (PubChem and ChEMBL), including structure, target information and activity annotations. Our aspiration is to create a useful chemogenomics resource reflecting industry-scale data, not only for building predictive models of polypharmacology and off-target effects but also for validating cheminformatics approaches in general.