Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany.
J Chem Inf Model. 2014 Nov 24;54(11):3056-66. doi: 10.1021/ci5005509. Epub 2014 Oct 31.
Compound activity data grow at unprecedented rates, and their complexity increases. This challenges compound data mining efforts and makes it difficult to draw reliable conclusions from data analysis. We have aimed to investigate the influence of individual parameters and data confidence levels on compound selection and property assessment. Therefore, alternative sets of bioactive compounds were systematically extracted from ChEMBL on the basis of iteratively expanding selection criteria with increasing stringency covering a variety of search parameters. The sequential application of criteria for the selection of high-confidence compound data was order-independent, as expected. Furthermore, the influence of separately applied selection criteria was analyzed. Criteria that largely influenced compound selection and compound promiscuity rates were identified. In the presence of stringent selection criteria and high data confidence, many compounds with likely assay artifacts or liabilities were eliminated from further consideration. Taken together, the findings of our analysis emphasize the need to carefully consider search parameters related to target organisms, confidence level of activity, and activity measurements and suggest reliable protocols for compound data mining.
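The selection scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol: the toy records, the field names (`organism`, `confidence`, `type`, `relation`), and the four criteria are assumptions chosen for illustration and do not reproduce the ChEMBL schema or the thresholds used in the study. The sketch applies the criteria sequentially in every possible order, checks that the final high-confidence set is the same each time, and compares compound promiscuity (number of distinct targets per compound) before and after filtering.

```python
from collections import defaultdict
from itertools import permutations

# Toy activity records; all field names and values are illustrative assumptions.
RECORDS = [
    {"compound": "C1", "target": "T1", "organism": "Homo sapiens", "confidence": 9, "type": "Ki", "relation": "="},
    {"compound": "C1", "target": "T2", "organism": "Homo sapiens", "confidence": 9, "type": "Ki", "relation": "="},
    {"compound": "C1", "target": "T3", "organism": "Rattus norvegicus", "confidence": 9, "type": "Ki", "relation": "="},
    {"compound": "C2", "target": "T1", "organism": "Homo sapiens", "confidence": 4, "type": "IC50", "relation": "="},
    {"compound": "C2", "target": "T4", "organism": "Homo sapiens", "confidence": 9, "type": "Ki", "relation": ">"},
    {"compound": "C3", "target": "T2", "organism": "Homo sapiens", "confidence": 9, "type": "Ki", "relation": "="},
]

# Hypothetical selection criteria, each a predicate over a single record.
CRITERIA = {
    "human_target": lambda r: r["organism"] == "Homo sapiens",
    "max_confidence": lambda r: r["confidence"] == 9,
    "ki_only": lambda r: r["type"] == "Ki",
    "exact_value": lambda r: r["relation"] == "=",
}


def apply_in_order(records, names):
    """Apply the named criteria one after another.

    Returns the surviving records and the record count after each step.
    """
    counts = [len(records)]
    for name in names:
        records = [r for r in records if CRITERIA[name](r)]
        counts.append(len(records))
    return records, counts


def promiscuity(records):
    """Number of distinct targets each compound is reported active against."""
    targets = defaultdict(set)
    for r in records:
        targets[r["compound"]].add(r["target"])
    return {compound: len(ts) for compound, ts in targets.items()}


def pairs(records):
    """Canonical, order-free view of a record set as (compound, target) pairs."""
    return tuple(sorted((r["compound"], r["target"]) for r in records))


# Sequential application in every order: the final set should not depend on order,
# because each criterion is a pure per-record filter.
final_sets = {pairs(apply_in_order(RECORDS, order)[0]) for order in permutations(CRITERIA)}
order_independent = len(final_sets) == 1

# Separate application: how many records each criterion retains on its own.
retained_alone = {name: len(apply_in_order(RECORDS, [name])[0]) for name in CRITERIA}

high_confidence, step_counts = apply_in_order(RECORDS, list(CRITERIA))
promiscuity_before = promiscuity(RECORDS)
promiscuity_after = promiscuity(high_confidence)
```

In this toy data, compound C1 drops from three targets to two once the organism criterion is applied, and C2 disappears entirely, which mirrors in miniature how stringent criteria can change apparent promiscuity. The intermediate counts in `step_counts` do depend on the order of application; only the final set is order-independent.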