Harvesting classification trees for drug discovery.

Affiliation

Department of Public Health Sciences, University of Alberta, Edmonton, Alberta T6G 1C9, Canada.

Publication information

J Chem Inf Model. 2012 Dec 21;52(12):3169-80. doi: 10.1021/ci3000216. Epub 2012 Nov 16.

Abstract

Millions of compounds are available as potential drug candidates. High throughput screening (HTS) is widely used in drug discovery to assay compounds for a particular biological activity. A common approach is to build a classification model using a smaller sample of assay data to predict the activity of unscreened compounds and hence select further compounds for assay. This improves the efficiency of the search by increasing the proportion of hits found among the assayed compounds. In many assays, the biological activity is dichotomized into a binary indicator variable; the explanatory variables are chemical descriptors capturing compound structure. A tree model is interpretable, which is key, since it is of interest to identify diverse chemical classes among the active compounds to serve as leads for drug optimization. Interpretability of a tree is often reduced, however, by the sheer size of the tree model and the number of variables and rules of the terminal nodes. We develop a "tree harvesting" algorithm to filter out redundant "junk" rules from the tree while retaining its predictive accuracy. This simplification can facilitate the process of uncovering key relations between molecular structure and activity and may clarify rules defining multiple activity mechanisms. Using data from the National Cancer Institute, we illustrate that many of the rules used to build a classification tree may be redundant. Unlike tree pruning, tree harvesting allows variables with junk rules to be removed near the top of the tree. The reduction in complexity of the terminal nodes improves the interpretability of the model. The algorithm also aims to reorganize the tree nodes associated with the interesting "active" class into larger, more coherent groups, thus facilitating identification of the mechanisms for activity.
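The abstract does not spell out the harvesting procedure, so the following is only a minimal sketch of the general idea of dropping redundant "junk" conditions from the rule that defines an active terminal node. It assumes a rule is a conjunction of threshold tests on chemical descriptors and that a condition counts as junk when removing it does not lower the proportion of actives among the covered compounds. The function names, the `tolerance` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# A condition is (descriptor_index, operator, threshold); a rule is a conjunction.
Condition = Tuple[int, str, float]


def satisfies(row: Sequence[float], rule: List[Condition]) -> bool:
    """Return True if the compound's descriptor vector meets every condition."""
    return all(
        (row[i] <= t) if op == "<=" else (row[i] > t)
        for i, op, t in rule
    )


def hit_rate(X, y, rule) -> Tuple[float, int]:
    """Return (proportion of actives among covered compounds, number covered)."""
    covered = [label for row, label in zip(X, y) if satisfies(row, rule)]
    if not covered:
        return 0.0, 0
    return sum(covered) / len(covered), len(covered)


def harvest_rule(X, y, rule, tolerance=0.0):
    """Greedily drop conditions whose removal keeps the hit rate within
    `tolerance` of the original rule's hit rate (illustrative criterion only)."""
    current = list(rule)
    base_rate, _ = hit_rate(X, y, current)
    changed = True
    while changed and len(current) > 1:
        changed = False
        for k in range(len(current)):
            candidate = current[:k] + current[k + 1:]
            rate, _ = hit_rate(X, y, candidate)
            if rate >= base_rate - tolerance:
                current = candidate
                changed = True
                break
    return current


# Toy data (made up): activity depends only on descriptor 0 being above 0.5.
X = [
    [0.9, 0.2, 0.40],
    [0.8, 0.9, 0.05],
    [0.7, 0.5, 0.60],
    [0.6, 0.8, 0.20],
    [0.2, 0.3, 0.50],
    [0.1, 0.9, 0.30],
    [0.4, 0.1, 0.05],
    [0.3, 0.6, 0.90],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# A terminal-node rule as a tree might produce it, with two junk conditions.
rule = [(1, "<=", 0.7), (0, ">", 0.5), (2, ">", 0.1)]
simplified = harvest_rule(X, y, rule)

print(hit_rate(X, y, rule))        # (1.0, 2)
print(simplified)                  # [(0, '>', 0.5)]
print(hit_rate(X, y, simplified))  # (1.0, 4)
```

On this toy set the simplified rule keeps the same hit rate while covering all four actives instead of two. That loosely mirrors the abstract's stated aim of regrouping active nodes into larger, more coherent groups, although the published algorithm may use a different criterion.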

Similar articles

Harvesting classification trees for drug discovery.

J Chem Inf Model. 2012 Dec 21;52(12):3169-80. doi: 10.1021/ci3000216. Epub 2012 Nov 16.

Biodiversity of small molecules--a new perspective in screening set selection.

Drug Discov Today. 2013 Jul;18(13-14):674-80. doi: 10.1016/j.drudis.2013.02.005. Epub 2013 Feb 20.

Profile-QSAR: a novel meta-QSAR method that combines activities across the kinase family to accurately predict affinity, selectivity, and cellular activity.

J Chem Inf Model. 2011 Aug 22;51(8):1942-56. doi: 10.1021/ci1005004. Epub 2011 Jul 19.

Trade-off between accuracy and interpretability for predictive in silico modeling.

Future Med Chem. 2011 Apr;3(6):647-63. doi: 10.4155/fmc.11.23.

Discovery of novel anti-inflammatory drug-like compounds by aligning in silico and in vivo screening: the nitroindazolinone chemotype.

Eur J Med Chem. 2011 Dec;46(12):5736-53. doi: 10.1016/j.ejmech.2011.07.053. Epub 2011 Aug 17.

Structural alert/reactive metabolite concept as applied in medicinal chemistry to mitigate the risk of idiosyncratic drug toxicity: a perspective based on the critical examination of trends in the top 200 drugs marketed in the United States.

Chem Res Toxicol. 2011 Sep 19;24(9):1345-410. doi: 10.1021/tx200168d. Epub 2011 Jul 11.

Rules for identifying potentially reactive or promiscuous compounds.

J Med Chem. 2012 Nov 26;55(22):9763-72. doi: 10.1021/jm301008n. Epub 2012 Oct 25.

A comprehensive support vector machine binary hERG classification model based on extensive but biased end point hERG data sets.

Chem Res Toxicol. 2011 Jun 20;24(6):934-49. doi: 10.1021/tx200099j. Epub 2011 May 6.

Early phase drug discovery: cheminformatics and computational techniques in identifying lead series.

Bioorg Med Chem. 2012 Sep 15;20(18):5324-42. doi: 10.1016/j.bmc.2012.04.062. Epub 2012 May 5.

Classification of anti-HIV compounds using counterpropagation artificial neural networks and decision trees.

SAR QSAR Environ Res. 2011 Oct;22(7-8):639-60. doi: 10.1080/1062936X.2011.623318. Epub 2011 Oct 14.

Cited by

The Experimentalist's Guide to Machine Learning for Small Molecule Design.

ACS Appl Bio Mater. 2024 Feb 19;7(2):657-684. doi: 10.1021/acsabm.3c00054. Epub 2023 Aug 3.
