Department of Public Health Sciences, University of Alberta, Edmonton, Alberta T6G 1C9, Canada.
J Chem Inf Model. 2012 Dec 21;52(12):3169-80. doi: 10.1021/ci3000216. Epub 2012 Nov 16.
Millions of compounds are available as potential drug candidates. High-throughput screening (HTS) is widely used in drug discovery to assay compounds for a particular biological activity. A common approach is to build a classification model from a smaller sample of assay data, use it to predict the activity of unscreened compounds, and hence select further compounds for assay. This improves the efficiency of the search by increasing the proportion of hits found among the assayed compounds. In many assays, the biological activity is dichotomized into a binary indicator variable; the explanatory variables are chemical descriptors capturing compound structure. A tree model is interpretable, which is key because identifying diverse chemical classes among the active compounds provides leads for drug optimization. The interpretability of a tree is often reduced, however, by the sheer size of the tree model and by the number of variables and rules that define its terminal nodes. We develop a "tree harvesting" algorithm to filter out redundant "junk" rules from the tree while retaining its predictive accuracy. This simplification can facilitate the process of uncovering key relations between molecular structure and activity and may clarify rules defining multiple activity mechanisms. Using data from the National Cancer Institute, we illustrate that many of the rules used to build a classification tree may be redundant. Unlike tree pruning, tree harvesting allows variables with junk rules to be removed near the top of the tree. The resulting reduction in the complexity of the terminal nodes improves the interpretability of the model. The algorithm also aims to reorganize the tree nodes associated with the interesting "active" class into larger, more coherent groups, thus facilitating identification of the mechanisms for activity.
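The idea of a redundant "junk" rule can be illustrated with a small sketch. The abstract does not specify the tree harvesting algorithm, so the code below is not the authors' method. It is a simplified stand-in that treats one terminal node as a conjunction of threshold conditions and greedily drops any condition whose removal does not lower the hit rate among the compounds the node covers. The descriptor names (`mass`, `rings`, `logp`), the toy data, and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple

Compound = Dict[str, float]
Condition = Tuple[str, str, float]   # (descriptor, "<=" or ">", threshold)
Row = Tuple[Compound, int]           # (descriptors, 1 if active else 0)


def satisfies(compound: Compound, condition: Condition) -> bool:
    """Check one threshold condition against one compound."""
    name, op, threshold = condition
    value = compound[name]
    return value <= threshold if op == "<=" else value > threshold


def covered(data: List[Row], rule: List[Condition]) -> List[Row]:
    """Rows that satisfy every condition of the rule (a terminal node)."""
    return [(x, y) for x, y in data if all(satisfies(x, c) for c in rule)]


def hit_rate(rows: List[Row]) -> float:
    """Proportion of active compounds among the given rows."""
    return sum(y for _, y in rows) / len(rows) if rows else 0.0


def drop_junk_conditions(data: List[Row], rule: List[Condition],
                         tolerance: float = 0.0) -> List[Condition]:
    """Greedily remove conditions whose removal keeps the hit rate
    within `tolerance` of the original rule's hit rate.
    Illustrative assumption only, not the published algorithm."""
    kept = list(rule)
    baseline = hit_rate(covered(data, kept))
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            trial = kept[:i] + kept[i + 1:]
            if hit_rate(covered(data, trial)) >= baseline - tolerance:
                kept = trial
                changed = True
                break  # restart the scan on the shortened rule
    return kept


# Made-up toy data: activity depends only on the ring count.
DATA: List[Row] = [
    ({"mass": 300, "rings": 3, "logp": 2}, 1),
    ({"mass": 600, "rings": 3, "logp": 6}, 1),
    ({"mass": 300, "rings": 1, "logp": 2}, 0),
    ({"mass": 600, "rings": 1, "logp": 6}, 0),
    ({"mass": 450, "rings": 4, "logp": 4}, 1),
    ({"mass": 450, "rings": 2, "logp": 4}, 0),
]
RULE: List[Condition] = [("mass", "<=", 500), ("rings", ">", 2), ("logp", "<=", 5)]

SIMPLIFIED = drop_junk_conditions(DATA, RULE)
print(SIMPLIFIED)
```

On this toy data the `mass` and `logp` conditions can be dropped without lowering the node's hit rate, while the `rings` condition must stay. The shortened rule also covers more active compounds, which loosely mirrors the stated aim of gathering active nodes into larger, more coherent groups.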