Indiana University School of Informatics and Computing, Bloomington, IN, 47408, USA.
J Cheminform. 2012 Nov 23;4(1):29. doi: 10.1186/1758-2946-4-29.
Relating chemical features to bioactivities is critical in molecular design and is used extensively in the lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a family of methods known as associative classification mining (ACM), which is popular in the data mining community but has not yet been widely applied in cheminformatics. Specifically, classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are applied to three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high-speed mining. Additionally, they provide accuracy and efficiency comparable to those of the commonly used Bayesian and support vector machine (SVM) methods, and produce highly interpretable models.
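To illustrate the general idea behind associative classification, the following is a minimal Python sketch, not the implementation used in the study. It enumerates rules of the form "feature set -> class" that meet support and confidence thresholds, ranks them by confidence and support, and classifies a compound with the first matching rule, which is loosely in the spirit of CBA. The feature names, class labels, thresholds and the functions `mine_class_rules` and `classify` are all hypothetical, and the brute-force enumeration stands in for the Apriori-style mining and rule pruning that real ACM methods rely on.

```python
from collections import Counter
from itertools import combinations


def mine_class_rules(rows, labels, min_support=0.3, min_confidence=0.7, max_len=2):
    """Enumerate rules (feature set -> class) meeting both thresholds.

    Brute-force enumeration for illustration only; no pruning is done.
    """
    n = len(rows)
    itemset_counts = Counter()  # how often each feature set occurs
    rule_counts = Counter()     # how often each feature set occurs with a class
    for features, label in zip(rows, labels):
        items = sorted(features)
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                itemset_counts[combo] += 1
                rule_counts[(combo, label)] += 1

    rules = []
    for (combo, label), hits in rule_counts.items():
        support = hits / n
        confidence = hits / itemset_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((frozenset(combo), label, support, confidence))

    # Rank by confidence, then support, then shorter antecedent.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


def classify(rules, features, default):
    """Return the class of the highest-ranked rule whose antecedent matches."""
    for antecedent, label, _support, _confidence in rules:
        if antecedent <= features:
            return label
    return default


# Hypothetical toy data: each compound is a set of substructure flags.
rows = [
    {"nitro", "aromatic_ring"},
    {"nitro", "amine"},
    {"nitro", "aromatic_ring", "halogen"},
    {"aromatic_ring", "hydroxyl"},
    {"hydroxyl", "amine"},
    {"aromatic_ring", "halogen"},
]
labels = ["mutagenic"] * 3 + ["non_mutagenic"] * 3

rules = mine_class_rules(rows, labels)
for antecedent, label, support, confidence in rules:
    print(sorted(antecedent), "->", label, round(support, 2), round(confidence, 2))

print(classify(rules, {"nitro", "halogen"}, default="non_mutagenic"))
```

Each mined rule can be read directly as "compounds with these features tend to belong to this class", which is the sense in which such models are interpretable.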