Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK.
J Chem Inf Model. 2012 Oct 22;52(10):2494-500. doi: 10.1021/ci200303m. Epub 2012 Sep 17.
A plethora of articles on naive Bayes classifiers, where the chemical compounds to be classified are represented by binary-valued (absent or present type) descriptors, have appeared in the cheminformatics literature over the past decade. The principal goal of this paper is to describe how a naive Bayes classifier based on binary descriptors (NBCBBD) can be employed as a feature selector in an efficient manner suitable for cheminformatics. In the process, we point out a fact well documented in other disciplines that NBCBBD is a linear classifier and is therefore intrinsically suboptimal for classifying compounds that are nonlinearly separable in their binary descriptor space. We investigate the performance of the proposed algorithm on classifying a subset of the MDDR data set, a standard molecular benchmark data set, into active and inactive compounds.
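The linearity claim can be checked numerically. The sketch below is an illustration and not the algorithm proposed in the paper. It assumes a two-class problem, Laplace smoothing with a hypothetical parameter `alpha`, and synthetic 16-bit fingerprints in place of MDDR data. All function names are hypothetical. Writing p_i and q_i for the probability that bit i is set in active and inactive compounds, the naive Bayes log-odds equals w·x + b, with w_i = log[p_i(1 - q_i) / (q_i(1 - p_i))]. Ranking bits by |w_i| is one simple way to use such a classifier as a feature selector.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate per-class bit frequencies (alpha: assumed Laplace smoothing)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    p = (pos.sum(axis=0) + alpha) / (len(pos) + 2 * alpha)  # P(bit=1 | active)
    q = (neg.sum(axis=0) + alpha) / (len(neg) + 2 * alpha)  # P(bit=1 | inactive)
    prior_pos = len(pos) / len(X)
    return p, q, prior_pos

def linear_form(p, q, prior_pos):
    """Rewrite the naive Bayes log-odds as w.x + b."""
    w = np.log(p / q) - np.log((1 - p) / (1 - q))
    b = np.log(prior_pos / (1 - prior_pos)) + np.sum(np.log((1 - p) / (1 - q)))
    return w, b

def log_odds_direct(X, p, q, prior_pos):
    """Log-odds computed straight from the naive Bayes product rule."""
    X = np.asarray(X, dtype=float)
    ll_pos = X @ np.log(p) + (1 - X) @ np.log(1 - p)
    ll_neg = X @ np.log(q) + (1 - X) @ np.log(1 - q)
    return np.log(prior_pos / (1 - prior_pos)) + ll_pos - ll_neg

def rank_features(w, k):
    """Indices of the k bits with the largest absolute weight."""
    return np.argsort(-np.abs(w))[:k]

# Synthetic data (an assumption): activity is determined by bit 3 alone.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 16))
y = X[:, 3]

p, q, prior_pos = fit_bernoulli_nb(X, y)
w, b = linear_form(p, q, prior_pos)
scores_linear = X @ w + b
scores_direct = log_odds_direct(X, p, q, prior_pos)
top = rank_features(w, 3)
print("linear and direct log-odds agree:", np.allclose(scores_linear, scores_direct))
print("top-ranked bits:", top)
```

Because the decision boundary w·x + b = 0 is a hyperplane, labels defined by an XOR of two bits, for example, cannot be separated by this classifier, however much training data is available.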