Smart Technology Research Centre, Bournemouth University, Poole House, Talbot Campus, Poole, Dorset, BH12 5BB, UK.
J Cheminform. 2009 Dec 22;1:21. doi: 10.1186/1758-2946-1-21.
There are three main problems associated with the virtual screening of bioassay data. The first is access to freely available curated data, the second is the number of false positives that occur in the physical primary screening process, and the third is that the data is highly imbalanced, with a low ratio of Active to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) to a variety of bioassay datasets.
Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated, and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. With regard to the number of false positives that occur in the primary screening process, the analysis carried out has been shallow because of this lack of cross-referencing. In the six cases found, the average percentage of false positives from the High-Throughput Primary screen is high, at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and the C4.5 decision tree learner performed relatively well. It was also found that the setting of the Weka cost matrix depends on the base classifier used and not solely on the ratio of class imbalance.
Understandably, pharmaceutical data is hard to obtain. However, it would be beneficial to both the pharmaceutical industry and academics for curated primary screening data and the corresponding confirmatory data to be provided. Two benefits could be gained by applying virtual screening techniques to bioassay data: first, the search space of compounds to be screened could be reduced; second, analysing the false positives that occur in the primary screening process could help to improve the screening technology. The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening at all. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing different classifiers on the same dataset.
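To make the role of misclassification costs concrete, the following is a minimal sketch of one common way a cost matrix can be applied: predicting the class with the lowest expected cost given a classifier's class-probability estimates. It is a generic plain-Python illustration, not Weka's code, and the cost values, function names and probability estimates are hypothetical. Because the decision depends on the probability estimates a given base classifier produces, the same cost matrix can lead to different decisions for different classifiers.

```python
from typing import Sequence

INACTIVE, ACTIVE = 0, 1


def min_expected_cost_class(probs: Sequence[float],
                            cost: Sequence[Sequence[float]]) -> int:
    """Return the class index with the lowest expected misclassification cost.

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    """
    n_classes = len(probs)
    expected = [
        sum(probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


def active_threshold(cost: Sequence[Sequence[float]]) -> float:
    """Probability of Active above which predicting Active is cheaper
    (two-class case with zero cost for correct predictions)."""
    c_fp = cost[INACTIVE][ACTIVE]   # Inactive compound predicted Active
    c_fn = cost[ACTIVE][INACTIVE]   # Active compound predicted Inactive
    return c_fp / (c_fp + c_fn)


# Hypothetical costs: missing an Active is treated as 20 times worse
# than flagging an Inactive. These values are illustrative only.
cost_matrix = [
    [0.0, 1.0],    # true Inactive: predicted Inactive, predicted Active
    [20.0, 0.0],   # true Active:   predicted Inactive, predicted Active
]

# Two hypothetical classifiers score the same compound differently.
probs_classifier_a = [0.90, 0.10]   # P(Inactive), P(Active)
probs_classifier_b = [0.97, 0.03]

decision_a = min_expected_cost_class(probs_classifier_a, cost_matrix)
decision_b = min_expected_cost_class(probs_classifier_b, cost_matrix)
```

With these illustrative numbers the first set of estimates leads to an Active prediction and the second to an Inactive one, since the break-even probability is 1/21. A cost setting tuned for one classifier's probability estimates therefore need not suit another's, which is consistent with the caution above about applying one ratio-based cost across classifiers.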