Enrichment of high-throughput screening data with increasing levels of noise using support vector machines, recursive partitioning, and Laplacian-modified naive Bayesian classifiers.

Author information

Glick Meir, Jenkins Jeremy L, Nettles James H, Hitchings Hamilton, Davies John W

Affiliations

Lead Discovery Center, Novartis Institutes for Biomedical Research Inc., Cambridge, Massachusetts 02139, USA.

Publication information

J Chem Inf Model. 2006 Jan-Feb;46(1):193-200. doi: 10.1021/ci050374h.

Abstract

High-throughput screening (HTS) plays a pivotal role in lead discovery for the pharmaceutical industry. In tandem, cheminformatics approaches are employed to increase the probability of the identification of novel biologically active compounds by mining the HTS data. HTS data is notoriously noisy, and therefore, the selection of the optimal data mining method is important for the success of such an analysis. Here, we describe a retrospective analysis of four HTS data sets using three mining approaches: Laplacian-modified naive Bayes, recursive partitioning, and support vector machine (SVM) classifiers with increasing stochastic noise in the form of false positives and false negatives. All three of the data mining methods at hand tolerated increasing levels of false positives even when the ratio of misclassified compounds to true active compounds was 5:1 in the training set. False negatives in the ratio of 1:1 were tolerated as well. SVM outperformed the other two methods in capturing active compounds and scaffolds in the top 1%. A Murcko scaffold analysis could explain the differences in enrichments among the four data sets. This study demonstrates that data mining methods can add a true value to the screen even when the data is contaminated with a high level of stochastic noise.
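The noise experiment described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy data, and the exact reading of the ratios are assumptions made here. False positives are read as `ratio` mislabeled inactives per true active, and false negatives as `ratio` relabeled actives per active that keeps its label. Classifier training is left out; the sketch covers only label corruption of a training set and an enrichment factor for the top 1% of a ranked list, scored against the true labels.

```python
import random


def add_false_positives(labels, ratio, rng):
    """Relabel true inactives as active: `ratio` false positives per true active."""
    noisy = list(labels)
    n_actives = sum(labels)
    inactive_idx = [i for i, y in enumerate(labels) if y == 0]
    n_flip = min(len(inactive_idx), int(round(ratio * n_actives)))
    for i in rng.sample(inactive_idx, n_flip):
        noisy[i] = 1
    return noisy


def add_false_negatives(labels, ratio, rng):
    """Relabel true actives as inactive so that flipped:kept actives is ratio:1."""
    noisy = list(labels)
    active_idx = [i for i, y in enumerate(labels) if y == 1]
    n_flip = int(len(active_idx) * ratio / (1 + ratio))
    for i in rng.sample(active_idx, n_flip):
        noisy[i] = 0
    return noisy


def enrichment_factor(scores, true_labels, top_fraction=0.01):
    """Hit rate in the top-scoring fraction divided by the overall hit rate."""
    n = len(scores)
    total_hits = sum(true_labels)
    if n == 0 or total_hits == 0:
        return 0.0
    n_top = max(1, round(n * top_fraction))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits_top = sum(true_labels[i] for i in ranked[:n_top])
    return (hits_top / n_top) / (total_hits / n)


# Hypothetical toy screen: 10 true actives among 1000 compounds.
true_labels = [1] * 10 + [0] * 990
rng = random.Random(0)

train_fp = add_false_positives(true_labels, ratio=5, rng=rng)  # 5:1 noise
train_fn = add_false_negatives(true_labels, ratio=1, rng=rng)  # 1:1 noise

# Stand-in for model output: a ranking that places every true active first.
ideal_scores = [float(y) for y in true_labels]
ef_top1 = enrichment_factor(ideal_scores, true_labels)
```

In a full experiment, a classifier would be trained on `train_fp` or `train_fn` and its scores on held-out compounds passed to `enrichment_factor`. Scoring against the uncorrupted labels is what shows whether a model still recovers true actives despite the noise.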
