Large-scale learning of structure-activity relationships using a linear support vector machine and problem-specific metrics.

Author information

Center for Bioinformatics (ZBIT), University of Tübingen, Tübingen, Germany.

Publication information

J Chem Inf Model. 2011 Feb 28;51(2):203-13. doi: 10.1021/ci100073w. Epub 2011 Jan 5.

Abstract

The goal of this study was to adapt a recently proposed large-scale linear support vector machine to large-scale binary classification problems in cheminformatics and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on entire large, unbalanced data sets. This linear support vector machine formulation performs very well on high-dimensional sparse feature vectors. An additional advantage is that the cost of a prediction is, on average, linear in the number of non-zero features. Nevertheless, the approach assumes that the problem is linearly separable. We therefore conducted extensive benchmarks to evaluate its performance on large-scale problems with up to 175,000 samples. To examine virtual screening performance, we determined chemotype clusters using Feature Trees and used this information to compute weighted AUC-based performance measures and to perform leave-cluster-out cross-validation. We also considered the BEDROC score, a metric proposed to address the early enrichment problem. Performance on each problem was evaluated by nested cross-validation and nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. LIBLINEAR outperformed these reference approaches in a direct comparison. A comparison with literature results showed that LIBLINEAR is competitive but does not match the top-ranked nonlinear machines on these benchmarks. However, given its convincing overall performance and computation time, the large-scale support vector machine is an excellent alternative to established large-scale classification approaches.
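
For readers unfamiliar with the BEDROC score mentioned in the abstract, the following is a minimal sketch of how such an early-enrichment metric can be computed from a ranked list. It is not the authors' LIBLINEAR extension. It assumes the commonly used definition of BEDROC as the robust initial enhancement (RIE) rescaled between its minimum and maximum possible values. The function name `bedroc`, the default `alpha=20.0`, and the toy data are illustrative assumptions.

```python
import math


def bedroc(scores, labels, alpha=20.0):
    """Return the BEDROC score of a ranking.

    scores: classifier outputs, higher means "more likely active".
    labels: 1 for active, 0 for inactive (same order as scores).
    alpha:  early-recognition weight (assumed default, not from the paper).
    """
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0 or n_actives == n_total:
        raise ValueError("need at least one active and one inactive compound")

    # Rank compounds by decreasing score; rank 1 is the top of the list.
    order = sorted(range(n_total), key=lambda i: scores[i], reverse=True)

    # Exponentially weighted sum over the 1-based ranks of the actives,
    # so actives near the top contribute far more than those further down.
    weighted = sum(
        math.exp(-alpha * rank / n_total)
        for rank, i in enumerate(order, start=1)
        if labels[i] == 1
    )

    ratio = n_actives / n_total
    # RIE: observed weighted sum divided by its expectation under random ranking.
    expected = ratio * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = weighted / expected

    # Rescale RIE to [0, 1] using its best-case and worst-case values.
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


if __name__ == "__main__":
    # Toy ranking: 2 actives among 10 compounds (made-up data).
    toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
    print(bedroc(toy_scores, toy_labels))
```

Unlike the plain AUC, this score weights actives by rank, so two rankings with the same AUC can receive different BEDROC values if one places its actives nearer the top of the list. Larger `alpha` values concentrate the weight on a smaller top fraction.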
