Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, CH-4056 Basel, Switzerland.
J Chem Inf Model. 2013 Nov 25;53(11):2829-36. doi: 10.1021/ci400466r. Epub 2013 Nov 14.
The concept of data fusion, the combination of information from different sources describing the same object with the expectation of generating a more accurate representation, has found application in a very broad range of disciplines. In the context of ligand-based virtual screening (VS), data fusion has been applied to combine knowledge from either different active molecules or different fingerprints to improve similarity search performance. Machine-learning (ML) methods based on fusion of multiple homogeneous classifiers, in particular random forests (RF), have also been widely applied in the ML literature. The heterogeneous version of classifier fusion, which fuses the predictions from different model types, has been less explored. Here, we investigate heterogeneous classifier fusion for ligand-based VS using three different ML methods, namely RF, naïve Bayes (NB), and logistic regression (LR), combined with four 2D fingerprints: atom pairs, topological torsions, the RDKit fingerprint, and a circular fingerprint. The methods are compared using a previously developed benchmarking platform for 2D fingerprints, which is extended to ML methods in this article. The original data sets are filtered for difficulty, and a new set of challenging data sets from ChEMBL is added. Data sets were also generated for a second use case: starting from a small set of related actives instead of diverse actives. The final fused model consistently outperforms the other approaches across the broad variety of targets studied, indicating that heterogeneous classifier fusion is a very promising approach for ligand-based VS. The new data sets together with the adapted source code for ML methods are provided in the Supporting Information.
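To make the idea of heterogeneous classifier fusion concrete, the following is a minimal sketch of one possible rank-based fusion scheme. The abstract does not state which fusion rule was used, so the "best rank, ties broken by mean rank" rule here is an assumption chosen for illustration. The function names `scores_to_ranks` and `fuse_by_best_rank` are hypothetical, and the score lists stand in for the predicted probabilities of activity that RF, NB, and LR models might assign to the same set of test molecules.

```python
from typing import Dict, List, Sequence


def scores_to_ranks(scores: Sequence[float]) -> List[int]:
    """Rank molecules by descending score; rank 1 is the top-scored molecule.

    Ties are broken by original position (an arbitrary, illustrative choice).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def fuse_by_best_rank(model_scores: Dict[str, Sequence[float]]) -> List[int]:
    """Return molecule indices ordered by fused rank, best first.

    Assumed fusion rule (not taken from the paper): a molecule's fused rank
    is the best (lowest) rank any single model gave it, with ties broken by
    the mean rank across all models.
    """
    if not model_scores:
        raise ValueError("need scores from at least one model")
    per_model_ranks = [scores_to_ranks(s) for s in model_scores.values()]
    n = len(per_model_ranks[0])
    if any(len(r) != n for r in per_model_ranks):
        raise ValueError("all models must score the same molecules")

    keys = []
    for i in range(n):
        ranks_i = [r[i] for r in per_model_ranks]
        keys.append((min(ranks_i), sum(ranks_i) / len(ranks_i), i))
    return [i for _, _, i in sorted(keys)]


# Hypothetical scores for four test molecules from three model types.
example_scores = {
    "RF": [0.9, 0.2, 0.6, 0.1],
    "NB": [0.3, 0.8, 0.7, 0.2],
    "LR": [0.5, 0.4, 0.9, 0.1],
}
fused_order = fuse_by_best_rank(example_scores)
print(fused_order)
```

Working on ranks rather than raw scores sidesteps the fact that different model types produce scores on different scales, which is one common motivation for rank-based fusion. Other rules, such as averaging ranks or averaging calibrated probabilities, would fit the same interface.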