Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, USA.
J Chem Inf Model. 2011 Mar 28;51(3):521-31. doi: 10.1021/ci100399j. Epub 2011 Mar 7.
Advanced high-throughput screening (HTS) technologies generate large amounts of bioactivity data, which must be carefully analyzed and interpreted to understand how small molecules affect biological systems. There is therefore an increasing demand to develop and adapt cheminformatics algorithms and tools that predict molecular and pharmacological properties from these large data sets. In this manuscript, we report a novel machine-learning-based ligand classification algorithm, named Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS), for data mining and modeling of large chemical data sets to predict pharmacological properties efficiently and accurately. The performance of LiCABEDS was evaluated by predicting GPCR ligand functionality (agonist or antagonist) using four molecular fingerprints: Maccs, FP2, Unity, and Molprint 2D. Our studies showed that LiCABEDS outperformed two other popular techniques, the classification tree and the Naive Bayes classifier, on all four fingerprint types. The parameters of LiCABEDS, including the number of boosting iterations, the initialization condition, and a "reject option" boundary, were thoroughly explored and discussed to demonstrate its ability to handle imbalanced data sets, as well as its robustness and flexibility. The mathematical concepts and theory behind the statistical prediction model are also presented in detail. The LiCABEDS algorithm has been implemented as a user-friendly software package that is accessible online at http://www.cbligand.org/LiCABEDS/.
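The idea of boosting decision stumps over binary fingerprint bits, with a reject option, can be illustrated with a minimal sketch. This is not the LiCABEDS implementation: it assumes a standard AdaBoost-style weight update, uniform initial weights, and a reject rule that abstains when the absolute ensemble score falls below a margin. The function names (`train`, `score`, `predict`), the `reject_margin` parameter, and the 3-bit toy "fingerprints" with a majority-vote label are all hypothetical and chosen only for illustration.

```python
import math
from itertools import product

def stump_output(x, bit, polarity):
    """A decision stump on one fingerprint bit: returns +polarity if set, else -polarity."""
    return polarity if x[bit] else -polarity

def train(X, y, n_rounds):
    """AdaBoost-style training (assumed update rule) over single-bit stumps.
    X: list of binary fingerprints; y: labels in {+1, -1}."""
    n = len(X)
    weights = [1.0 / n] * n          # assumed uniform initialization
    ensemble = []                    # list of (bit, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for bit in range(len(X[0])):
            for polarity in (1, -1):
                # weighted error of this stump on the training set
                err = sum(w for x, label, w in zip(X, y, weights)
                          if stump_output(x, bit, polarity) != label)
                if best is None or err < best[0]:
                    best = (err, bit, polarity)
        err, bit, polarity = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((bit, polarity, alpha))
        # raise the weight of misclassified samples, lower the rest, renormalize
        weights = [w * math.exp(-alpha * label * stump_output(x, bit, polarity))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def score(ensemble, x):
    """Weighted vote of all stumps; the sign gives the predicted class."""
    return sum(alpha * stump_output(x, bit, polarity)
               for bit, polarity, alpha in ensemble)

def predict(ensemble, x, reject_margin=0.0):
    """Returns +1 or -1, or 0 (reject) when |score| is below reject_margin."""
    s = score(ensemble, x)
    if abs(s) < reject_margin:
        return 0
    return 1 if s > 0 else -1

# Toy data (invented): all 3-bit fingerprints, label +1 if at least two bits are set.
X = [list(bits) for bits in product((0, 1), repeat=3)]
y = [1 if sum(x) >= 2 else -1 for x in X]

model = train(X, y, n_rounds=3)
print([predict(model, x) for x in X] == y)
print(predict(model, [1, 1, 0], reject_margin=0.3))
```

On this toy set no single bit separates the classes, so one stump alone misclassifies two of the eight fingerprints, while three boosted stumps recover the majority rule. Raising `reject_margin` makes the sketch abstain on low-confidence samples such as `[1, 1, 0]`, where the stumps disagree.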