van den Maagdenberg Helle W, Šícho Martin, Araripe David Alencar, Luukkonen Sohvi, Schoenmaker Linde, Jespers Michiel, Béquignon Olivier J M, González Marina Gorostiola, van den Broek Remco L, Bernatavicius Andrius, van Hasselt J G Coen, van der Graaf Piet H, van Westen Gerard J P
Computational Drug Discovery, Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, Leiden, 2333 CC, The Netherlands.
CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, Prague, 166 28, Czech Republic.
J Cheminform. 2024 Nov 14;16(1):128. doi: 10.1186/s13321-024-00908-y.
Building reliable and robust quantitative structure-property relationship (QSPR) models is a challenging task. First, the experimental data needs to be obtained, analyzed and curated. Second, the number of available methods is continuously growing and evaluating different algorithms and methodologies can be arduous. Finally, the last hurdle that researchers face is to ensure the reproducibility of their models and facilitate their transferability into practice. In this work, we introduce QSPRpred, a toolkit for analysis of bioactivity data sets and QSPR modelling, which attempts to address the aforementioned challenges. QSPRpred's modular Python API enables users to intuitively describe different parts of a modelling workflow using a plethora of pre-implemented components, but also integrates customized implementations in a "plug-and-play" manner. QSPRpred data sets and models are directly serializable, which means they can be readily reproduced and put into operation after training as the models are saved with all required data pre-processing steps to make predictions on new compounds directly from SMILES strings. The general-purpose character of QSPRpred is also demonstrated by inclusion of support for multi-task and proteochemometric modelling. The package is extensively documented and comes with a large collection of tutorials to help new users. In this paper, we describe all of QSPRpred's functionalities and also conduct a small benchmarking case study to illustrate how different components can be leveraged to compare a diverse set of models. QSPRpred is fully open-source and available at https://github.com/CDDLeiden/QSPRpred.
Scientific Contribution: QSPRpred aims to provide a complex, but comprehensive Python API to conduct all tasks encountered in QSPR modelling from data preparation and analysis to model creation and model deployment. In contrast to similar packages, QSPRpred offers a wider and more exhaustive range of capabilities and integrations with many popular packages that also go beyond QSPR modelling. A significant contribution of QSPRpred is also in its automated and highly standardized serialization scheme, which significantly improves reproducibility and transferability of models.
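The idea of saving a model together with its pre-processing, so that it can predict directly from SMILES strings after reloading, can be sketched as follows. This is a minimal, standard-library illustration of the concept only and does not use the QSPRpred API: the class `SmilesModel`, the symbol-counting featurizer, the 1-nearest-neighbour predictor and the target values are all hypothetical stand-ins chosen for brevity.

```python
import json

# Hypothetical toy descriptor vocabulary; a real workflow would use
# chemically meaningful descriptors such as fingerprints.
DEFAULT_VOCABULARY = ["C", "N", "O", "c", "n", "o"]


class SmilesModel:
    """Toy model that stores its pre-processing settings alongside its state."""

    def __init__(self, vocabulary=None):
        self.vocabulary = list(vocabulary or DEFAULT_VOCABULARY)
        self.train_features = []
        self.train_targets = []

    def featurize(self, smiles):
        # Pre-processing step: count each vocabulary symbol in the SMILES string.
        return [float(smiles.count(symbol)) for symbol in self.vocabulary]

    def fit(self, smiles_list, targets):
        self.train_features = [self.featurize(s) for s in smiles_list]
        self.train_targets = list(targets)
        return self

    def predict(self, smiles_list):
        # 1-nearest-neighbour lookup on squared Euclidean distance.
        predictions = []
        for smiles in smiles_list:
            x = self.featurize(smiles)
            distances = [
                sum((a - b) ** 2 for a, b in zip(x, features))
                for features in self.train_features
            ]
            predictions.append(self.train_targets[distances.index(min(distances))])
        return predictions

    def to_json(self):
        # Serialize the pre-processing configuration together with the model state.
        return json.dumps({
            "vocabulary": self.vocabulary,
            "train_features": self.train_features,
            "train_targets": self.train_targets,
        })

    @classmethod
    def from_json(cls, text):
        state = json.loads(text)
        model = cls(vocabulary=state["vocabulary"])
        model.train_features = state["train_features"]
        model.train_targets = state["train_targets"]
        return model


# Arbitrary toy targets, for illustration only (not experimental data).
model = SmilesModel().fit(["CCO", "c1ccccc1", "CC(=O)O"], [0.2, 2.1, -0.3])
restored = SmilesModel.from_json(model.to_json())
print(restored.predict(["CCCO"]))
```

Because the featurizer settings travel with the serialized state, the restored object needs nothing but a SMILES string to make a prediction, which is the property the abstract attributes to QSPRpred's serialization scheme.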