Sieg Jochen, Feldmann Christian W, Hemmerich Jennifer, Stork Conrad, Sandfort Frederik, Eiden Philipp, Mathea Miriam
BASF SE, Ludwigshafen, 67056, Germany.
J Chem Inf Model. 2024 Dec 23;64(24):9027-9033. doi: 10.1021/acs.jcim.4c00863. Epub 2024 Sep 17.
The open-source package scikit-learn provides various machine learning algorithms and data processing tools, including the Pipeline class, which allows users to prepend custom data transformation steps to the machine learning model. We introduce the MolPipeline package, which extends this concept to cheminformatics by wrapping standard RDKit functionality, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object. We aimed to build an easy-to-use Python package for creating fully automated end-to-end pipelines that scale to large data sets. Particular emphasis was placed on handling erroneous instances, which in default pipelines would require manual intervention to resolve. MolPipeline provides building blocks that integrate common cheminformatics tasks, such as scaffold splits and molecular standardization, into scikit-learn's pipeline framework, so that pipelines can be adapted easily to diverse project requirements.
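The idea described in the abstract can be illustrated with a minimal sketch that uses only plain scikit-learn and RDKit. It does not use or reproduce the MolPipeline API. The class `SmilesToDescriptors`, the helper `predict_with_nan_fill`, the choice of three descriptors, and the toy training values are all hypothetical and serve only to show a custom step prepended to a model and one possible way to keep erroneous inputs from breaking the pipeline.

```python
import numpy as np
from rdkit import Chem, RDLogger
from rdkit.Chem import Descriptors
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse messages for invalid SMILES


class SmilesToDescriptors(BaseEstimator, TransformerMixin):
    """Hypothetical step: SMILES strings -> matrix of three RDKit descriptors.

    SMILES that RDKit cannot parse yield a row of NaN instead of an exception.
    """

    def fit(self, X, y=None):
        self.is_fitted_ = True  # stateless step; marks the estimator as fitted
        return self

    def transform(self, X):
        rows = []
        for smiles in X:
            mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
            if mol is None:
                rows.append([np.nan, np.nan, np.nan])
            else:
                rows.append(
                    [
                        Descriptors.MolWt(mol),
                        Descriptors.MolLogP(mol),
                        Descriptors.TPSA(mol),
                    ]
                )
        return np.asarray(rows, dtype=float)


def predict_with_nan_fill(pipeline, smiles_list):
    """Hypothetical helper: predict for valid rows, fill NaN for invalid ones."""
    features = pipeline.named_steps["featurize"].transform(smiles_list)
    valid = ~np.isnan(features).any(axis=1)
    predictions = np.full(len(smiles_list), np.nan)
    if valid.any():
        predictions[valid] = pipeline.named_steps["model"].predict(features[valid])
    return predictions


# Custom transformation step prepended to a standard scikit-learn model.
pipeline = Pipeline(
    [
        ("featurize", SmilesToDescriptors()),
        ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
    ]
)

# Toy training data; the target values are arbitrary placeholders.
train_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC", "c1ccccc1O"]
train_targets = [0.1, 1.2, 0.3, 0.2, 0.9, 1.0]
pipeline.fit(train_smiles, train_targets)

# The second entry is not a valid SMILES string and receives NaN.
preds = predict_with_nan_fill(pipeline, ["CCO", "not_a_smiles", "c1ccccc1"])
print(preds)
```

In this sketch the output keeps one entry per input, so results stay aligned with the original data even when some molecules fail to parse. Filtering before training and filling after prediction is only one possible design; the package's own mechanism may differ.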