MolPipeline: A Python Package for Processing Molecules with RDKit in Scikit-learn.

Author information

Sieg Jochen, Feldmann Christian W, Hemmerich Jennifer, Stork Conrad, Sandfort Frederik, Eiden Philipp, Mathea Miriam

Affiliation

BASF SE, Ludwigshafen, 67056, Germany.

Publication information

J Chem Inf Model. 2024 Dec 23;64(24):9027-9033. doi: 10.1021/acs.jcim.4c00863. Epub 2024 Sep 17.

DOI:10.1021/acs.jcim.4c00863

PMID:39288001

Abstract

The open-source package scikit-learn provides various machine learning algorithms and data processing tools, including the Pipeline class, which allows users to prepend custom data transformation steps to the machine learning model. We introduce the MolPipeline package, which extends this concept to cheminformatics by wrapping standard RDKit functionality, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object. We aimed to build an easy-to-use Python package to create completely automated end-to-end pipelines that scale to large data sets. Particular emphasis was put on handling erroneous instances, where resolution would require manual intervention in default pipelines. MolPipeline provides the building blocks to enable seamless integration of common cheminformatics tasks within scikit-learn's pipeline framework, such as scaffold splits and molecular standardization, making pipeline building easily adaptable to diverse project requirements.
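The error-handling idea in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not MolPipeline's API: `InvalidInstance`, `run_pipeline`, `toy_parse`, and `toy_descriptor` are hypothetical names, and the "parsing" and "descriptor" steps are toy stand-ins for the RDKit functionality the package wraps. The sketch only shows how a failed sample can be replaced by a marker that later steps skip, so the pipeline runs end to end without manual intervention.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class InvalidInstance:
    """Hypothetical marker that replaces a sample after a step failed on it."""
    step_name: str
    message: str


Step = Tuple[str, Callable[[Any], Any]]


def run_pipeline(steps: Sequence[Step], samples: Sequence[Any]) -> List[Any]:
    """Apply each step to each sample, isolating per-sample failures."""
    results: List[Any] = list(samples)
    for name, func in steps:
        for i, value in enumerate(results):
            if isinstance(value, InvalidInstance):
                continue  # failed in an earlier step; pass the marker through
            try:
                results[i] = func(value)
            except ValueError as exc:
                results[i] = InvalidInstance(name, str(exc))
    return results


def toy_parse(text: str) -> str:
    """Toy stand-in for SMILES parsing: only checks that parentheses balance."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if depth < 0:
            raise ValueError(f"unbalanced parentheses in {text!r}")
    if depth != 0 or not text:
        raise ValueError(f"cannot parse {text!r}")
    return text


def toy_descriptor(text: str) -> int:
    """Toy stand-in for a descriptor: counts C, N, and O characters."""
    return sum(ch in "CNOcno" for ch in text)


steps = [("parse", toy_parse), ("descriptor", toy_descriptor)]
samples = ["CCO", "CC(=O", "c1ccccc1"]  # the second string is deliberately broken

output = run_pipeline(steps, samples)

# Replace markers with a fill value so the result keeps the input's length and order.
filled = [None if isinstance(v, InvalidInstance) else v for v in output]
```

Keeping the output aligned with the input is the design choice that matters here: a downstream model or report can tell which samples failed, and at which step, without the whole batch aborting.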

Cited by

Targeting Poly (ADP-ribose) polymerase-1 (PARP-1) for DNA repair mechanism through QSAR-based virtual screening and MD simulation.

Mol Divers. 2025 Apr 14. doi: 10.1007/s11030-025-11184-9.

Machine learning-based screening and molecular simulations for discovering novel PARP-1 inhibitors targeting DNA repair mechanisms for breast cancer therapy.

Mol Divers. 2025 Feb 3. doi: 10.1007/s11030-025-11119-4.

Deepmol: an automated machine and deep learning framework for computational chemistry.

J Cheminform. 2024 Dec 5;16(1):136. doi: 10.1186/s13321-024-00937-7.
