Institute of Clinical Pharmacology, Goethe-University, Frankfurt am Main, Germany.
Fraunhofer Institute for Translational Medicine and Pharmacology ITMP, Frankfurt am Main, Germany.
CPT Pharmacometrics Syst Pharmacol. 2021 Nov;10(11):1371-1381. doi: 10.1002/psp4.12704. Epub 2021 Oct 1.
The evaluation of pharmacological data using machine learning requires high data quality. Data preprocessing is therefore crucial: analytical laboratory errors must be cleaned, missing values and outliers replaced, and the data transformed adequately before the actual analysis. Because current tools for this purpose often require programming skills, preprocessing tools with graphical user interfaces that can be used interactively are needed. In a collaboration between data scientists and experts in bioanalytical diagnostics, a graphical software package for data preprocessing called pguIMP is proposed, which contains a fixed sequence of preprocessing steps to enable reproducible interactive data preprocessing. As an R-based package, it also allows direct integration into the R data science environment without requiring any programming knowledge. The implementation of contemporary data processing methods, including machine-learning-based imputation techniques, ensures the generation of corrected and cleaned bioanalytical data sets that preserve data structures such as clusters better than is possible with classical methods. This was evaluated on bioanalytical data sets from lipidomics and drug research using k-nearest-neighbors-based imputation followed by k-means clustering and density-based spatial clustering of applications with noise (DBSCAN). The R package provides a Shiny-based web interface designed to be easy to use for users who are not experts in data analysis. It is demonstrated that the spectrum of methods provided is suitable as a standard pipeline for preprocessing bioanalytical data in biomedical research domains. The R package pguIMP is freely available from the Comprehensive R Archive Network (CRAN; https://cran.r-project.org/web/packages/pguIMP/index.html).
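As an illustration of the k-nearest-neighbors-based imputation mentioned in the abstract, the following is a minimal sketch in Python using only the standard library. It is not the pguIMP implementation, which is an R package whose internals are not described here. The function name `knn_impute`, the distance measure (root mean squared difference over the columns both rows have observed), and the toy data are assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Hypothetical helper for illustration; not part of pguIMP.
    """
    imputed = [list(r) for r in rows]
    n_cols = len(rows[0])
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            # Candidate donors: other rows in which column j is observed.
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(n_cols)
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                # Root mean squared difference over shared columns, so rows
                # with different numbers of shared columns stay comparable.
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            donors = [value for _, value in candidates[:k]]
            if donors:
                imputed[i][j] = sum(donors) / len(donors)
    return imputed


# Toy data with two well-separated groups; one value is missing.
data = [
    [1.0, 1.1, 0.9],
    [1.2, 0.9, 1.1],
    [1.1, None, 1.0],
    [8.0, 8.2, 7.9],
    [8.1, 7.9, 8.3],
]
filled = knn_impute(data, k=2)
print(filled[2])
```

In this toy example the missing value is filled from the two rows of the same group, giving a value close to 1.0. Replacing it with the column mean would instead give about 4.5, a point between the two groups, which is the kind of distortion of cluster structure that the abstract attributes to classical methods.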