Rouzrokh Pouria, Khosravi Bardia, Faghani Shahriar, Moassefi Mana, Vera Garcia Diana V, Singh Yashbir, Zhang Kuan, Conte Gian Marco, Erickson Bradley J
Radiology Informatics Laboratory, Department of Radiology, Mayo Clinic, 200 1st St SW, Rochester, MN 55905.
Radiol Artif Intell. 2022 Aug 24;4(5):e210290. doi: 10.1148/ryai.210290. eCollection 2022 Sep.
Minimizing bias is critical to the adoption and implementation of machine learning (ML) in clinical practice. Systematic mathematical biases produce consistent and reproducible differences between the observed and expected performance of ML systems, resulting in suboptimal performance. Such biases can be traced back to various phases of ML development: data handling, model development, and performance evaluation. This report presents 12 suboptimal practices in the data handling of an ML study, explains how those practices can lead to biases, and describes what may be done to mitigate them. The authors employ an arbitrary and simplified framework that splits ML data handling into four steps: data collection, data investigation, data splitting, and feature engineering. Examples from the available research literature are provided. A Google Colaboratory Jupyter notebook includes code examples that demonstrate the suboptimal practices and the steps to prevent them.
Keywords: Data Handling, Bias, Machine Learning, Deep Learning, Convolutional Neural Network (CNN), Computer-aided Diagnosis (CAD)
© RSNA, 2022.