Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia.
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America.
PLoS Comput Biol. 2021 Mar 4;17(3):e1008671. doi: 10.1371/journal.pcbi.1008671. eCollection 2021 Mar.
Overfitting is one of the critical problems in developing models by machine learning. With machine learning becoming an essential technology in computational biology, we must include training about overfitting in all courses that introduce this technology to students and practitioners. Here we propose hands-on training on overfitting that is suitable for introductory-level courses and can be carried out on its own or embedded within any data science course. We use a workflow-based design of machine learning pipelines, experimentation-based teaching, and a hands-on approach that focuses on concepts rather than the underlying mathematics. We detail the data analysis workflows we use in training and motivate them from the viewpoint of teaching goals. Our proposed approach relies on Orange, an open-source data science toolbox that combines data visualization and machine learning and that is tailored for education in machine learning and exploratory data analysis.
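The core idea behind such training can be illustrated with a minimal sketch. It is not one of the authors' Orange workflows. It is a plain Python illustration, using only the standard library, of the kind of experiment that exposes overfitting. The data are assumed to be pure noise (random features with random labels), and all function names (`make_random_data`, `nearest_neighbor_predict`, `accuracy`) are hypothetical helpers introduced for this example. A 1-nearest-neighbor classifier memorizes its training set, so it scores perfectly on the data it has seen, while its accuracy on held-out data stays near chance.

```python
import random


def make_random_data(n, n_features, rng):
    """Generate n instances with random features and random binary labels.

    The labels are independent of the features, so there is nothing to learn.
    """
    X = [[rng.random() for _ in range(n_features)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y


def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training instance closest to x (1-NN)."""
    closest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[closest]


def accuracy(train_X, train_y, X, y):
    """Fraction of instances in (X, y) that the 1-NN model labels correctly."""
    hits = sum(
        nearest_neighbor_predict(train_X, train_y, x) == label
        for x, label in zip(X, y)
    )
    return hits / len(X)


rng = random.Random(0)  # fixed seed so the experiment is repeatable
train_X, train_y = make_random_data(200, 5, rng)
test_X, test_y = make_random_data(200, 5, rng)

# Scoring on the training data: each instance is its own nearest neighbor.
train_acc = accuracy(train_X, train_y, train_X, train_y)
# Scoring on data the model has never seen.
test_acc = accuracy(train_X, train_y, test_X, test_y)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {test_acc:.2f}")
```

The gap between the two numbers is the point of the exercise: a model evaluated on its own training data can look perfect even when the data contain no signal at all, which is why accuracy should be estimated on instances that were not used for training.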