Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, The Netherlands.
Big Data. 2018 Dec;6(4):248-261. doi: 10.1089/big.2018.0070. Epub 2018 Nov 13.
This article focuses on the performance of runners in official races. Based on extensive public data from participants in races organized by the Boston Athletic Association, we demonstrate how different pacing profiles can affect performance in a race. An athlete's pacing profile refers to the running speed at various stages of the race. We aim to provide practical, data-driven advice for professional as well as recreational runners. Our data collection covers 3 years of results made public by the race organizers and primarily concerns the times at various intermediate points, which indicate the speed profile of the individual runner. We consider the 10 km, half marathon, and full marathon, leading to a data set of 120,472 race results. Although these data were not primarily recorded for scientific analysis, we demonstrate that valuable information can be gleaned from this large data set about the right way to approach a running challenge. In this article, we focus on the role of race distance, gender, age, and the pacing profile. Since age is a crucial but complex determinant of performance, we first model the age effect in a gender- and distance-specific manner. We consider polynomials of high degree and use cross-validation to select models that are both accurate and sufficiently generalizable. After that, we cluster the race profiles to identify the dominant pacing profiles that runners select. Finally, after compensating for age influences, we apply a descriptive pattern mining approach to select the reliable and informative aspects of pacing that most strongly determine an optimal performance. The mining paradigm produces relatively simple and readable patterns, such that both professionals and amateurs can use the results to their benefit.
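To make the notion of a pacing profile concrete, the following is a minimal sketch of how cumulative split times at intermediate points could be turned into a speed profile. It is not the authors' implementation: the function name `pacing_profile`, the normalization by the runner's mean speed, and the example split times are all assumptions made for illustration.

```python
from typing import List, Sequence


def pacing_profile(distances_km: Sequence[float],
                   times_s: Sequence[float]) -> List[float]:
    """Return each segment's speed relative to the runner's mean speed.

    Both inputs are cumulative values at each timing point, ending with
    the finish line. The start (0 km, 0 s) is implied.
    Normalizing by mean speed is an illustrative assumption, not a
    detail taken from the article.
    """
    if len(distances_km) != len(times_s) or not distances_km:
        raise ValueError("need one cumulative time per timing point")

    # Mean speed over the whole race, in km per second.
    mean_speed = distances_km[-1] / times_s[-1]

    profile = []
    prev_d, prev_t = 0.0, 0.0
    for d, t in zip(distances_km, times_s):
        if d <= prev_d or t <= prev_t:
            raise ValueError("cumulative values must be strictly increasing")
        segment_speed = (d - prev_d) / (t - prev_t)
        profile.append(segment_speed / mean_speed)
        prev_d, prev_t = d, t
    return profile


# Hypothetical 10 km runner: 25:00 at 5 km, 51:40 at the finish.
example = pacing_profile([5.0, 10.0], [1500.0, 3100.0])
```

In this sketch a value above 1 marks a segment run faster than the runner's own average and a value below 1 a slower one, so the hypothetical runner above started fast and slowed in the second half. Expressing the profile relative to mean speed makes runners of different ability comparable, which is one plausible preparation step before clustering profiles.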