Pontillo Valeria, Palomba Fabio, Ferrucci Filomena
Software Engineering (SeSa) Lab - Department of Computer Science, University of Salerno, Fisciano, Italy.
Empir Softw Eng. 2022;27(7):187. doi: 10.1007/s10664-022-10227-1. Epub 2022 Oct 1.
Test flakiness is a phenomenon occurring when a test case is non-deterministic, exhibiting both passing and failing behavior when run against the same code. Over the last few years, the problem has been closely investigated by researchers and practitioners, who have shown its relevance in practice. The software engineering research community has been working toward defining approaches for detecting and addressing test flakiness. Despite being quite accurate, most of these approaches rely on expensive dynamic steps, e.g., the computation of code coverage information. Consequently, they may suffer from scalability issues that preclude their practical use. This limitation has recently been targeted through machine learning solutions that predict the flakiness of tests using various features, such as source code vocabulary or a mixture of static and dynamic metrics computed on individual snapshots of the system. In this paper, we aim to take a step forward and predict test flakiness relying solely on static features. We propose a large-scale experiment on 70 Java projects coming from the iDFlakies and FlakeFlagger datasets. First, we statistically assess the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells, analyzing both their individual and combined effects. Based on the results achieved, we experiment with a machine learning approach that predicts test flakiness solely based on static features, comparing it with two state-of-the-art approaches. The key results of the study show that the static approach achieves performance comparable to that of the baselines. In addition, we find that the characteristics of the production code may impact the performance of flaky test prediction models.
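The non-determinism the abstract describes can be illustrated with a minimal hypothetical test (not taken from the paper's subject projects): the same code may pass on one run and fail on another.

```python
# Hypothetical illustration (not from the paper) of test flakiness:
# the assertion below can both pass and fail against identical code.
def first_item(items):
    """Return an arbitrary element of a set."""
    return next(iter(items))

def test_first_item():
    items = {"alpha", "beta", "gamma"}
    # Set iteration order varies between Python processes (hash
    # randomization), so pinning one specific element makes this
    # test non-deterministic: it may pass now and fail on a rerun.
    assert first_item(items) == "alpha"
```

Flakiness of this kind stems from an "unordered collection" assumption in the test itself, one of the smell-like static characteristics that a prediction model could pick up without ever executing the test.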
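The prediction setup described in the abstract, learning a flaky/non-flaky classifier from statically computable features only, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the two features, the synthetic data, and the decision-stump "learner" are all hypothetical stand-ins for the paper's 25 static metrics and smells and its machine learning models.

```python
import random

# Minimal sketch (not the authors' model): classify tests as flaky vs.
# non-flaky using only static features. Here, two hypothetical features:
# test size (LOC) and test-smell count; flaky tests are synthesized to
# be larger and smellier on average.
random.seed(0)

def make_test(flaky):
    loc = random.gauss(60 if flaky else 30, 10)
    smells = random.gauss(3 if flaky else 1, 1)
    return (loc, smells, flaky)

data = [make_test(i % 2 == 0) for i in range(200)]
train, test = data[:150], data[150:]

def accuracy(threshold, rows):
    # Decision stump: predict "flaky" when LOC exceeds the threshold.
    return sum((loc > threshold) == flaky for loc, _, flaky in rows) / len(rows)

# "Train" by picking the LOC threshold that best separates the classes.
best = max((loc for loc, _, _ in train), key=lambda t: accuracy(t, train))
print(f"held-out accuracy: {accuracy(best, test):.2f}")
```

Because every feature is computable without running the tests, such a model avoids the expensive dynamic steps (e.g., coverage computation) that the abstract identifies as the scalability bottleneck of earlier detectors.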