When Life Gives You Oranges: Detecting and Diagnosing Intermittent Job Failures at Mozilla

Lampel, Johannes and Just, Sascha and Apel, Sven and Zeller, Andreas
(2021) When Life Gives You Oranges: Detecting and Diagnosing Intermittent Job Failures at Mozilla.
In: ESEC/FSE 2021.
Conference: ESEC/FSE European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (formerly listed as ESEC)

LJA+21.pdf - Accepted Version

Official URL: https://doi.org/10.1145/3468264.3473931

Abstract

Continuous delivery of cloud systems requires constant running of jobs (build processes, tests, etc.). One issue that plagues this continuous integration (CI) process is intermittent failures: non-deterministic false alarms that do not result from a bug in the software or job specification, but rather from issues in the underlying infrastructure. At Mozilla, such intermittent failures are called oranges, a reference to the color of the build status indicator. Because such intermittent failures disrupt CI and lead to failures, they erode the developers' trust in the jobs. We present a novel approach that automatically classifies failing jobs to determine whether job execution failures arise from an actual software bug or were caused by flakiness in the job (e.g., test) or the underlying infrastructure. For this purpose, we train classification models using job telemetry data to diagnose, with high precision, failure patterns involving features such as runtime, CPU load, operating system version, or specific platform. In an evaluation on a set of Mozilla CI jobs, our approach achieves precision scores of 73% on average across all data sets, with some test suites achieving precision scores good enough for fully automated classification (i.e., precision scores of up to 100%), and recall scores of 82% on average (up to 94%).
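
For readers less familiar with the two metrics reported in the abstract, the following is a minimal sketch of how precision and recall can be computed for a binary failure classifier. It is not the authors' implementation: the `Outcome` type, the `precision_recall` helper, and the six sample outcomes are hypothetical and invented purely for illustration, and treating "intermittent" as the positive class is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One failed job: the classifier's verdict and the true label (hypothetical type)."""
    predicted_intermittent: bool
    actually_intermittent: bool


def precision_recall(outcomes):
    """Return (precision, recall), treating 'intermittent' as the positive class."""
    tp = sum(1 for o in outcomes if o.predicted_intermittent and o.actually_intermittent)
    fp = sum(1 for o in outcomes if o.predicted_intermittent and not o.actually_intermittent)
    fn = sum(1 for o in outcomes if not o.predicted_intermittent and o.actually_intermittent)
    # Guard against empty denominators (no positive predictions / no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up sample data, not taken from the paper.
outcomes = [
    Outcome(True, True),    # correctly flagged as intermittent
    Outcome(True, True),
    Outcome(True, True),
    Outcome(True, False),   # real bug wrongly flagged as intermittent
    Outcome(False, True),   # intermittent failure that was missed
    Outcome(False, False),  # real bug correctly left unflagged
]

precision, recall = precision_recall(outcomes)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, high precision means that a failure flagged as intermittent is rarely a real bug, which is why the abstract ties fully automated classification to precision rather than recall.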
