(2021) When Life Gives You Oranges: Detecting and Diagnosing Intermittent Job Failures at Mozilla.
|
Text
LJA+21.pdf - Accepted Version Download (2MB) | Preview |
Abstract
Continuous delivery of cloud systems requires constant running of jobs (build processes, tests, etc.). One issue that plagues this continuous integration (CI) process are intermittent failures - non-deterministic, false alarms that do not result from a bug in the software or job specification, but rather from issues in the underlying infrastructure. At Mozilla, such intermittent failures are called oranges as a reference to the color of the build status indicator. As such intermittent failures disrupt CI and lead to failures, they erode the developers' trust in the jobs. We present a novel approach that automatically classifies failing jobs to determine whether job execution failures arise from an actual software bug or were caused by flakiness in the job (e.g., test) or the underlying infrastructure. For this purpose, we train classification models using job telemetry data to diagnose failure patterns involving features such as runtime, cpu load, operating system version, or specific platform with high precision. In an evaluation on a set of Mozilla CI jobs, our approach achieves precision scores of 73%, on average, across all data sets with some test suites achieving precision scores good enough for fully automated classification (i.e., precision scores of up to 100%), and recall scores of 82% on average (up to 94%).
Item Type: | Conference or Workshop Item (A Paper) (Paper) |
---|---|
Conference: | ESEC/FSE European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (formerly listed as ESEC) |
Depositing User: | Johannes Lampel |
Date Deposited: | 22 Sep 2021 10:50 |
Last Modified: | 22 Sep 2021 10:50 |
Primary Research Area: | NRA5: Empirical & Behavioral Security |
URI: | https://publications.cispa.saarland/id/eprint/3474 |
Actions
Actions (login required)
View Item |