Open Research Newcastle
Browse

An empirical study on program failures of deep learning jobs

Download (2.12 MB)
conference contribution
posted on 2025-05-10, 17:41 authored by Hongyu ZhangHongyu Zhang, Wencong Xiao, Yu Liu, Haoxiang Lin, Mao Yang
Deep learning has made significant achievements in many application areas. To train and test models more efficiently, enterprise developers submit and run their deep learning programs on a shared, multi-tenant platform. However, some of the programs fail after a long execution time due to code/script defects, which reduces the development productivity and wastes expensive resources such as GPU, storage, and network I/O.This paper presents the first comprehensive empirical study on program failures of deep learning jobs. 4960 real failures are collected from a deep learning platform in Microsoft. We manually examine their failure messages and classify them into 20 categories. In addition, we identify the common root causes and bug-fix solutions on a sample of 400 failures. To better understand the current testing and debugging practices for deep learning, we also conduct developer interviews. Our major findings include: (1) 48.0% of the failures occur in the interaction with the platform rather than in the execution of code logic, mostly due to the discrepancies between local and platform execution environments; (2) Deep learning specific failures (13.5%) are mainly caused by inappropriate model parameters/structures and framework API misunderstanding; (3) Current debugging practices are not efficient for fault localization in many cases, and developers need more deep learning specific tools. Based on our findings, we further suggest possible research topics and tooling support that could facilitate future deep learning development.

Funding

ARC

DP200102940

History

Source title

Proceedings of 2020 ACM/IEEE 42nd International Conference on Software Engineering (ICSE)

Name of conference

2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE)

Location

Seoul, South Korea

Start date

2020-06-27

End date

2020-07-19

Pagination

1159-1170

Publisher

Association for Computing Machinery

Place published

New York

Language

  • en, English

College/Research Centre

Faculty of Engineering and Built Environment

School

School of Electrical Engineering and Computer Science

Rights statement

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Usage metrics

    Publications

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC