Open Research Newcastle

Outage Prediction and Diagnosis for Cloud Service Systems

conference contribution
posted on 2025-05-09, 20:06 authored by Yujun Chen, Hongyu Zhang, Zhangwei Xu, Yingnong Dang, Xian Yang, Qingwei Lin, Hang Dong, Yong Xu, Hao Li, Yu Kang, Feng Gao
With the rapid growth of cloud service systems and their increasing complexity, service failures have become unavoidable. Outages, which are critical service failures, can dramatically degrade system availability and impact user experience. To minimize service downtime and ensure high system availability, we develop an intelligent outage management approach, called AirAlert, which can forecast outages before they happen and diagnose their root causes after they occur. AirAlert works as a global watcher for the entire cloud system: it collects all alerting signals, detects dependencies among signals, and proactively predicts outages that may happen anywhere in the cloud system. We analyze the relationships between outages and alerting signals using a Bayesian network and predict outages using a robust classification method based on gradient boosting trees. The proposed outage management approach is evaluated on an outage dataset collected from a Microsoft cloud system, and the results confirm the effectiveness of the approach.

Source title

Proceedings of The Web Conference 2019

Name of conference

WWW '19: The Web Conference

Location

San Francisco, CA

Start date

2019-05-13

End date

2019-05-17

Pagination

2659-2665

Publisher

Association for Computing Machinery

Place published

New York, NY

Language

  • en, English

College/Research Centre

Faculty of Engineering and Built Environment

School

School of Electrical Engineering and Computer Science

Rights statement

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) (http://creativecommons.org/licenses/by/4.0/) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
