
How a Multinational Telecommunications Company Developed AI-Enabled Service Failure Prediction

Service downtime and failure, which can lead to significant revenue loss, are especially prevalent among telcos. Machine learning can address this problem by predicting what might cause service failures and preventing them before they happen.

What Constitutes a Service Failure?

The center of excellence in the company's global IT operations division comprises ML engineers and data scientists who handle data onboarding, pipelining, modeling, and building reusable frameworks. According to the team, service failure isn't easily encapsulated by "service down." Most systems fail through a series of degradation steps rather than simply switching off like an outage. The team started by using HTTP 500 responses (returned from the server to the customer) to define failure, but soon realized that an error response did not always indicate a service problem; sometimes it was a user problem.
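The distinction the team ran into maps onto the HTTP status code ranges: 5xx responses indicate server-side failures, while 4xx responses usually indicate client-side (user) problems. A minimal illustrative sketch of that filter, not the company's actual detection logic:

```python
# Illustrative sketch: separating server-side failures (HTTP 5xx) from
# client-side problems (HTTP 4xx), which are often user errors rather
# than genuine service degradation. Not the company's actual logic.

def is_server_failure(status_code: int) -> bool:
    """HTTP 5xx responses indicate a problem on the service side."""
    return 500 <= status_code <= 599

def is_client_error(status_code: int) -> bool:
    """HTTP 4xx responses usually indicate a user/client problem."""
    return 400 <= status_code <= 499

def failure_rate(status_codes: list[int]) -> float:
    """Share of responses that point to genuine service failure."""
    if not status_codes:
        return 0.0
    failures = sum(1 for c in status_codes if is_server_failure(c))
    return failures / len(status_codes)
```

For example, `failure_rate([200, 500, 404, 503, 200])` returns 0.4: the 404 is counted as a user problem, not a service failure.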

Alongside the technical challenge of training and deploying complex models, securing the right collaboration with the operations subject matter experts (SMEs) is imperative. In the service failure prediction use case, the SMEs define the "failures" they want to predict, which is a challenging task because failures differ from architecture to architecture and from solution to solution.

Using Dataiku, the data team was able to put potential failure characteristics into a catalog, making it more efficient for the operations SMEs to identify and clarify what "failure" or "degradation" means. Eventually, the team made the collaboration process efficient enough that SMEs could define 20 models in hours, all just to pin down the meaning of "failure," which is precisely what is being predicted in this use case.
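A catalog of potential failure characteristics like the one described above could be modeled as structured entries that SMEs review and label. A minimal sketch; the field names and example values here are assumptions for illustration, not the team's actual schema:

```python
from dataclasses import dataclass

# Hypothetical catalog entry for a potential failure indicator.
# Field names are illustrative assumptions, not the team's schema.
@dataclass
class FailureIndicator:
    component: str            # IT component the signal comes from
    signal: str               # e.g., "http_5xx_rate", "latency_p99_ms"
    threshold: float          # value above which degradation is suspected
    is_failure: bool = False  # set by the SME after review
    notes: str = ""           # SME clarification of what "failure" means

# Invented example entries for two hypothetical components.
catalog: list[FailureIndicator] = [
    FailureIndicator("billing-api", "http_5xx_rate", 0.05),
    FailureIndicator("auth-service", "latency_p99_ms", 1500.0),
]

# SMEs then walk the catalog and mark which indicators define failure:
catalog[0].is_failure = True
catalog[0].notes = "Sustained 5xx rate above 5% counts as degradation."
```

Keeping the SME judgment (`is_failure`, `notes`) separate from the raw signal definition is what lets the same catalog be reused across architectures.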

The team knew they wanted this to become a self-service initiative over time. To make that a reality, the IT operations managers and SMEs in the global IT division have access to the reusable frameworks from the central product team. Now, they can extract insights autonomously and collaborate with the technical experts to establish a scalable solution for defining (and predicting) a failure.

Results: Reduced Service Failures, Faster Model Development, & More

Before using Dataiku, the data scientists performed manual feature engineering for each model, which was very time-consuming. They then adopted deep learning approaches and found better accuracy, which also freed up time that data scientists could spend on other high-priority projects. A deep learning model in Dataiku can now take less than 20 minutes to train on months of data. The catalog of indicators described above did a lot of the heavy lifting in helping the team identify the right failures to go after, with input from SMEs balancing failure rate against impact.

The team can now generate models for about 20 components in less than a month, covering data preprocessing, data transfer from logs, modeling, automation, and testing. The models and the production environment are monitored with Dataiku, and the company also has a business layer of monitoring, ensuring that the actionable data created for the business owners of each IT service is useful and understandable.

Additionally, the team has seen:

  • Accelerated speed to market for model development, from the moment data becomes available to deployment into the live environment: previously, one model took the team six months; with Dataiku, they can now produce 40 models within six weeks (meaning 40 IT components are now monitored in this innovative way) 
  • Reduced MTTR (mean time to resolution/restore), which enables the team to fix failures faster 
  • Increased service availability 
  • Reduced P1 service failures and enabled a quicker intervention time (i.e., the average major incident is now predicted 50 minutes in advance, giving the business time to proactively address it)  
  • Time saved across the end-to-end failure process, freeing the team from manually monitoring service failures so they can focus on IT operations priorities and investigate further preventative measures
  • Greater agility upon leveraging the power of Dataiku and the cloud in unison (without having to wait for on-prem infrastructure) and, therefore, the ability to build models in a more extensible way 
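MTTR, one of the metrics cited above, is simply the average time from incident detection to restoration. A minimal sketch of the computation, with invented timestamps for illustration:

```python
from datetime import datetime, timedelta

def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolution: average of (restored - detected)."""
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)

# Invented example incidents: (detected, restored) timestamp pairs.
incidents = [
    (datetime(2023, 1, 1, 9, 0), datetime(2023, 1, 1, 9, 40)),   # 40 min
    (datetime(2023, 1, 2, 14, 0), datetime(2023, 1, 2, 15, 20)), # 80 min
]
print(mttr(incidents))  # 1:00:00 (average of 40 and 80 minutes)
```

Predicting an incident 50 minutes in advance, as the team reports, effectively shifts the "detected" timestamp earlier, which is what drives the MTTR reduction.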

The team is looking forward to scaling out the service failure prediction use case and experimenting with auto-diagnostics that offer prescriptive resolutions, so that flagging a service as on its way to failure does not create panic.

Orange: Building a Sustainable Data Practice

Armed with Dataiku, Orange was able to start transitioning smaller BI projects to the business and work on machine learning use cases like call load detection and triage, a model that took less than a month for the team to build using Dataiku.


Go Further:

Making Enterprise AI an Organizational Asset

How can your company become an AI enterprise? Dataiku enables organizations across all industries to embed machine learning methodology into the very core of their business to bring real value.


How DAZN Scaled a Small Data Team using Machine Learning

See how DAZN leveraged Dataiku to enable non-technical staff to perform advanced customer segmentation, content attribution, and churn prediction.


Enabling AI Services Through Operationalization and Self-Service Analytics

Many organizations hoping to become more data-driven ask the question: self-service analytics or data science operationalization, which will get me where I need to be? The answer is: you need both.


5 Ways to Accelerate and De-Risk Business Transformation Through AI

Find out how organizations can move from theory to practice when it comes to using AI-enabled solutions to drive business transformation (including process, digital, management, organizational, and cultural transformation).
