Top Challenges in today’s IT Ops ecosystems: Part 1

May 24, 2021
Francesco Paola

AI is top of mind for every CIO and her team. The promise of artificial intelligence with machine learning models to analyze ever increasing data volumes and diverse data types, and to proactively resolve customer inquiries and system alerts is prompting IT leaders to invest in platforms that leverage the investment to date in their IT infrastructure and deliver on the promise of lower total cost of ownership, increased customer satisfaction and enhanced bottom lines.

However, these benefits have been slow to materialize. Today’s distributed IT environments, a mix of physical and virtual applications and infrastructure, with higher levels of automation, massive amounts of diverse data and siloed platforms, compounded with customer access across multiple channels (voice, text, chat, social, online), can strain even the best organized service desks and network operations centers (NOCs).

Here we analyze the top challenges facing today’s Service Desk and IT support ecosystems and provide guidance on how to rapidly and efficiently address these challenges by intelligently deploying AIOps solutions, positioning the organization for scale.

 Alert Storms: More data points do not necessarily mean a better managed IT estate  

Let’s start with the process from when alerts are generated to when tickets are created – the “alert to ticket” challenges.

The most prevalent IT Ops challenge is alert storms; it is the primary manifestation of the “too much data” problem in IT support. The uncontrolled generation of alerts overwhelms the support desk, causing the most critical alerts to be missed or to be processed after significant delays, and impacting the ability of support teams to do their job – they are too busy being inundated with alerts. The delays cause unnecessary system downtime and potential service interruptions, impacting revenue and customer satisfaction.

Take, as an example, a sell-side ad-trading platform that relies on uninterrupted and streamlined network connectivity to generate revenue. As trading volumes increase, server capacity warning thresholds are breached, but the system still functions within its SLAs and performance is not impacted. At the same time, a critical network slowdown occurs, directly impacting trade volumes and therefore revenue. Which alerts should the support team be working on? If the server threshold alerts are allowed to inundate the NOC, the critical network alerts may be lost, directly impacting the ad platform’s revenue.

But why does this happen?

There are several reasons. Some have to do with monitoring platforms taking an alert-based view of the world rather than a business or customer view (what is important to a specific customer); others stem from poor implementation of the platforms.

One common occurrence is that existing monitoring platforms are reactive to alerts and inquiries rather than proactive: they simply pass on the alerts instead of correlating them with similar or related alerts and proactively providing a recommended resolution to the rep. For example, procurement and supply chain staff may be unable to access their inventory management system at the same time that operational line staff cannot access the ERP platform. The monitoring platform should be able to correlate these like inquiries and determine the root cause of the failures, rather than leaving that work to the reps.
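To make the idea of correlation concrete, here is a minimal sketch in Python. It assumes a hypothetical dependency map (in practice this would come from a CMDB or service map) and groups alerts that share an upstream component and arrive within a few minutes of each other. The system names, the `DEPENDS_ON` map and the five-minute window are all illustrative assumptions, not features of any particular monitoring product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # system that raised the alert or inquiry
    timestamp: float  # seconds since some reference point
    message: str


# Hypothetical dependency map: the shared component each system relies on.
DEPENDS_ON = {
    "inventory-mgmt": "db-cluster-1",
    "erp": "db-cluster-1",
    "hris": "vm-host-7",
}


def correlate(alerts, window_seconds=300.0):
    """Group alerts that share an upstream dependency and arrive close together."""
    by_dependency = defaultdict(list)
    for alert in alerts:
        # Fall back to the source itself when no dependency is known.
        by_dependency[DEPENDS_ON.get(alert.source, alert.source)].append(alert)

    groups = []
    for dependency, items in by_dependency.items():
        items.sort(key=lambda a: a.timestamp)
        current = [items[0]]
        for alert in items[1:]:
            if alert.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alert)
            else:
                # Gap too large: close this group and start a new one.
                groups.append((dependency, current))
                current = [alert]
        groups.append((dependency, current))
    return groups
```

With this sketch, the inventory management and ERP inquiries from the example above would land in one group keyed by their shared dependency, giving the rep a single candidate root cause to investigate instead of two unrelated tickets.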

Another reason is that the monitoring platform may have been poorly implemented. For example, it may be poorly instrumented: physical (or virtual) hardware assets such as servers, routers, and firewalls are not tagged appropriately and are not integrated well with the CMDB, or the CMDB is not kept up to date or configured properly.

Classification and triaging of alerts

Classification and categorization of alerts seem to be an afterthought in some IT organizations. For example, service level agreements are not applied correctly, or at all, to alerts. As a result, critical alerts for inaccessible systems, network congestion, malware detection or other security breaches get lost in the noise while the sheer volume overwhelms the IT support team.

Finally, the monitoring and alerting platform itself may be sub-par or outdated. It does not enable automated triage, causality analysis and categorization of alerts prior to ticket generation, inundating the service desk and NOC with the unmanageable volumes discussed earlier.
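As a rough illustration of what triage before ticket generation could look like, the Python sketch below orders alerts so that SLA-breaching and high-severity alerts surface first, and only turns the important ones into tickets. The severity labels, the `sla_breached` flag and the `to_tickets` cutoff are assumptions made for this example; a real platform would derive them from its own SLA and classification rules.

```python
from dataclasses import dataclass

# Hypothetical severity ranks: a lower number means more urgent.
SEVERITY_RANK = {"critical": 0, "major": 1, "warning": 2, "info": 3}
UNKNOWN_RANK = len(SEVERITY_RANK)  # unclassified alerts sort last


@dataclass
class Alert:
    name: str
    severity: str
    sla_breached: bool  # True when the affected service is outside its SLA


def triage(alerts):
    """Order alerts so SLA-breaching and high-severity ones come first."""
    return sorted(
        alerts,
        key=lambda a: (not a.sla_breached, SEVERITY_RANK.get(a.severity, UNKNOWN_RANK)),
    )


def to_tickets(alerts, max_severity="major"):
    """Keep only alerts that warrant a ticket; the rest can stay on a dashboard."""
    cutoff = SEVERITY_RANK[max_severity]
    return [
        a for a in triage(alerts)
        if a.sla_breached or SEVERITY_RANK.get(a.severity, UNKNOWN_RANK) <= cutoff
    ]
```

Applied to the ad-trading example, a flood of server capacity warnings (within SLA) would be held back, while the critical network slowdown would be the first, and possibly only, ticket the NOC sees.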

In the ad-trading example above, the system was poorly instrumented, alerts were not properly classified or categorized, and the monitoring tool had no correlation capabilities. This shifted the analytical onus to the reps and directly impacted revenue.

 Siloed Monitoring Solutions: A modern retelling of the elephant and blindfolded men tale  

A second common issue is the proliferation of siloed monitoring solutions. As IT departments invest in new and improved technology platforms to keep up with demand, more often than not they do so with point solutions that bring the promise of quick and easy deployment and integration across the enterprise. Sorry to disappoint you, but enterprise IT is not known for visionary investments. It’s more about “how can I solve this problem now?”, an approach that incurs technical debt.

This lack of integration compounds the data issue in more ways than one. With no unified platform to bring it all together, these point solutions don’t provide alert and ticket correlation, exacerbating the challenge of managing and processing massive volumes of data that live in siloed, disparate systems. For example, in many support organizations the Service Desk is a separate entity from the NOC, using different systems of engagement: the Service Desk may use Front or Zendesk to manage its inquiry queue, while the NOC may receive alerts from an APM like AppDynamics and use an enterprise ITSM platform like ServiceNow.

Connecting the dots: Context and correlation

Tickets (both user-generated and machine- or alert-generated) are received in the support queue with little or no context. The rep is left to their own devices to correlate tickets and research possible resolution options, which delays resolving the issues and closing the tickets, extends troubleshooting time and raises the cost per ticket. For example, in a CPG manufacturing organization, a simple “user cannot access the MRP system” ticket, with no indication of whether it’s a network issue, a server issue, a system overload, a denial of service (DoS) or another system issue, will take up unnecessary research time by the rep and prolong the resolution time, potentially delaying the manufacturing of the good and causing stockouts at the retail level.

So when the Service Desk receives an inquiry that is caused by a systems issue, for example, a user is unable to access their benefits and payroll information in the HRIS platform (because, say, the virtual server hosting the instance has been inadvertently taken offline), the Service Desk may not have the requisite systems visibility and simply passes the ticket on to the NOC with little to no context, forcing the NOC rep to research the issue from square one.
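A small Python sketch shows how context could be attached to a ticket before it is handed over. The `CMDB` and `ASSET_STATUS` dictionaries are hypothetical in-memory stand-ins for a real CMDB and a monitoring status feed, and the field names are invented for illustration only.

```python
# Hypothetical in-memory stand-ins for a CMDB and a monitoring status feed.
CMDB = {
    "hris": {"hosted_on": "vm-hris-01", "owner": "NOC"},
}
ASSET_STATUS = {
    "vm-hris-01": "offline",
}


def enrich_ticket(ticket):
    """Return a copy of the ticket with infrastructure context attached."""
    enriched = dict(ticket)  # leave the original ticket untouched
    record = CMDB.get(ticket.get("system"))
    if record is None:
        enriched["context"] = "no CMDB record found"
        return enriched
    host = record["hosted_on"]
    status = ASSET_STATUS.get(host, "unknown")
    enriched["context"] = f"{ticket['system']} runs on {host}, currently {status}"
    enriched["route_to"] = record["owner"]
    return enriched
```

In the HRIS example, the ticket would reach the NOC already annotated with the hosting server and its offline status, so the NOC rep does not have to start from square one.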

 Issue Resolution: The forgotten raison d'être of monitoring and alerting  

Having sorted out issues with alert storms and integrated monitoring, the challenges pertaining to ticket resolution remain. What happens once an alert is converted to a ticket? What functions does the IT support desk have to perform in order to expeditiously resolve the issue?

As a support engineer tries to determine the root cause of the inquiry, disparate knowledge bases containing SOPs, service level agreements, contracts and system configurations force the rep to access multiple systems and configuration files and manually determine the appropriate resolution, as opposed to having the system of engagement recommend one or more possible paths. Unless the process has been automated, the support engineer may have difficulty accessing system data, for example log and configuration files. This delays the ability to extract insights about the underlying IT systems, monitor operational and usage statistics, and proactively solve application performance problems.

Knowledge Scatter: Finding a needle across multiple haystacks

As the support engineer is left to their own devices to research the issue, the challenges are compounded by disparate systems and disparate data sources. This is the same problem highlighted above, but this time the nature of modern IT infrastructure, with its mix of physical and virtual environments, combined with large and disparate data sources that may not be up to date, makes it challenging for individual reps to quickly identify the right resolution to the inquiry or alert.

Human in the loop: Is full automation achievable, or even the goal?

In addition, the mantra of “automate everything” has permeated many an IT organization, which is good. But in many cases the execution of the automation requires human intervention: the platforms are not sophisticated enough, and hence not trusted enough, so a human has to trigger the automation script, delaying the resolution. This is what we refer to as augmented automation.

Closing the loop: Moving from tribal knowledge to institutional knowledge

Finally, once the issue is cleared and the ticket is closed, the resolution is not necessarily institutionalized: in reality, it’s not a completed job unless the resolution is correlated with the ticket and like tickets, and that information is stored and made accessible for the next rep. It is difficult for these disparate systems to learn, i.e., there is little to no institutional memory that can be leveraged to continually optimize the service.

This issue is especially critical where processes in IT support organizations are people dependent: if there is high churn, the institutional memory walks out the door, and you’re back at the starting point. Sure, you can train the rep to manually update the knowledge base, but will they do it, and will they have time if they are inundated with alerts and tickets?
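To illustrate what “closing the loop” could mean in practice, here is a deliberately simple Python sketch: every closed ticket is stored with its resolution, and a new ticket is matched against past ones by word overlap (Jaccard similarity). The `ResolutionStore` class, its method names and the 0.3 similarity threshold are assumptions made for this example; a production system would likely use a far richer matching technique.

```python
import re


def tokens(text):
    """Lowercase word set used for a crude similarity measure."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ResolutionStore:
    """Keeps closed tickets with their resolutions so like tickets can reuse them."""

    def __init__(self):
        self._records = []  # list of (token set, summary, resolution)

    def close_ticket(self, summary, resolution):
        """Record the resolution at the moment the ticket is closed."""
        self._records.append((tokens(summary), summary, resolution))

    def suggest(self, summary, min_similarity=0.3):
        """Return past resolutions ranked by Jaccard similarity to the new summary."""
        query = tokens(summary)
        scored = []
        for past_tokens, past_summary, resolution in self._records:
            union = query | past_tokens
            score = len(query & past_tokens) / len(union) if union else 0.0
            if score >= min_similarity:
                scored.append((score, past_summary, resolution))
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored
```

The point of the sketch is that the knowledge is captured as a side effect of closing the ticket, rather than relying on a busy rep to update a knowledge base by hand.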

 What next?  

The issues with the current state of IT Ops are well understood and have been addressed to varying degrees with traditional process improvement methods and tools. Does machine learning hold some answers that will deliver a step change, rather than an incremental one, to the alerting and monitoring process?

Look out for the next part of our blog on how machine learning can indeed be that vector of change.  
