For a Service Desk to provide quality support efficiently and consistently, Event Management must be implemented alongside Incident Management so that technical staff always know what is happening in the environments they administer. In architecture and delivery, we often talk about the “Five 9’s,” the 99.999% uptime guarantee built into many Service Level Agreements (SLAs), especially for hosted solutions offered in “as a service” models. To keep unplanned outages within 0.001% of required availability time (a little over 5 minutes per calendar year; Wikipedia has a really cool chart showing different uptime levels), monitoring systems and alarm and response protocols must be put into place.
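To make the “Five 9’s” arithmetic concrete, here is a minimal sketch in Python that converts an availability percentage into the downtime it allows per year. The function name and the choice of a 365-day year are assumptions made for illustration; an actual SLA may measure availability over a different period or exclude planned maintenance.

```python
# Illustrative sketch only: assumes a 365-day year and counts all downtime
# against the availability target (no maintenance windows excluded).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime, in minutes per year, that an uptime target allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% works out to roughly 5.26 minutes per year, while 99.9% allows about 525.6 minutes (nearly nine hours), which is why each additional nine demands much tighter monitoring and response.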
At Orion, we offer Monitoring as a Service and use a variety of powerful tools to keep our clients’ networks and data systems always healthy, always available, and Always On. Guided by ITIL Event Management, we have developed a series of standard procedures for implementing monitoring solutions based on the client’s specific needs, investigating alarms and warnings discovered through regular and routine monitoring, and responding to critical failures and outages in a rapid, coordinated fashion.
While we do our best as IT Professionals to keep systems healthy and online, it is impossible to eliminate that 0.001% entirely and achieve 100% uptime. Hardware wears out, software faults, code breaks, and people make mistakes. When any of these things happens, it can cause a major outage for systems and users, and it can quickly cost organizations enormous amounts of money. Responding quickly to unplanned failures and working together with our internal resources, clients, vendors, and other support professionals keeps downtime, and therefore lost revenue, to a minimum.
Whenever a mission-critical system is configured for monitoring, a set of response procedures is also put in place for that particular business service. When a system fails, or when maintenance is required, Service Owners are notified and kept informed, Engineering and Service Desk technical staff coordinate response and remediation efforts, and external escalations are centrally managed. Through this group effort, led by the Significant Incident and Critical Situation Team, all parties involved know the current status, what action must be taken to mitigate or remediate the problem, who is responsible for taking that action, and when services are estimated to be restored.
Of course, we are not always able to fix what is broken and sometimes must implement workarounds instead. Stay tuned for more on Problem Management and check out an upcoming post on Business Continuity and Disaster Recovery in my new series: Rethinking Modern Architecture with Microsoft Azure Cloud Solutions.