Service Level Agreements (SLAs) are the backbone of IT management. The structuring of IT into service delivery and service management organizations under ITIL/ITSM has made the SLA a primary method of setting expectations for services. SLAs are useful for estimating costs and budgets, and for justifying and controlling the frequently high costs of IT services.
GroundWork provides a simple, straightforward way to integrate and monitor your SLAs through a series of applications, some of which are included in the core version, with a few additional ones in the commercially licensed enterprise version of GroundWork Monitor. These include Business Service Monitoring (BSM), SLAs, SLA Dashboards, and SLA Reports.
SLAs in GroundWork Monitor are most useful when combined with Business Service Monitoring (BSM). BSM is included in the core version of GroundWork Monitor, and provides a way to group objects and assign a status to the group based on the state of its members. Users can define the members as hosts, services, or even other BSM service objects. The resulting object is mapped to a simple service in GroundWork Monitor, and can be assigned to any host. Notifications on state changes work through the notification manager NoMa, and are essentially the same as for other non-Nagios services. BSM service objects can be very useful in defining SLAs, in that they can represent the status of complete areas of a business operation with a single service.
SLAs in GroundWork Monitor operate on services in GroundWork Monitor. Note that this is not necessarily the same as a service on a host. For example, a check_http in Nagios is a service, while a BSM service object might be the combined CPU measures of all ESXi hosts in a cluster, with the status of Warning when one CPU is over threshold, and Critical when two or more are. The concept of a service encompasses both of these ideas in the GroundWork Monitor model.
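The ESXi example above can be sketched in a few lines of Python. This is only an illustration of the aggregation idea, not GroundWork code: the function name, the host names, and the 90 percent threshold are all assumptions made for the example.

```python
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def cluster_cpu_status(cpu_percent_by_host, threshold=90.0):
    """Aggregate per-host CPU readings into one status (hypothetical helper).

    Warning when exactly one host is over the threshold,
    Critical when two or more are, OK otherwise.
    """
    over = sum(1 for value in cpu_percent_by_host.values() if value > threshold)
    if over >= 2:
        return Status.CRITICAL
    if over == 1:
        return Status.WARNING
    return Status.OK


# Hypothetical readings for a three-host cluster.
readings = {"esxi-01": 42.0, "esxi-02": 95.5, "esxi-03": 61.0}
print(cluster_cpu_status(readings).name)  # WARNING
```

The single returned status is what GroundWork Monitor would treat as "the service" for SLA purposes, regardless of how many hosts sit behind it.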
Using SLAs and BSM in GroundWork Monitor
The first thing you need to determine is what the membership of your SLA service should be. Which monitored components make up the business service you want to report on? Of course you can just use single normal services, but that would make for a narrow SLA. It's better to have a BSM service object, with state determined by several members.
For this reason, it's a good idea to consider the services you need to implement when designing your monitoring system, and make sure you have the elements you need to support SLAs already in place when you start to create them. Typical things to consider as member services are:
- Web transaction monitors (end-user access to web portals)
- Clustered resources like web servers behind load balancers
- Connectivity checks (pings to critical routers, port availability on key applications)
- Application health measures
- Any aggregate service that represents availability, for example the CPU utilization across all ESX hosts in a cluster, as explained above
Once you identify what the membership should be, and make sure the members are implemented in your monitoring, you will need to create the BSM object to represent your SLA. This is a little bit involved, since you have several options that determine when the object will change state, and thus affect whether you meet the terms of your SLA.
The first option is the states to consider a problem:
These options appear when you click Show States while configuring BSM service objects. See How to configure BSM groups for detailed instructions.
Which states you select depends on your actual written SLA document. Selecting ACKNOWLEDGED states as affecting SLA status might or might not be something you want to do, for example. Is it an SLA violation if someone acknowledges the problem? What if it's in downtime? You may need to discuss and negotiate these terms with your customers (internal or external), and get sign-off on the precise configuration you decide to use.
Also on this page are the thresholds, which can be given as percentages or as absolute numbers. The guideline here is that you should use percentages if the membership fluctuates, for example when virtual machines come and go in a given service group. Only the states that are considered problems contribute to the thresholds. Host groups can't (and should not) be used as ways to group objects for BSM, and therefore SLAs, because they contain all service states and host states, and are too broad to be meaningful.
Finally, there is the essential classification of a member:
If a member is Essential, the BSM service object will be in a Critical state if it has a problem listed in the states to consider a problem, regardless of the state of the other members.
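To make the interaction of these three options concrete, here is a minimal sketch of how a BSM state could be derived from problem states, thresholds, and the Essential flag. It is not the GroundWork implementation. The `Member` class, the `bsm_state` function, and the choice of "greater than or equal" for threshold comparison are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    state: str               # e.g. "OK", "WARNING", "CRITICAL", "ACKNOWLEDGED"
    essential: bool = False


def bsm_state(members, problem_states, warning_at, critical_at, use_percent=True):
    """Derive a group state from its members (hypothetical logic)."""
    problems = [m for m in members if m.state in problem_states]

    # An Essential member with a problem forces Critical,
    # regardless of the other members.
    if any(m.essential for m in problems):
        return "CRITICAL"
    if not members:
        return "OK"

    # Thresholds are either a percentage of members or an absolute count.
    if use_percent:
        measure = 100.0 * len(problems) / len(members)
    else:
        measure = len(problems)

    # Assumption: a threshold is reached at "greater than or equal".
    if measure >= critical_at:
        return "CRITICAL"
    if measure >= warning_at:
        return "WARNING"
    return "OK"


group = [
    Member("web-01", "OK"),
    Member("web-02", "CRITICAL"),
    Member("web-03", "OK"),
    Member("web-04", "OK"),
]
# One of four members (25 percent) has a problem.
print(bsm_state(group, {"WARNING", "CRITICAL"}, warning_at=25, critical_at=50))
```

With percentage thresholds the same configuration keeps working when a fifth or sixth member is added, which is the reason for preferring percentages when membership fluctuates.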
Using these methods, you can construct your BSM objects to use in your SLAs.
SLA operation schedule and contract
SLAs are composed of a schedule and a contract. When you create an SLA in GroundWork Monitor, you are actually creating an operation schedule. The simplest schedules are 24x7x365, but many businesses do not need or want to incur the expense of maintaining an always-on schedule for all systems. Maintenance windows are a requirement, and the SLA allows you to build them in.
From the SLA tab you can create SLAs, and separately you can define Holidays and Calendar options. You will then select the Calendar and Operation Time to use in the SLA definition. See How to configure SLAs for a detailed description.
- Holidays - Same day and moving holidays for use within the Calendar definition.
- Calendar - Incorporates predefined holidays for use within the SLA definition.
- Operation Time - Agreed days and times of operation for use within the SLA definition.
- Contract - An agreement of service which contains a measured item, target availability, and SLA.
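The relationship between Holidays, Calendar, and Operation Time can be pictured as a simple check: is a given moment inside the agreed hours of operation? The sketch below assumes a Monday to Friday, 08:00 to 18:00 schedule and a single holiday. The data layout and the `in_operation_time` function are hypothetical and do not reflect how GroundWork stores these definitions.

```python
from datetime import date, datetime, time

# Hypothetical holiday list (would come from the Calendar definition).
HOLIDAYS = {date(2024, 12, 25)}

# Hypothetical operation time: weekday number (Monday = 0) -> (start, end).
OPERATION_TIME = {day: (time(8, 0), time(18, 0)) for day in range(5)}


def in_operation_time(moment, operation_time=OPERATION_TIME, holidays=HOLIDAYS):
    """Return True if the moment falls inside the agreed operation time."""
    if moment.date() in holidays:
        return False
    window = operation_time.get(moment.weekday())
    if window is None:          # no operation agreed for this weekday
        return False
    start, end = window
    return start <= moment.time() < end


print(in_operation_time(datetime(2024, 12, 24, 9, 0)))   # a Tuesday morning
print(in_operation_time(datetime(2024, 12, 25, 9, 0)))   # the holiday
```

Only outages that overlap operation time would count toward the SLA under this model, which is why the schedule has to match the written agreement.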
Along with defining SLA contracts, an administrator can schedule Downtime using SLA contracts, which directly affects SLA reporting. SLA reporting provides information on the availability of monitored elements to ensure quality of service, and scheduled system maintenance downtime is something you probably wouldn't want to include in the overall calculations of a report. The SLA Reports option allows you to produce two types of service level agreement reports: a website report to view directly, or an XML file report to download.
The Rules field in the SLA definition is optional. You can use rules to combine the conditions that you want to affect the SLA status, or to mask small outages so that they do not affect the SLA unless they exceed a threshold in number or duration. Sometimes there are failures that can be classified with rules. For example:
- A failure lasting less than 5 min can be classified as not SLA relevant
- Failures lasting less than 5 min are classified as non-SLA relevant, but if the sum of these short failures is greater than 10 min, the failures are then classified as SLA relevant
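The second case above can be expressed as a short Python sketch. The function name and its parameters are made up for illustration, and the logic only mirrors the wording of the two bullets. It is not how GroundWork evaluates rules internally.

```python
def sla_relevant_outages(outage_seconds, short_limit=300, short_total_limit=600):
    """Return the outage durations (in seconds) that count against the SLA.

    Outages shorter than short_limit are ignored, unless the sum of all
    such short outages exceeds short_total_limit, in which case they all count.
    """
    short_total = sum(d for d in outage_seconds if d < short_limit)
    if short_total > short_total_limit:
        return list(outage_seconds)
    return [d for d in outage_seconds if d >= short_limit]


# Two short outages (320 s in total) and one long outage: only the long one counts.
print(sla_relevant_outages([120, 200, 900]))   # [900]

# Three 4-minute outages add up to 12 minutes, so all become SLA relevant.
print(sla_relevant_outages([240, 240, 240]))   # [240, 240, 240]
```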
Rules are composed of three elements: a Condition, a Type for that condition, and a Value. Generally, if the condition is met, the state indicated by "then" is returned.
You may define several conditions. The possible condition types are "duration" and "count", which are compared using the <, <=, >, or >= operators. The value to compare the duration or count to is a positive integer, the meaning of which varies by Condition Type (seconds for duration, and count is just a number). There is an operator for combining conditions (only AND is supported at this time).
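As a rough illustration of this structure, the sketch below models a rule as a list of conditions joined with AND, plus a "then" status. The `Condition` class, the field names, and the `apply_rule` function are assumptions made for the example and do not describe the format GroundWork actually stores. How duration and count are measured for a series of outages is left to the caller.

```python
import operator
from dataclasses import dataclass

# The four comparison operators named in the text.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class Condition:
    type: str    # "duration" (seconds) or "count"
    op: str      # one of the keys in OPERATORS
    value: int   # positive integer


def rule_matches(conditions, duration_seconds, count):
    """All conditions must hold, since AND is the only combining operator."""
    observed = {"duration": duration_seconds, "count": count}
    return all(OPERATORS[c.op](observed[c.type], c.value) for c in conditions)


def apply_rule(conditions, then_status, original_status, duration_seconds, count):
    """Return the 'then' status if the rule matches, else keep the original."""
    if rule_matches(conditions, duration_seconds, count):
        return then_status
    return original_status


# Hypothetical rule: up to 5 failures, each under 5 minutes, count as OK.
short_outages = [Condition("duration", "<", 300), Condition("count", "<=", 5)]
print(apply_rule(short_outages, "OK", "CRITICAL", duration_seconds=120, count=3))
print(apply_rule(short_outages, "OK", "CRITICAL", duration_seconds=120, count=6))
```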
Rules are captured with the SLA and stored in the SLA object as JSON. Examples:
Non-SLA relevant outages: an outage shorter than 5 minutes is not SLA relevant (status = OK).
Example 1: Up to 5 short outages
Classify a series of up to 5 failures, each under 5 minutes, as not SLA relevant (status = OK).
Example 2: One short failure as "PARTIAL CRITICAL"
Classify a failure of under 5 minutes as a custom status, like PARTIAL CRITICAL. Note that this SLA object status is an extension to the possible statuses of regular services or BSM service objects in GroundWork.
Example 3: Combined outage
You can also classify a series of short outages as PARTIAL CRITICAL based on the total combined outage duration.
You can then connect the SLA to the service you created using a Contract. This is simply a link between the service and the combination of calendar and hours that the service should be operating, plus a percentage of that time in which it should be up.
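The arithmetic behind a contract can be sketched as follows. This is an assumption about how such a figure is typically computed, not a description of the GroundWork report engine: scheduled downtime is removed from the measured period, and the remaining outage time is compared against the target percentage. The function names and all numbers are hypothetical.

```python
def availability_percent(operation_seconds, outage_seconds, scheduled_downtime_seconds=0):
    """Percentage of the measured operation time during which the service was up.

    Assumption: scheduled downtime is excluded from the measured period
    rather than counted as an outage.
    """
    measured = operation_seconds - scheduled_downtime_seconds
    if measured <= 0:
        raise ValueError("no measurable operation time in this period")
    return 100.0 * (measured - outage_seconds) / measured


def meets_target(availability, target_percent):
    """True if the achieved availability satisfies the contract target."""
    return availability >= target_percent


# Hypothetical week: 5 days x 10 hours of operation, 1 hour of scheduled
# maintenance, and 1764 seconds of SLA-relevant outage.
week = 5 * 10 * 3600
achieved = availability_percent(week, outage_seconds=1764, scheduled_downtime_seconds=3600)
print(round(achieved, 3), meets_target(achieved, 99.5))
```

In this example the service was up for 99 percent of the measured time, so a contract with a 99.5 percent target would be reported as missed.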
Contracts can have an Alias as well as a name, and can be assigned to a Customer, which helps in reporting and in Dashboards.
Once you have a Contract, you are monitoring the SLA. When the data is available, you can generate SLA reports, and display the SLA on dashboards you create using a variety of objects.
To display your various SLAs on dashboards, you need to make a few choices. How do you want your dashboard to look? Should there be a map on the background? Your logo? A color panel? Text elements to explain what you are looking at?
You can show an SLA as a tachometer dial, a pie chart, a timeline, and other options. See How to configure SLA Dashboards for a detailed description.
To add elements to the screen, expand the menu on the left and drag elements onto the dashboard one by one. The configuration dialog will open, and you will select the SLA or other element properties, and configure them. Depending on the widget, you can adjust the display characteristics, the period, and the refresh time.
As these are easy to experiment with, and the meanings are obvious, you should simply explore them and find out which ones you wish to use.
There are status elements as well, and if you are using BSM service objects, this is a good place to show their status. You might also want to put a web transaction status here, or perhaps a custom plugin that displays the text you need to have on your dashboard.
The objects have order (one can go over the other), so you can use a graphical object such as an image or text object to annotate your display and provide a background.
There is also a Publish status tool, used to make service information available outside of GroundWork Monitor. See How to publish status for use outside GroundWork.
SLAs as reporting objects
You can also edit past events and set downtimes that are SLA relevant. If, for example, you missed the start of an outage and you know the time it actually started, you can add downtime to the SLA contract after the fact. You can also remove or edit false positive events where the monitoring reported a problem, but the resources were not actually down.
Once your SLAs are established, edited and on dashboards, you can also export the data as CSV files, or just run a graphical report for the period you want. The slareports database is available for direct SQL queries as well, and can be used to create custom reports in a report designer.