BSM and SLAs
About BSM and SLAs
Service Level Agreements (SLAs) are the backbone of IT management. The structuring of IT into service delivery and service management organizations under ITIL has made the SLA a primary method of setting expectations for services. SLAs are useful for estimating costs and budgets, and for justifying and controlling the frequently high costs of IT services.
SLAs in GroundWork Monitor are most useful when combined with Business Service Monitoring (BSM). BSM provides a way to group objects and assign status to a group based on the state of its members.
Users can define the members as hosts or services, or even other BSM service objects. The resulting object is mapped to a simple service in GroundWork Monitor, and can be assigned to any host. Notifications on state changes work through the notification manager, GroundWork Messenger, and are essentially the same as for other non-Nagios services. BSM service objects can be very useful in defining SLAs, in that they can represent the status of complete areas of a business operation with a single service.
SLAs in GroundWork Monitor operate on services in GroundWork Monitor. Note this is not necessarily the same as a service on a host. For example, a check_http in Nagios is a service, while a BSM service object might be the combined CPU measures of all ESXi hosts in a cluster, with the status of Warning when one CPU is over threshold, and Critical when two or more are. The concept of a service encompasses both of these ideas in the GroundWork Monitor model.
GroundWork provides a simple, straightforward way to integrate and monitor your SLAs through a series of applications, including BSM, SLAs, SLA Dashboards, and SLA Reports.
Let's take an overall look at how to use BSM and SLAs in GroundWork Monitor. Configuration details can be found in the sections below.
BSM Membership
The first thing you need to determine is what the membership of your SLA service should be. Which monitored components make up the business service you want to report on? Of course you can just use single normal services, but that would make for a narrow SLA. It's better to have a BSM service object, with state determined by several members.
For this reason, it's a good idea to consider the services you need to implement when designing your monitoring system, and make sure you have the elements you need to support SLAs already in place when you start to create them. Typical things to consider as member services are:
- Web transaction monitors (end-user access to web portals)
- Clustered resources like web servers behind load balancers
- Connectivity checks (pings to critical routers, port availability on key applications)
- Application health measures
- Any aggregate service that represents availability, for example the CPU utilization across all ESX hosts in a cluster, as explained above
Once you identify what the membership should be, and make sure the members are implemented in your monitoring, you will need to create the BSM object to represent your SLA. This is a little bit involved, since you have several options that determine when the object will change state, and thus affect whether you meet the terms of your SLA.
The first option is the states to consider a problem. You will see this screen when you click Show States while configuring BSM service objects. For detailed instructions see the section below titled Configuring BSM. Which states you select depends on your actual written SLA document. Selecting ACKNOWLEDGED states as affecting SLA status might or might not be something you want to do, for example. Is it an SLA violation if someone acknowledges the problem? What if it's in downtime? You may need to discuss and negotiate these terms with your customers (internal or external), and get sign-off on the precise configuration you decide to use.
You will also want to set thresholds, which can be in percent or absolute numbers. The guideline here is that you should use percent if the membership fluctuates, for example as members of a service group are added, like virtual machines that come and go in a given service group. The states that are considered problems are the ones that contribute to thresholds. Host groups can't (and should not) be used as ways to group objects for BSM, and therefore SLAs, because they contain all service states and host states, and are too broad to be meaningful.
Finally, there is the essential classification of a member. If a member is Essential, the BSM service object will be in a Critical state if it has a problem listed in the states to consider a problem, regardless of the state of the other members.
Using these methods, you can construct your BSM objects to use in your SLAs.
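The interaction of problem states, thresholds, and the Essential flag can be sketched in a few lines of Python. This is only an illustration of the logic described above, not GroundWork's implementation: the `Member` class, the `bsm_state` function, and the choice to apply thresholds to non-essential members only are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption for this sketch: every state except UP and OK counts as a problem
# unless an explicit set of problem states is passed in.
DEFAULT_NON_PROBLEM_STATES = {"UP", "OK"}


@dataclass
class Member:
    name: str
    state: str            # e.g. "OK", "WARNING", "CRITICAL", "DOWN"
    essential: bool = False


def bsm_state(members, warning, critical, percent=False, problem_states=None):
    """Return "OK", "WARNING" or "CRITICAL" for a group of members."""

    def is_problem(member):
        if problem_states is None:
            return member.state not in DEFAULT_NON_PROBLEM_STATES
        return member.state in problem_states

    # An essential member with a problem makes the whole group Critical.
    if any(m.essential and is_problem(m) for m in members):
        return "CRITICAL"

    # Hypothetical choice: thresholds are evaluated over non-essential members.
    others = [m for m in members if not m.essential]
    problems = sum(1 for m in others if is_problem(m))
    if percent:
        measure = 100.0 * problems / len(others) if others else 0.0
    else:
        measure = problems

    if measure >= critical:
        return "CRITICAL"
    if measure >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    web = [Member("web1", "OK"), Member("web2", "CRITICAL"), Member("web3", "CRITICAL")]
    print(bsm_state(web, warning=2, critical=3))  # two problems reach the Warning threshold
```

Using percent thresholds (`percent=True`) keeps the result stable when the number of members changes, which is why percent is the better choice for fluctuating groups.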
SLA Components
SLAs are composed of a schedule and a contract. When you create an SLA in GroundWork Monitor, you are actually creating an operation schedule. The simplest schedules are 24x7x365, but many businesses do not need to or want to incur the expense of maintaining an always-on schedule for all systems. Maintenance windows are a requirement, and the SLA allows you to build them in.
When defining an SLA you can use existing Holidays or define your own along with Calendar options. You will want to make your selection of a Calendar and Operation time for the SLA definition. For detailed instructions see the section below titled Configuring SLAs.
- Schedule
- Holidays - Same day and moving holidays for use within the Calendar definition.
- Calendar - Incorporates predefined holidays for use within the SLA definition.
- Operation Time - Agreed days and times of operation for use within the SLA definition.
- Contract - An agreement of service which contains a measured item, target availability, and SLA.
Along with defining SLA contracts, an administrator can schedule Downtime using SLA contracts, which directly affects SLA reporting. SLA reporting provides information on the availability of monitored elements to ensure quality of service, and scheduled system maintenance downtime is something you probably wouldn't want to include in the overall calculations of a report. The SLA Reports option allows you to produce two types of service level agreement reports: a website report to view directly, or an XML file type report to download.
You can then connect the SLA to the service you created using a Contract. This is simply a link between the service and the combination of calendar and hours that the service should be operating, plus a percentage of that time in which it should be up. Contracts can have an Alias as well as a name, and can be assigned to a Customer, which helps in reporting and in Dashboards.
Once you have a Contract, you are monitoring the SLA. When the data is available, you can generate SLA reports, and display the SLA on dashboards you create using a variety of objects.
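To make the contract idea concrete, here is a minimal Python sketch of how availability could be measured against a target. It assumes the calendar and operation time have already been expanded into a list of (start, end) operation windows for the reporting period, and that outages do not overlap each other. The function names are hypothetical and this is not GroundWork's reporting code.

```python
from datetime import datetime, timedelta


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Seconds shared by two time intervals (0 if they do not intersect)."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return max(0.0, (end - start).total_seconds())


def availability_percent(operation_windows, outages):
    """Percentage of agreed operation time during which the service was up."""
    total = sum((end - start).total_seconds() for start, end in operation_windows)
    if total == 0:
        return 100.0
    # Only the part of an outage that falls inside an operation window counts.
    down = sum(
        overlap_seconds(w_start, w_end, o_start, o_end)
        for w_start, w_end in operation_windows
        for o_start, o_end in outages
    )
    return 100.0 * (total - down) / total


if __name__ == "__main__":
    # Monday to Friday, 08:00 to 18:00 (2024-01-01 is a Monday).
    windows = []
    for day in range(5):
        start = datetime(2024, 1, 1, 8) + timedelta(days=day)
        windows.append((start, start + timedelta(hours=10)))

    outages = [
        (datetime(2024, 1, 2, 17, 30), datetime(2024, 1, 2, 19, 0)),  # 30 min inside hours
        (datetime(2024, 1, 6, 9, 0), datetime(2024, 1, 6, 12, 0)),    # Saturday, ignored
    ]
    measured = availability_percent(windows, outages)
    target = 99.0
    print(f"{measured:.2f}% measured, target {target}%, met: {measured >= target}")
```

The Saturday outage does not reduce availability because it falls outside the agreed operation time, which is the practical effect of tying a contract to a schedule.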
SLA Dashboards
To display various SLAs on dashboards, you need to make a few choices. How do you want your dashboard to look? Should there be a map on the background? Your logo? A color panel? Text elements to explain what you are looking at? You can show an SLA as a tachometer dial, a pie chart, a timeline, and other options. For detailed instructions see the section below titled SLA Dashboards.
There is also a Publish status tool used to enable service information outside of GroundWork Monitor. See the document Publishing Status under Reporting.
SLAs as Reporting Objects
You can also edit past events and set downtimes that are SLA relevant. If, for example, you missed the start of an outage and you know the time it actually started, you can add downtime to the SLA contract after the fact. You can also remove or edit false positive events where the monitoring reported a problem, but the resources were not actually down.
SLA Reports
Once your SLAs are established, edited and on dashboards, you can also export the data as CSV files, or just run a graphical report for the period you want. The slareports database is available for direct SQL queries as well, and can be used to create custom reports in a report designer. See the document SLA Reports under Reporting.
Implementing BSM
The Business Service Monitoring (BSM) feature provides the capability to represent the condition of business groups, processes and applications according to the combined states of individually monitored hosts and services. The method allows the free combining of hosts, services and/or groups as components of named groups. The combinations can be assigned numerical values to weigh or qualify the member items. In this way you can produce a hierarchical representation of the state of business services independently of the methods used to monitor components. Notifications and state changes of BSM groups are integrated into the GroundWork Monitor system using the notification manager engine and status for further processing. Creating BSM entities and enabling them does not require configuration changes and subsequent commits.
- After signing in to GroundWork Monitor with an admin user who also has the BSM_Admin role, navigate to Configuration > BSM and SLAs > BSM to create a new BSM group. Click Manage groups, then Create.
- Next, you'll need to define what makes up the BSM group and how it should be monitored. Starting with the top section, fill in the fields:
- Enter the Display name of the BSM group. This is visible only in the BSM interface.
- Add a Description to the BSM group. You might need to display what it’s for to your coworkers.
- Optionally add a Note for additional reference, and Info Text describing an optional Info URL which is simply a related URL users can link to.
- Checking the box for Monitoring (shown below) enables monitoring for this BSM group, meaning the results will show up in GroundWork Monitor.
- When the Monitoring box from the previous step is checked, the Host Definition, Service Definition, Hostgroup, Priority, and Status message fields are exposed for entry:
- Host Definition and Service Definition name the host and service to which the BSM check process will send the check results. These can be monitored names or representative business service names.
- Hostgroup Definition is a named host group to which the BSM host will be added. This can be a monitored name or a representative business service name.
- Priority can be set to High, Medium, or Low for each group, which enables only High priority groups to be displayed within the View groups tab.
- The Status message option is used to build your own format for the message in Status. As stated on screen, you can build your format with HTML (inline CSS and JS) with the following tags: [DISPLAY], [DESCRIPTION], [NOTE], [INFOURL], [INFOTEXT], [OUTPUT], [LONGOUTPUT] and [STATUS].
Next, checking the Thresholds box enables thresholds to be set for the group and determines what is reported.
When creating a new BSM group you will need to indicate the Group Members before setting Thresholds. If you are using Thresholds skip to step 7, then return to this step.
The thresholds are set as a count (default) or percent (check box) of problems that must be reached in a group before the state changes to Critical or Warning. For example, with Critical set to 3 problems and Warning set to 2 problems, two problems among the group's members report a Warning state, three report a Critical state, and one reports an OK state.
By selecting Show States you can indicate specific states to be counted as problems. By default all states are selected excluding UP and OK.
- In the bottom left section, use the filters to search, add Available Members to the Group Members on the right, then Save the BSM group. You are selecting the members to be included in the BSM group you are creating. The members' states, essential settings, and any thresholds determine the state of the overall BSM group.
- Available Members:
- Available members are configured hosts, services, service groups, and BSM groups which you can Add to the BSM Group Members column.
- To search Available Members click a radio button for the desired output, and enter a search criteria for the appropriate element (e.g., we want a Service output for the Hosts beginning with the string training).
- Group Members:
- Group members make up what is to be monitored within the BSM group being created, and are populated from the available members on the left.
- Checking the Essential box for a group member enables that member to decide the entire group's state.
- Non-essential members will cluster together. The state is calculated on the threshold settings.
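The Status message format described above is a template in which bracketed tags are replaced with values from the BSM group. The Python sketch below illustrates that kind of substitution so you can preview a format string before entering it. How GroundWork itself performs the replacement is not documented here, so `render_status_message` and the sample values are assumptions for illustration only.

```python
import re

# The tags listed on the BSM configuration screen.
TAGS = ["DISPLAY", "DESCRIPTION", "NOTE", "INFOURL", "INFOTEXT",
        "OUTPUT", "LONGOUTPUT", "STATUS"]

TAG_PATTERN = re.compile(r"\[(" + "|".join(TAGS) + r")\]")


def render_status_message(template, values):
    """Replace each known [TAG] with its value; unknown brackets are left alone."""
    def substitute(match):
        return str(values.get(match.group(1), ""))
    return TAG_PATTERN.sub(substitute, template)


if __name__ == "__main__":
    template = '<b>[DISPLAY]</b>: [STATUS] ([OUTPUT]) <a href="[INFOURL]">[INFOTEXT]</a>'
    values = {
        "DISPLAY": "Web Shop",
        "STATUS": "WARNING",
        "OUTPUT": "2 of 5 members have problems",
        "INFOURL": "https://wiki.example.com/web-shop",  # hypothetical URL
        "INFOTEXT": "Runbook",
    }
    print(render_status_message(template, values))
```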
Implementing SLAs
You'll need to start the configuration of SLAs by establishing holidays that will define a calendar that is relevant to operations. This includes holidays that fall on the same date every year and also moving holidays, which have a different date every year. Holidays are used in Calendars, and Calendars and Operation times are applied within an SLA.
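The difference between the two holiday types is easy to see in code. The Python sketch below builds a holiday set for a year from fixed (month, day) pairs plus one moving holiday, Easter Sunday, computed with the well-known anonymous Gregorian algorithm. It is only an illustration of why moving holidays need separate handling. In GroundWork Monitor you enter holidays through the Holidays tab, and the function names here are hypothetical.

```python
from datetime import date


def easter_sunday(year):
    """Date of Easter Sunday in the Gregorian calendar (anonymous algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def holidays_for_year(year, fixed_holidays):
    """Combine same-day holidays (month, day) with a moving holiday for one year."""
    holidays = {date(year, month, day) for month, day in fixed_holidays}
    holidays.add(easter_sunday(year))
    return holidays


if __name__ == "__main__":
    fixed = [(1, 1), (12, 25)]  # New Year's Day and Christmas, same date every year
    for year in (2024, 2025):
        print(year, sorted(holidays_for_year(year, fixed)))
```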
- After signing in to GroundWork Monitor with an admin user who also has the BSM_Admin role, navigate to Configuration > BSM and SLAs > SLAs.
- Go to the Holidays tab, and select one of the Create buttons. The top one adds a holiday for the same day every year, such as Christmas. The bottom Create button is for a moving holiday, such as Easter. Continue by entering the information for Day, Month, and Description, and click Create. Enter a few for each category, and you can come back and edit.
- Next, the Calendar option enables you to create and manage multiple calendars and events such as established holidays. Go to the Calendar tab, select Create, then enter a Description, and using the Shift and Ctrl keys scroll through the options and select the appropriate repeating holidays and moving holidays for the SLA. Click Create/Save.
- Now we'll focus on the SLA definition, which includes a calendar and an operation time. You could of course use one of the default SLAs. Select the SLA tab, and click Create, then enter a Description, select a Calendar, and select an Operation time for the new SLA. Then, click Create. The SLA will be added to the list of SLAs. To edit an existing SLA, you would start by selecting the pencil icon on the corresponding SLA row.
- Operation times allow you to set a period during which a system should work in a manner acceptable to the operators and users. These times are initially created in the Operation time tab, similar to how holidays and calendars are created.
- You can use Rules for combining conditions that you want to affect the SLA status. For more information see Appendix A: SLA Definition Rules.
- Next are SLA Contracts, which define the items to be measured and their target availability associated with an SLA. Go to the Contracts tab, click Create and enter the various fields described here. The contract items are then associated with a defined SLA, thereby bundling a contract item with holidays and operation time. When finished click Create.
- Name: The title of the contract item
- Alias: An alternative name
- Customer - Name: The name of the SLA customer (internal or external)
- Host Name: Name for the monitored host
- Service Name: Name for the monitored service
- Target Availability: An agreed upon percentage targeting the availability of the measured items, e.g., 99% target for availability
- SLA: The associated SLA for this contract
- Priority: Used for searching and filtering from the list of contracts
- Archived: After this date the contract is hidden and will no longer be processed for reports
Creating SLA Dashboards
The SLA Dashboards feature allows an administrator to configure status and availability dashboards. To display SLAs on dashboards, you need to make a few choices. How do you want your dashboard to look? Should there be a map on the background? Your logo? A color panel? Text elements to explain what you are looking at? You can show an SLA as a tachometer dial, a pie chart, a timeline, and other options. The objects have order (one can go over the other), so you can use graphical objects such as an image or text object to annotate your display and provide a background. There is also a Publish Status tool used to enable service information outside of GroundWork Monitor.
- To create an SLA dashboard first go to Configuration > BSM and SLAs > SLA Dashboards and click Create, and enter a Title and Description for the dashboard. Optionally, by checking the box for show on dashboard carousel you can provide users a glance of each dashboard on a rotating basis through Dashboards > SLA Carousel. Click Create.
- Next, add elements to the dashboard by expanding the main selections on the right and dragging widgets onto the dashboard canvas one by one. For example, expand SLA Elements and drag the Tacho widget to add a tachometer image. A configuration dialog will open for the chosen widget where you'll need to indicate various items including SLAs. The objects have order (one can go over the other), so you can use a graphical object such as an image or text object to annotate your display and provide a background.
- SLA Elements: These are tied to configured SLA contracts, and offer quick graphic views for downtime (used, remaining, max), log list (states), pie chart (run, down times...), report table (similar to pie w/availability), Tachometer (availability in %), timeline (periodic color coded status), and availability timetable.
- Status Elements: Include a list of services widget and a status widget for a host, host group, service, service group, or custom group. If you are using BSM service objects, this is a good place to show their status. You might also want to put a web transaction status here, or perhaps a custom plugin to display the text you need to have on your dashboard.
- Other Elements: Enables the addition of boxes, images, text, or iframes (including Grafana dashboards).
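The downtime figures (used, remaining, max) follow directly from a contract's target availability. As a rough sketch, assuming the maximum is simply the share of operation time not covered by the target, the arithmetic looks like the Python below. The `downtime_budget` function is hypothetical, and the widget's exact calculation may differ.

```python
def downtime_budget(target_percent, operation_seconds, used_seconds):
    """Return max, used and remaining downtime (seconds) for a reporting period."""
    # Assumption: allowed downtime is the part of operation time above the target.
    max_downtime = operation_seconds * (100.0 - target_percent) / 100.0
    return {
        "max": max_downtime,
        "used": used_seconds,
        "remaining": max_downtime - used_seconds,  # negative means the SLA is missed
    }


if __name__ == "__main__":
    # A 99% target over 30 days of 24x7 operation, with one hour of downtime so far.
    period = 30 * 24 * 3600
    budget = downtime_budget(99.0, period, used_seconds=3600)
    for key, seconds in budget.items():
        print(f"{key}: {seconds / 3600:.1f} h")
```

For this example the maximum works out to 7.2 hours, so one hour of downtime leaves 6.2 hours before the target is missed.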
Use Case: Creating a Summary SLA Dashboard
This use case reviews how to use the SLA Dashboard feature to visualize summary data for different monitoring groups and also drilldown to additional information.
We use customers as an example, while in your case you may want to separate the detailed dashboards by function, region, or however you choose. In addition, by using more detailed dashboards, you can:
- Provide a link to the summary view dashboard to those who need a high-level overview of all customers (Team Leads, Managers, etc.), and also provide capability to drilldown to more detailed information.
- Provide a link to the detailed view of a customer’s dashboard to the technicians responsible for day-to-day upkeep of that customer’s infrastructure.
Additionally, you do not have to have an SLA tied to something in order to visualize it on the SLA dashboard. You can show the status of any host, host group, service, service group, or custom group.
Step 1: Creating a detailed dashboard
- Go to Configuration > BSM and SLAs > SLA Dashboards, then click Create.
- Next, provide a Title and Description (e.g., Customer1Status) for the dashboard, then click Create. The dashboard canvas will be displayed on the left side of the screen along with dashboard elements on the right.
- Let's start with adding an image to the dashboard. We use a diagram of Customer 1’s infrastructure, however you can use any PNG image in an SLA dashboard, even multiple images if you choose to do so.
- To add an image, click to expand Other Elements, then drag the Image element onto the dashboard. This will pop up a menu to configure your image.
- Under Upload an image, click Choose File, and upload the image you wish to use, then click Create.
- Once your image is added, you may need to resize it. You can do this by clicking and dragging the corners of the image.
- Since we also want to know the status of each of the systems in this diagram, we need to add a status element by expanding Status Elements, then dragging the element named Status onto the dashboard over one of the systems in the diagram. This will bring up a menu to allow us to assign this status widget to a particular Host, Host Group, Service, Service Group or Custom Group. Also, when selecting Host or Host Group, you can opt to include the state of the services.
- For the purposes of this example, we add one of Customer1’s network devices to the dashboard by searching for it in this menu, checking the box for the host, checking the box to include service status, and clicking Create.
- Once the new status widget is in place we can right-click it then click Display Preference to configure the size and title visibility, as shown in the screenshot below.
- Now that we’ve created our first status widget, we can repeat the process to create more, or clone the one we just created by right-clicking on the icon, and clicking clone. This will create a copy which you can then right-click, click configuration, and change to a different host.
- Once complete, click Save to save your new dashboard, then Cancel to close the editor. This will take you back to the SLA dashboard list. Click on your newly created dashboard to view it, and copy the URL to a text editor. We’ll need this when creating the summary view.
- For reference, here is what our completed detailed dashboard looks like:
Step 2: Creating a summary dashboard
Now, we can begin with our summary dashboard.
- Go to Configuration > BSM and SLAs > SLA Dashboards, then click Create.
- Next, we add the same image we used in our detailed dashboard: Click to expand Other Elements, then drag Image into the dashboard, this will pop up a menu to configure your image. The image will be listed, so this time you can just click the image and then click Create.
- We resize the image to be smaller this time, since this is the summary dashboard. Keep in mind for this summary dashboard, we may later want to add additional summaries for separate hosts, so size this image accordingly. Once you’re satisfied with the size of the image, right-click the image and click Send to back.
- Next, we’ll add a box around the image by expanding Other Elements, dragging the box widget onto the dashboard.
- Here, enter the URL of the detail dashboard we created earlier under the Url section, and check the box to Open in new window, click Create, then size and drag the newly added box widget to surround the summary image just added, it will look something like this:
- Now, drag a status element by expanding Status Elements and dragging a status widget onto the dashboard. For this configuration, select Host Group for connection type, check the Include Service Status box, and select the host group from the list by checking the corresponding box.
- Once created, right-click the widget and click Bring to front. You can also right-click the widget and select Display Preference to show/hide the title, or change the icon size. This will provide an overall status for the hosts and services, and you will be able to hover over it to see more information on what’s wrong, but still at a high level as far as visualization of where the problem is physically on the network. Place this within the box we just added, perhaps the top-right, but it is a matter of preference.
- You can also add text to this dashboard. To add text, expand Other Elements, and drag the Text widget onto the dashboard. You’ll be prompted to add text, and can change the size, color, style of the text, and even add a link if you’d like. For our example purposes, we add a title. Once you’re satisfied with your text, drag the text element within the box.
- Here is what the summary dashboard looks like once completed for this example:
- If a high-level status is something you want more detail on, you can click that part of the dashboard and it will bring up the detailed dashboard.
So, now we’ve covered how to create a detailed dashboard that shows the physical layout of a network, to enable us to not only identify there is a problem and what the problem is, but also where it is. With this, if we have a user with a slow-running query, we can see there are issues with discards along that user's connection path (Access 2 in the example network), and begin addressing that as the possible root cause very quickly.
Step 3: Adding additional summary dashboards
Not everything is a physical dependency of course, and while we should show logical dependencies in a dashboard, it is usually more appropriate to see that in a summary view, and get the detail of the state of those logical elements from the Status Summary dashboard instead.
Let’s take for example, a web application. Many web applications require a few things to be in good condition in order for full functionality:
- A database (for this, my service checks are 2 queries against the MySQL database)
- A web server (these service checks will be measured synthetic web transactions)
- A mail server (forgot password, etc.)
What type of image you use to visualize this summary for logical status is of course, up to you. Here is what our example looks like:
By now you should have a good understanding of how to add all of the required widgets to the dashboard, so we won’t describe that step by step this time. For each hex with a white background, we added a state widget for each service that provides that dependency for the web application. Then, we add a host status widget in the center which will give the state of the host itself.
The primary difference with this summary dashboard is that we don’t have a detailed dashboard to link to for more information, because we really don’t need one. If we need more information on this particular summary, we should link to the Status Summary dashboard instead. So, when adding the link to the box widget for this type of check, simply link to the host status (or host group, custom group, NOC board, whatever you like!) in Status Summary; that will be the URL presented in your browser when viewing the status of a host.
Here’s what our current dashboard looks like after implementing the examples we've shown here:
Appendices
Appendix A: SLA Definition Rules
The Rules field in the SLA definition is optional. You can use rules for combining conditions that you want to affect the SLA status, or for masking small outages so that they do not affect the SLA unless they are above a threshold in number or duration. Sometimes there are failures that can be classified with rules. For example:
- A failure lasting less than 5 min can be classified as not SLA relevant
- Failures lasting less than 5 min are classified as non-SLA relevant, but if the sum of these short failures is greater than 10 min, the failures are then classified as SLA relevant
Formal structure:
Rules are composed of three elements: a Condition, a Type for that condition, and a Value. Generally, if the condition is met, the state indicated by "then" is returned.
{"conditions":[{"type":":duration|:count", "comparison":"<|<=|>|>=", "value":"<positive integer>", "operator":"AND"}],"then":"PARTIAL CRITICAL"}
You may define several conditions. The possible condition types are "duration" and "count", which are compared using the <, <=, >, or >= operators. The value to compare the duration or count to is a positive integer, the meaning of which varies by condition type (seconds for duration; a plain number for count). There is an operator for combining conditions (only AND is supported at this time).
Rules are stored in the SLA object:
Rules are captured with the SLA, and stored in the SLA object as JSON. Examples:
Non-SLA relevant outages
An outage of less than 5 minutes is not SLA relevant (status = OK).
{"conditions":[{"type":":duration", "comparison":"<", "value":"300"}],"then":"OK"}
Example 1: Up to 5 short outages
Classify a series of up to 5 failures, each under 5 minutes, as not SLA relevant (status = OK).
{"conditions":[{"type":":duration", "comparison":"<", "value":"300"},{"type":":count", "comparison":"<", "value":"5", "operator":"AND"}],"then":"OK"}
Example 2: One short failure as "PARTIAL CRITICAL"
Classify a failure of under 5 minutes as a custom status, like PARTIAL CRITICAL. Note that this SLA object status is an extension to the possible statuses of regular services or BSM service objects in GroundWork.
{"conditions":[{"type":":duration", "comparison":"<", "value":"300"}],"then":"PARTIAL CRITICAL"}
Example 3: Combined outage
You can also classify a series of short outages as PARTIAL CRITICAL based on the total combined outage duration.
{"conditions":[{"type":":duration", "comparison":"<", "value":"660"},{"type":":sum", "comparison":"<", "value":"3000", "operator":"AND"}],"then":"PARTIAL CRITICAL"}
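To check a rule before saving it, it can help to evaluate it against sample numbers. The Python sketch below parses a rule in the structure shown above and returns the "then" status when every condition holds (conditions are combined with AND, the only supported operator). How GroundWork measures duration and count for a given outage is not specified here, so the `measured` dictionary and the `evaluate_rule` function are assumptions for illustration. Note that Python's `json.loads` requires strictly valid JSON, so a trailing comma inside the conditions list would be rejected.

```python
import json
import operator

COMPARISONS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def evaluate_rule(rule_json, measured):
    """Return the rule's "then" status if all conditions hold, otherwise None."""
    rule = json.loads(rule_json)
    for condition in rule["conditions"]:
        compare = COMPARISONS[condition["comparison"]]
        actual = measured[condition["type"]]       # e.g. measured[":duration"]
        if not compare(actual, int(condition["value"])):
            return None                            # AND: one failed condition is enough
    return rule["then"]


if __name__ == "__main__":
    rule = (
        '{"conditions":['
        '{"type":":duration", "comparison":"<", "value":"300"},'
        '{"type":":count", "comparison":"<", "value":"5", "operator":"AND"}'
        '],"then":"OK"}'
    )
    # Hypothetical measurements: a 240 second outage, the third short one so far.
    print(evaluate_rule(rule, {":duration": 240, ":count": 3}))
    # A 400 second outage does not match the rule, so it keeps its normal status.
    print(evaluate_rule(rule, {":duration": 400, ":count": 3}))
```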