What is Cloud Hub?
To monitor multiple, heterogeneous virtual server environments whose configurations change rapidly, GroundWork created Cloud Hub. Cloud Hub is a data collector agent specialized in gathering metrics from a variety of virtual environments and integrating them seamlessly into the GroundWork Monitor Enterprise system. As the name implies, it is a hub: a central point from which to reach multiple endpoints. Those are API endpoints, which Cloud Hub polls on a regular, semi-rapid basis. Hubs rely on GroundWork central monitoring services for role-based access control, event management, notifications, performance graphing and analysis, reporting, dashboards, status maps, and runtime user interfaces.
Cloud Hub itself runs as an application in a Java servlet container such as Tomcat, and it is included in GroundWork Monitor. You can monitor multiple endpoints and report the results to a GroundWork API. That API can be on any GroundWork server; it doesn't have to be the one you are running Cloud Hub from, so distributed monitoring architectures are supported. That said, most people run it on the same system.
What can be monitored with Cloud Hub?
So what can you monitor with it? Not quite everything, yet, but there is a fairly easy development cycle for Cloud Hub connectors. A connector is like a plug-in: it has the details needed to pull monitoring data out of a particular API, such as the Amazon CloudWatch API, the cAdvisor API from Docker, or MS Azure. We also have connectors for Google, VMware, OpenStack, and Cloudera.
You might notice another connector, NeDi, which is a network discovery and monitoring system we have long bundled with GroundWork Monitor. We created the connector because it's an easy way to get monitoring results out of NeDi into GroundWork, and to use them for graphing, notifications, SLAs, and all the dashboards and features you see in GroundWork Monitor. NeDi is low-touch and fast, so once it is up and running you can do a lot of monitoring quickly; with the connector you can tie that network-equipment monitoring into the server monitoring you do with Nagios and the cloud monitoring you do with the other connectors. That's the idea here: bringing all the monitoring results into one GroundWork console.
GroundWork and Remote Servers
So, how hard is it to set up a connector? You need two endpoints defined. First, you set up the GroundWork server to report results to. That's easy, since all the information is right there in the UI: grab the RESTAPIACCESS token (Administration > Security) and enter it in the connection dialog. Set a few options, then test the connection. If it works, you can move on to the other side, the place you want to collect metrics from.
Before we go on, notice there are a couple of options on the GroundWork side, and these vary a little with the connector. The Monitor option gives you a way to know when the connector is having trouble reaching the endpoint, by creating a service on the host it reports to. This service is named for the connector; it goes to warning whenever the connector misses a cycle, and to critical if it misses too many cycles. Why would a connector miss a cycle? Networks aren't perfect, and endpoints aren't always available even when the network is fine. The Retries option controls how long the connector keeps trying; if it fails too many times, all the hosts and services for that connector become unreachable and unknown. There is also a Merge Hosts option, which tells the connector to treat hosts discovered by Cloud Hub that have the same name as hosts monitored by another method (and so fed in with a different application type) as one host, even if the names differ in letter case, as sometimes happens with Windows hosts. It's good to use Merge Hosts when you monitor the same hosts with several methods.
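The Merge Hosts behavior can be pictured as case-insensitive name matching across application types. This is a minimal sketch of the idea, not Cloud Hub's actual implementation; the function and data shapes are invented for illustration.

```python
def merge_hosts(existing, discovered):
    """Merge hosts discovered by Cloud Hub into an existing inventory,
    matching host names case-insensitively (e.g. WIN-DB01 vs win-db01).
    existing: canonical name -> set of application types.
    discovered: reported name -> application type."""
    index = {name.lower(): name for name in existing}
    merged = {name: set(types) for name, types in existing.items()}
    for name, app_type in discovered.items():
        canonical = index.get(name.lower(), name)
        merged.setdefault(canonical, set()).add(app_type)
    return merged

# One host monitored by Nagios and rediscovered by Cloud Hub with
# different letter case collapses onto a single merged entry.
inventory = {"WIN-DB01": {"NAGIOS"}}
result = merge_hosts(inventory, {"win-db01": "AZURE", "web-02": "AZURE"})
```

With merging off, the same machine would instead show up twice, once per monitoring method, which is exactly what the option exists to avoid.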
On the side connecting to the monitored endpoint, things are more involved. You have to supply the credentials needed to access the endpoint; that might be a username and password, or a set of keys (as with AWS). The account you use matters, since the access you give it determines what you can monitor. That is also a convenient way to filter what ends up in GroundWork, since you probably don't want to monitor every VM or instance in your cloud (though of course you can), and it may be preferable to the Black List feature of GroundWork. In Azure, you generate an authentication file, which you then upload here.
Once you can authenticate, you might want to set another option or two, depending on the connector. In NeDi, for example, you can monitor for device metrics and/or policy violations. Policies are useful as global alerts: say, someone just plugged in a phone and NeDi detected it as a policy violation. Device metrics are more like what you get with Nagios or SNMP polling; the point is that you can decide here in the connector which of these to monitor.
If you use the AWS connector, for example, you can decide whether to monitor custom metrics that your apps or instances publish to CloudWatch. GroundWork will pick those up, make services out of them, and persist the metric data for you.
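For context on the producing side, an application publishes a custom metric to CloudWatch as a metric datum in a namespace. The namespace, metric name, and dimension values below are made up for illustration; the actual boto3 `put_metric_data` call is shown commented out so the snippet stays self-contained and runnable without AWS credentials.

```python
# Build a CloudWatch custom-metric payload. Cloud Hub's AWS connector
# can pick such custom metrics up and turn them into GroundWork services.
def custom_metric(name, value, unit="Count", dimensions=None):
    datum = {"MetricName": name, "Value": value, "Unit": unit}
    if dimensions:
        datum["Dimensions"] = [
            {"Name": k, "Value": v} for k, v in dimensions.items()
        ]
    return datum

payload = custom_metric("QueueDepth", 42,
                        dimensions={"InstanceId": "i-0abc1234"})

# With boto3 installed and credentials configured, you would publish via:
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="MyApp", MetricData=[payload])
```

Once published, such a metric appears alongside the built-in CloudWatch metrics and can be selected in the connector like any other.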
You can also decide, to some degree, how Cloud Hub groups the hosts and services it finds. Things are grouped by function automatically, so you will probably want to create some custom groups (see How to create Custom Groups) to organize the results. In the case of Amazon, though, you can also use tags: instances carrying a tag you specify are grouped by that tag's value, so you can manage GroundWork grouping from AWS, if you like.
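Tag-based grouping amounts to bucketing instances by the value of a chosen tag. A minimal sketch of that idea, with instance data invented for illustration:

```python
from collections import defaultdict

def group_by_tag(instances, tag_key):
    """Bucket instances by the value of a chosen tag; instances
    without the tag are left ungrouped."""
    groups = defaultdict(list)
    for inst in instances:
        value = inst.get("tags", {}).get(tag_key)
        if value is not None:
            groups[value].append(inst["id"])
    return dict(groups)

instances = [
    {"id": "i-01", "tags": {"Team": "web"}},
    {"id": "i-02", "tags": {"Team": "db"}},
    {"id": "i-03", "tags": {"Team": "web"}},
    {"id": "i-04", "tags": {}},  # untagged: falls outside tag groups
]
groups = group_by_tag(instances, "Team")
```

Each resulting bucket corresponds to a group in GroundWork, which is why tagging consistently in AWS effectively manages your GroundWork grouping for you.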
Once you Test and Save the connection configuration, you can go to the metrics screen by clicking Next. We will use the Azure connector here, since it has a rich set of metrics.
The available metrics are discovered automatically when you connect. You get a default set to start with, but you can add any available raw metric and change its format with standard C-style format modifiers. That's not all: you can combine any set of metrics into a synthetic metric. This lets you look at ratios, like I/O to CPU, or change an absolute measure, like used bytes on disk, into a relative one, like percent full. There are lots of possibilities, and we give you some preconfigured common unit conversions and other useful functions to normalize your metrics.
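A synthetic metric is just an expression over raw metrics. As a sketch (not Cloud Hub's actual expression engine), here are the two examples from above, a percent-full conversion and an I/O-to-CPU ratio:

```python
def percent_full(used_bytes, total_bytes):
    """Synthetic metric: turn an absolute measure (used bytes) into a
    relative one (percent of capacity used)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes

def io_per_cpu(io_ops, cpu_percent):
    """Synthetic ratio metric: I/O operations per unit of CPU load."""
    return io_ops / cpu_percent if cpu_percent else float("inf")

# 750 GiB used on a 1 TiB disk is about 73.24% full.
disk_pct = percent_full(used_bytes=750 * 1024**3, total_bytes=1024**4)
```

The advantage of the relative form is that one threshold (say, 90% full) works across disks of different sizes, which an absolute used-bytes threshold cannot do.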
Once you have selected the metrics you want to monitor from this source, you can choose to graph them (more on that in a moment). You can also, optionally, set up thresholds for warning and critical levels so that you can generate alerts.
This works just like most monitoring systems, and the synthetic metric dialog lets you test the calculations with sample input values to make sure you get alerted when you expect. For customization of metrics, see Customizing Metrics.
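Threshold evaluation follows the usual monitoring convention: compare the (possibly synthetic) value against warning and critical levels and emit a state. A minimal sketch you can feed test values into, assuming the common high-is-bad convention:

```python
def evaluate(value, warning, critical):
    """Classify a metric value against thresholds. Assumes alarms fire
    on high values, with critical >= warning (the common convention)."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"

# Sample inputs, much as you would try in the synthetic metric dialog:
states = [evaluate(v, warning=80, critical=95) for v in (42, 85, 99)]
```

Checking a few values on either side of each threshold like this is exactly the point of the dialog's test feature: it confirms your expression and thresholds alert where you intend.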
Now, let's make this active by saving the configuration and starting the connector: Save and Next from the Metrics screen, then START the connection.
A few other notes about this screen. Start enables a configured connector to begin the discovery and data collection process. If you decide you do not want to monitor a particular region, simply select Stop for the corresponding connector; the connector's configuration will be maintained for a subsequent start. Modify opens the Configuration page with a link to the Metrics screen. The Status option provides connection status information, including error details. If a configured connector fails to connect, a connector-specific service will be updated to a Warning state, or Critical if you run out of retries (hosts will still become Unreachable and services Unknown if retries are exhausted). To stop and completely Delete a connection, see How to delete hosts.
That's really it: add, select metrics, and start. You can change the metrics and thresholds any time, just by modifying the connector. If you want different thresholds for some hosts, set up a new connector and restrict the access of the account it uses to just those hosts; the new thresholds will apply only to the hosts that connector sees. There's no limit to the number of connectors you can have, though of course the system you run them on might slow down. In that case, just scale horizontally and add another connector host.
Viewing the Results
So what do the results look like? Let's open Status Summary and see. Here is an Azure host running, along with a few other resources. We can bring up the detail page for this host and see the monitoring history and performance graphs, including those built from Cloud Hub metrics.
You can also create Grafana dashboards with these metrics and graph them alongside other monitoring data. It's a flexible and simple model that supports the GroundWork Grafana data source, so all you really need to know is which host group you want to select metrics from. That is a lot easier than assembling a set of queries, though of course you can do that too if you need something fancier.
A few things about Cloud Hub to watch for. One is the polling interval. Remember we mentioned it was semi-rapid? The minimum polling interval is currently 1 minute. We settled on a minute because some API endpoints charge per access, and others are throttled by the vendor, as Azure is. What we found was that while we could get even large queries down to a few seconds, in general we stayed out of trouble by polling at 1-minute intervals.
Another thing is retries. These combine with the polling interval to give you a window to fix connectivity to the monitored endpoint, or to allow a congested or throttled endpoint to recover. So if you have 10 retries at 5-minute intervals, you have a 50-minute window before the connector shuts down, all the monitored hosts become unreachable, and the services become unknown. You can set fewer retries or a shorter interval if you like, or set retries to infinite. With infinite retries, if the connector loses contact with the monitored endpoint, you will get a critical status on the connector service after 10 tries; that limit is set in the XML configuration, and yes, you can change it. The connector will keep trying forever, though, and never shut down or set the monitored hosts and services to unreachable/unknown.
If you set retries to less than 10, then the connector service will be warning from the first failure until the last retry, at which point it turns critical AND all the hosts and services become unreachable/unknown.
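The arithmetic and state behavior above can be captured in a small sketch: the outage window is retries times the polling interval, and the connector service is WARNING while retries remain, turning CRITICAL once they are exhausted. Function names here are illustrative, not part of Cloud Hub:

```python
def outage_window_minutes(retries, interval_minutes):
    """Time available to fix connectivity before the connector shuts
    down and its hosts/services go unreachable/unknown."""
    return retries * interval_minutes

def connector_state(failures, retries):
    """WARNING from the first failed cycle through the last retry,
    CRITICAL once retries are exhausted."""
    if failures == 0:
        return "OK"
    return "CRITICAL" if failures >= retries else "WARNING"

window = outage_window_minutes(retries=10, interval_minutes=5)
states = [connector_state(f, retries=3) for f in (0, 1, 2, 3)]
```

This is why the retries-times-interval product is the number to plan around: it is the maximum outage you can absorb before the connector gives up.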
This service is a good way to track your Cloud Hub monitoring and get an early indication (without an alarm storm) when things go wrong.
Handling Alert Notifications
How do we handle notifications? During normal operations, Cloud Hub keeps itself in synchronization with the virtualized server inventory and delivers monitoring data for hypervisors, VM containers, and network, storage, and other resource pools. Alerts are generated within Cloud Hub when it detects that the value of a metric exceeds its threshold. Such alerts are passed to the Status application for further analysis and processing by NOC operators, and are also passed to the Notification and Escalation subsystem, which applies its rules to notify the contacts scheduled to receive the alert at that point in time. It's important to note that these notifications don't pass through the Nagios instance; they are an independent notification channel.
Some connectors have features that others lack: the Google Cloud connector has a discovery feature, and the AWS connector can be made to use EC2 tags for various functions. For detailed reference information on individual connectors, and to configure a Cloud Hub connection, see Configuring Connections.