Best Practices
Back Up
Back up the entire system
- You can easily back up everything unique about your GroundWork installation with the process described in System Back up Restore.
- We suggest you do this whenever you are about to upgrade to a new version, add connectors, or otherwise adjust the system as a whole.
- A backup does take monitoring offline for a few minutes, so make sure you have the redundancy in place to maintain continuity. You can also back up and restore just the databases, but this is not all-inclusive and can lead to data loss, depending on what changes in between.
- Since you can only restore to the exact version you backed up, keep a copy of the prior release installer on disk in case you need it. You can then restore to the prior version by removing the old containers and volumes, re-installing the old version, and restoring the backup you made.
Back up Nagios monitoring configuration
- Always create a backup before making changes to the Nagios configuration, so changes can be easily recovered if necessary.
- Each time you implement a change, a backup is created automatically, but by then the change is already in place in the database, so the automatic backup will not capture the pre-change state. We recommend making a backup before starting a working session, just to be sure.
- Use annotations on your backups. The backup page lets you add text to make it easy to identify which dated backup is which. You may wish to use a convention, as you might for comments when you commit code to a repository, for example.
- To make a manual backup, navigate to Configuration > Nagios Monitoring > Control > Backup and restore. Adding a note to identify the backup is helpful if you later need to restore, and locking a backup prevents its eventual automatic deletion. See Configuration Backup and Restore.
Containers
Don't stop some of the containers
- You might think it doesn't matter that some of the containers are not running, and be tempted to stop them to save on CPU or RAM. This isn't a good idea - pretty much all the containers are there for good reason, and there's a lot of inter-dependency. You can, however, choose not to install the Elasticsearch, Logstash, Kibana, Filebeat, and curator containers by de-selecting them in the installer user interface on clean installs. If you want to stop using these containers in your upgraded system, please contact support for instructions and we will help you deactivate them in a supportable way.
- If you are otherwise running into resource constraints, please let us know by submitting a support request in GroundWork Support, and we'll help you tune the system as needed.
- In general, you may find you need more disk and RAM than you did in GroundWork 7.x; see the System Requirements page for details.
Distributed Monitoring
Use GDMA monitoring
- We have not changed the open nature of the GroundWork Distributed Monitoring Agent (GDMA Monitoring). It is still all laid out in a clear file structure that is familiar to those used to working with GroundWork servers and agents. You should use it to make the most of GroundWork, by distributing the monitoring workload out to the monitored servers, including Windows™, Linux, and other platforms.
- Use the GDMA auto-setup feature to fully automate the detection and monitoring of all the resources on your systems and to make it simpler to adjust thresholds on service checks you configure in the user interface.
Plugins
Tell us what plugins you add
- Many Nagios and Nagios-compatible plugins can be run from GroundWork out of the box, but we recognize we don't have every possible plugin loaded. If you do find you need to add some, you can copy them into the Nagios container, and even add any missing packages with entrypoint scripts. Let us know what you do add by filing a support request, and we will make them available natively if it makes sense, or at the least add your dependencies to the next release.
- If you prefer to add your own dependencies, or your requirements go beyond what is easily containerized, then you can always use GDMA. The file structure is clean and open in the GDMA package, and is easy to add to, customize, and use for specialized monitoring.
Common mistakes in writing monitoring plugins
The following sorts of issues seem to come up repeatedly. You would do well to learn from the mistakes of others.
- Dumping error messages from outside sources directly into the plugin output, without any sort of filtering. Such text may contain vertical-bar characters, which are not allowed in Nagios plugin output for any purpose other than separating status text from perfdata items. If you do not totally control the status text within the plugin itself, and somehow obtain part of it from some outside source, be sure to filter the text to either just strip vertical bars entirely, or substitute them with some other safe character such as # that will not be mistaken for anything that might normally appear in the status text (see the first sketch after this list).
- Having plugins produce excessively long status text. There are limits imposed by good sense, and limits imposed by what the downstream databases can reasonably handle. Plugin output is not a vanity press where you should be publishing War and Peace. Find some way to summarize the most critical points of interest and point to some other place that contains detail.
- Not paying attention to the Plugin Development Guidelines, as regards the exact format of perfdata items and the separation between them. The online guides are imperfect in their description of how perfdata items are handled in multi-line output. But the format of individual perfdata items is documented accurately, and must be followed exactly (the first sketch after this list shows the layout). Otherwise, perfdata gets dropped on the floor and clogs up logfiles, repeatedly. Don't let that happen to you.
- Not testing the plugin under both normal and adverse conditions. You guessed it: your plugin will somehow misbehave in unexpected ways when the world it is probing is not in the hoped-for condition. Don't let your first discovery of that fact be in a production context. Forget the fine details of that SQL query for a moment; were you even able to connect to the database in the first place? Forget the fine details of extracting a value from some XML you got from a webserver; what happens if that XML doesn't contain the tag you're looking for, or the webserver occasionally returns something other than valid XML, or you weren't able to connect to the webserver in the first place?
- Defaulting to an okay state. If your plugin cannot determine the actual state of the monitored objects, it should return the appropriate UNREACHABLE (for a host) or UNKNOWN (for a service) value, not something that masks a problem by pretending nothing bad has happened (see the second sketch after this list). If you get such states often, that's your signal that you need to discover the underlying cause, not ignore it.
- Not parsing the plugin arguments robustly. Does the plugin detect when you have reversed the intended order of arguments, or when you have omitted some required arguments, or when your arguments are out of range? (The last sketch after this list shows one approach.)
- Including timestamps in status text without an accompanying timezone. These days, monitoring results may be seen by multiple people all around the world, far from where the probing actually happened. It's very difficult to relate a time the plugin recorded to the time you see locally, which may be several timezones away. While in general it is best not to include timestamps in status text to begin with (and rely on other aspects of the monitoring to provide time data), if you have to do it, the timestamp needs to be labeled properly.
- Not versioning your plugin. If I have two distinct copies of the plugin file, do I have to read the entire code base to figure out which one to use? Or does it respond to a --version argument and spill out the version number?
- Not running a code review. Plugins tend to be short, so there is a strong temptation to believe that you must have gotten it right the first time. Don't fall for that; fresh eyes can see things you cannot.
- Providing little or no documentation. Does your plugin spill out a readable usage message if you invoke it without arguments? What is the range of devices, models, or circumstances that the plugin covers? What do the plugin options mean? Without having such stuff written down, nobody can tell whether the plugin actually addresses its intended area of coverage, or how or even whether they should apply the plugin to new situations. For that matter, nobody can tell whether the plugin was applied correctly in the first place, in the places where you have already used it.
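To make the vertical-bar and perfdata points concrete, here is a minimal sketch in Python. The helper names are our own illustration, not part of any Nagios or GroundWork library; the perfdata layout shown is the standard 'label'=value[UOM];[warn];[crit];[min];[max] format.

```python
#!/usr/bin/env python3
# Sketch only: these helpers are our own illustration, not part of
# any Nagios or GroundWork library.

def sanitize_status(text: str) -> str:
    # A vertical bar in status text would be parsed as the start of
    # perfdata, so substitute a safe character instead.
    return text.replace("|", "#")

def perfdata_item(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    # Standard perfdata item: 'label'=value[UOM];[warn];[crit];[min];[max]
    return f"'{label}'={value}{uom};{warn};{crit};{minimum};{maximum}"

# An error string from an outside source, dumped verbatim, would have
# split the status text at the first vertical bar.
status = sanitize_status("backend said: retries|timeouts exceeded")
print(f"WARNING - {status} | {perfdata_item('retries', 7, warn='5', crit='10')}")
# Output: WARNING - backend said: retries#timeouts exceeded | 'retries'=7;5;10;;
```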
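Next, a minimal sketch of the exit-code discipline: the 0/1/2/3 values are the standard plugin exit codes, while the probe, its file path, and the thresholds are placeholders you would replace with your own.

```python
#!/usr/bin/env python3
import sys

# Standard plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def read_queue_depth():
    # Placeholder probe; the path is hypothetical. Imagine this parsing
    # a real status file or querying a real API.
    with open("/var/run/myapp/queue_depth") as f:
        return int(f.read().strip())

try:
    depth = read_queue_depth()
except Exception as err:
    # We could not determine the real state, so say exactly that rather
    # than masking the problem with a default OK.
    print(f"UNKNOWN - could not read queue depth: {err}".replace("|", "#"))
    sys.exit(UNKNOWN)

perfdata = f"'depth'={depth};50;100;0;"
if depth >= 100:
    print(f"CRITICAL - queue depth is {depth} | {perfdata}")
    sys.exit(CRITICAL)
if depth >= 50:
    print(f"WARNING - queue depth is {depth} | {perfdata}")
    sys.exit(WARNING)
print(f"OK - queue depth is {depth} | {perfdata}")
sys.exit(OK)
```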
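Finally, a sketch of robust argument handling with Python's standard argparse module, covering the --version flag and a usage message as well. The option names, thresholds, and version string are our own choices, not a prescribed interface.

```python
#!/usr/bin/env python3
import argparse
import sys

VERSION = "1.2.0"  # bump this on every change, so copies are distinguishable

parser = argparse.ArgumentParser(
    description="Check queue depth against warning/critical thresholds.")
parser.add_argument("--version", action="version",
                    version=f"%(prog)s {VERSION}")
parser.add_argument("-w", "--warning", type=int, required=True,
                    help="warning threshold (messages)")
parser.add_argument("-c", "--critical", type=int, required=True,
                    help="critical threshold (messages)")

if len(sys.argv) == 1:
    # Invoked bare: print the full usage message instead of guessing,
    # and exit UNKNOWN rather than OK.
    parser.print_help(sys.stderr)
    sys.exit(3)

# Note: argparse itself exits with status 2 on missing or malformed
# arguments, which collides with CRITICAL; a stricter plugin would
# remap that to 3 (UNKNOWN).
args = parser.parse_args()

# argparse catches missing and non-integer values; range and ordering
# checks are still on us.
if not 0 <= args.warning <= args.critical:
    parser.error("expected 0 <= --warning <= --critical")

print(f"checking with warn={args.warning} crit={args.critical}")
```

Run it with no arguments to see the usage message, or with --version to confirm which copy of the plugin you have.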
Remember, the best way to recycle hindsight is to use it as foresight. Using it as mulch is way less effective.