Provides best practices, lifecycle-management guidance, and other practical advice for GDMA Auto Setup.

Best Practices

Managing Master Files

  • You must decide where to store the master copies of your instructions and trigger files. This should be outside of the GroundWork product itself, so you do not risk losing such data during a product-version upgrade or a migration to some other server.
  • Speaking of risks, the master instructions and trigger files that you maintain are precious cargo. You are strongly advised to put them under regular backup control.

Categorizing Instructions Files

If the auto-discovery instructions file passed to each GDMA client were precisely tailored to the resources to be monitored on that machine, there would be little point in running auto-discovery. Instead, it would make sense to just apply that setup directly to the Nagios configuration on the server, and be done with it. The advantage of auto-discovery is that it allows detailed provisioning of applications on the GDMA client to be decoupled from setting up the monitoring of whatever resources will be present on the client machine. To that end, the instructions file that gets passed to the GDMA client must be in some fairly general form, allowing the discovery process on the client itself to pick and choose which parts actually apply. This should minimize the need to update the content of the instructions file every time a pass of reconfiguration is needed.

Keeping that principle in mind, you should lean strongly toward constructing very general instructions files, naming them carefully, and sharing them widely. Because you do want control at the individual-host level in some cases, each host will get its own copy of an instructions file. But there should be groups of machines of similar base characteristics whose instructions files will simply be unaltered copies of a generic instructions file for that class of machine. It's up to you to decide what class boundaries make sense in the context of your infrastructure.

If all such generic files are not collapsed together into a single globally-applicable file, you will be responsible for keeping track of several generic files, and for the mapping between each GDMA client and the class of machine represented by its particular generic file. That mapping will be needed in order to put updated copies of the correct generic file into place when a reconfiguration is needed on a particular host. In the best case, if strict naming or address-assignment rules are consistently followed, the mapping can be derived solely from readily accessible data such as a hostname or IP address, instead of requiring a table lookup in some external configuration database.
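
For example, if your hostnames encode the machine class, that mapping can be captured in a small site-specific script on the server. The sketch below is purely illustrative; the master-file directory, class names, and hostname prefixes are assumptions for this example, not anything defined by the product.

    #!/bin/sh
    # Illustrative only:  map a GDMA hostname to the master copy of the
    # generic instructions file for its machine class, based on a
    # site-specific hostname prefix convention.
    MASTERS=/path/to/your/master/files      # wherever you keep master copies
    host="$1"
    case "$host" in
        web-*)  class=webserver ;;
        db-*)   class=database  ;;
        *)      class=generic   ;;
    esac
    echo "$MASTERS/instructions_${class}"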

Configuration Conventions

  • Use service instances in the Nagios configuration tool (and generate them in your instructions files) where they make sense, which is when you have multiple copies of the same kind of resource running on the GDMA client. This should reduce the complexity of your configuration. Naming of service instances is controlled in the instructions file by the instance_suffix directive within sensor definitions (see the sketch at the end of this section).
  • Your service externals should make good use of the new macros that are supported when building externals.
  • The following applies to specifying a service_profile directive in your sensor definition, and, in a similar vein, to specifying a host_profile directive (if that host profile has service profiles attached).

The externals_arguments directive in a sensor definition defines only one set of argument values, regardless of how many services are assigned to the service profile defined for that sensor. Those same externals arguments will be applied to all services in the profile, so all the service externals for services in that profile need to agree on a common usage pattern for those arguments ($ARG1$ means this, $ARG2$ means that, and so forth). This means the service profile should be arranged to include only a minimal number of services (perhaps just one), so such agreement can be reached without conflict. This fact will affect how service profiles are defined in the Nagios configuration tool, and how auto-discovery instructions files are designed.

It is this complication that makes the service_profile directive less useful than separate sensor definitions, each with its own service directive, even if that means duplicating some part of the setup. That approach lets you tailor the argument-related sensor directives as needed for each service.

So the general advice is: use service_profile sparingly, if at all. And don't include complex services or sets of services in service profiles attached to a host profile that you use with GDMA Auto-Setup.
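
To make that advice concrete, the schematic fragment below shows the shape of two separate sensor definitions, each carrying its own service directive and its own externals_arguments, rather than a single sensor pointing at a multi-service service profile. The layout is deliberately schematic, and the sensor names, services, and argument values are invented for this example; see the Auto-Setup instructions-file reference pages for the actual syntax.

    # Schematic only -- not literal instructions-file syntax.
    # Two independent sensor definitions, each owning its own service and its
    # own externals arguments, so unrelated services never have to agree on
    # what $ARG1$ and $ARG2$ mean.

    sensor "ssh_daemon":
        service             = "ssh_process_count"
        externals_arguments = "5,10"        # $ARG1$ = warning, $ARG2$ = critical
        instance_suffix     = ""            # single instance, no suffix needed

    sensor "web_server":
        service             = "httpd_process_count"
        externals_arguments = "20,40"       # thresholds meaningful only to this service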

Testing

  • If you are using dynamic sensors, put in place whatever extra automation or human-run procedures you need to ensure that the intended set of resources was discovered, and that resources were not skipped simply because they happened to be down at the time the pass of discovery was run.
  • Particularly when developing new instructions, use a dry-run trigger file to perform experimental runs of Auto-Setup before putting changes into production (a minimal dry-run trigger is sketched below). Examine the discovery results and analysis with the autosetup tool to decide when you are ready for deployment.
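
A minimal dry-run trigger might contain nothing more than the line below. The directive-per-line form shown here is only a sketch; see the Tools and Files for Auto Setup page for the authoritative trigger-file format and the full set of supported options.

    # Sketch of a dry-run trigger file.  With last_step set to
    # "test_configuration", the client runs discovery and the server analyzes
    # the results, but no live configuration changes are made.
    last_step = "test_configuration"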

Deployment

GDMA clients will run a pass of discovery as soon as they start a polling cycle and find an installed trigger file on the server that is newer than the installed instructions file. Discovery results are sent to the server at the end of that pass, and if this is a live-action run, configuration changes will be put into the database right away, new externals will be built, and the GDMA client will immediately begin monitoring with the new configuration. However, Nagios itself won't understand any new services at that time, because its own running configuration will not have been changed. Recognition of such checks will only occur at the next Control > Commit operation. Until that happens, check results from the new services will simply be dropped on the floor.

The net effect is that some degree of temporary mismatch will naturally occur. You can minimize it with some simple tactics:

  • Use dry-run testing if you are unsure whether discovery and/or analysis will produce the desired results.
  • When you are ready to update the client configurations using Auto-Setup, make sure you don't have any half-finished Nagios configuration changes in progress that are unrelated to Auto-Setup. When you perform a Commit, whatever is half-finished in those other areas will suddenly become part of the live Nagios configuration.
  • When you do want to update some GDMA client configurations, install the instructions and trigger files as needed using the autosetup tool, for as many clients as you want to change. Then wait long enough for a full GDMA polling cycle to pass on all of those GDMA clients, and only then run a Commit operation on the server (see the outline below).
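
Putting those points together, a live deployment pass might follow the outline below. The autosetup invocations are intentionally left as placeholders, since the exact subcommands and arguments are covered in Tools and Files for Auto Setup; the host list and wait interval are likewise assumptions for the example.

    #!/bin/sh
    # Illustrative deployment outline; autosetup invocations are placeholders.
    HOSTS="gdma-host-01 gdma-host-02 gdma-host-03"
    for h in $HOSTS; do
        # 1. Use the autosetup tool to install the updated instructions file
        #    for this host, and then the live-action trigger file, in that order.
        :   # (placeholder; see Tools and Files for Auto Setup for the commands)
    done
    # 2. Wait long enough for a full GDMA polling cycle to pass on every one of
    #    those clients (the interval here is site-specific, not a recommendation).
    sleep 600
    # 3. Only then run a Commit operation on the server, so Nagios starts
    #    recognizing check results from the newly configured services.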

Lifecycle Management

GDMA Auto-Setup provides both mechanism and tooling to automate configuration setup. It is up to the customer to provide policy and implementation, based on local needs. To that end, we must discuss the full lifecycle of a GDMA client, to put everything else in perspective.

Individual GDMA hosts which are to be managed by Auto-Setup follow certain well-defined steps in a complete lifecycle. They are initially deployed into the infrastructure (provisioning); they are set up for application use and for monitoring of that activity (initial configuration); they probably have changes imposed over time, and require adjustments to both applications and monitoring (reconfiguration); and they may be eventually pulled from production use (deprovisioning). Each of those steps has to be supported in a well-defined and well-understood manner.

Provisioning

Machines are procured, installed into the infrastructure (or allocated as VMs), have operating systems installed, and are set up for network access. Decisions are made about what applications will be run on which machines, and those decisions are recorded somewhere so the infrastructure can be maintained over time. The applications are installed and start running.

Some time after the operating system is installed, the GDMA client software will be installed. A few basic configuration options are chosen at install time, most notably which GroundWork server the GDMA client will interact with and, now, whether GDMA Auto-Setup is to be enabled. The GDMA system service will be started so monitoring can begin, but at this point the GDMA client knows nothing about the specific applications and other system resources to be monitored, or how to monitor them.

Alternatively, the GDMA client software will be incorporated as part of a VM image, and so will be ready and running as soon as the machine first starts up. The ordinary GDMA install-time options will have been set when GDMA was installed on a template system for creating that VM image, and perhaps adjusted manually thereafter before the VM image is cut. It will still need local discovery and then configuration from the GroundWork server before it begins useful monitoring.

Initial Configuration

Upon startup, the GDMA software will realize that it has no configuration data, so it will reach over to the server looking for it. Not finding any at this point, it will initiate a cycle of Auto-Setup, downloading discovery instructions and a trigger file, running a pass of discovery, and sending the results to the server. After a successful response is received, the GDMA client will reach back over to the server for the configuration data that should now be present there. It can then begin monitoring and sending in the results of that monitoring.

To make that all happen, the customer must provide discovery instructions that detail what resources to look for, how to look for those resources, and what sort of server-side configuration should result if the resources are found. The customer must analyze the kinds of resources they care about, build a discovery instructions file that will encompass all the possible resources of interest, and deploy it on the GroundWork server where the GDMA client can find it. Also, any services that might be run on the GDMA hosts must be created in the Nagios configuration tool in generic form, and host and service profiles mentioned in the discovery instructions must be created so they are available to be applied during the client-configuration phase of Auto-Setup based on the details of the discovery results.

A particular protocol must be followed to make the discovery instructions available to the GDMA client. That protocol involves careful management of installation paths and installed filenames, using atomic operations to move the instructions and trigger files into place, and manipulating file timestamps. On the server side, the details of this protocol are handled by proper invocation of the autosetup tool. The trigger file controls how often discovery is run; each GDMA client will run a pass of auto-discovery only if:

  • The GDMA client finds both the discovery instructions and a sibling trigger file on the server.
  • The trigger timestamp is later than the instructions timestamp.
  • There has been no pass of discovery run on the client since the timestamp on the trigger file. This condition avoids loading the server with continual attempts to re-run the configuration process.
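
Conceptually, these conditions reduce to file-timestamp comparisons, which you can reproduce by hand when checking why a client did or did not run discovery. The shell rendering below is only an illustration of that logic (the real decision is made inside the GDMA poller, against files it fetches from the server; the paths and variable names here are invented).

    #!/bin/sh
    # Rough illustration of the client's decision; not the actual GDMA code.
    instructions=$1     # installed instructions file
    trigger=$2          # sibling trigger file
    last_discovery=$3   # marker recording when the last discovery pass ran

    if [ -f "$instructions" ] && [ -f "$trigger" ] \
        && [ "$trigger" -nt "$instructions" ] \
        && [ ! "$last_discovery" -nt "$trigger" ]; then
        echo "would run a pass of auto-discovery"
    else
        echo "would skip discovery on this polling cycle"
    fi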

As part of managing this setup, the software supports certain dry-run capabilities that can be used to test sets of auto-discovery instructions while they are still under development. This testing can happen at multiple levels:

  • checking consistency and correct construction of the instructions themselves
  • checking operation of auto-discovery on the GDMA client without disturbing the ongoing production monitoring
  • ensuring that auto-discovery results could be used to successfully modify the GDMA client configuration on the server, given the actual available host profiles, service profiles, and generic services on the server

If the initial pass of auto-setup fails because of either auto-discovery or configuration problems, the GDMA client will remain uninitialized. In any case, if the client was able to run discovery and send results to the server, those results will be saved on the server for inspection and diagnosis.

Reconfiguration

We don't expect the set of monitored resources on a GDMA client to remain unchanged over the lifetime of that machine. Therefore, we need a way to cause Auto-Setup to execute another pass on demand, running auto-discovery again to see what has changed, re-registering on the server and modifying the GDMA client's monitoring configuration, and rebuilding externals so the GDMA client will thereafter monitor an updated set of resources.

Reconfiguration changes on the GDMA client can arise from multiple sources, so it is up to the customer to decide when it is appropriate to initiate a new pass of auto-setup. A different set of resources already covered by the instructions file may be deployed. Or the instructions file may need modification to add, change, or delete resource sensor definitions. Or the profiles or other setup on the server might be changed in a way that could result in different client behavior once that setup is combined again with auto-discovery results, even if the discovery results are themselves identical to those from the previous pass of discovery. In any case, the same dry-run test capabilities as were used for initial configuration may be exercised before a live-action trigger is put in place to make production changes on both the server and the client.

Since two files (the instructions file and the trigger file) are involved, a particular protocol must be followed on the server so the GDMA client can be guaranteed that the two files form a matched pair. One file will necessarily be updated before the other, and without some care there would be race conditions that could result in auto-setup running under conditions where it is not desired. For instance, under the wrong conditions a set of provisional auto-discovery instructions used in dry-run testing could accidentally be used for live-action production setup. To prevent that, use the autosetup tool in the manner described elsewhere in these pages, installing the instructions first and the trigger second.

If any part of a reconfiguration pass of auto-setup fails, all of the monitoring setup for that GDMA client will be left as-is. Such failure can include problems in interpreting the instructions file, problems in running auto-discovery, and problems in establishing a consistent configuration on the server. As with initial configuration, as much data as possible will be saved to make it easy to diagnose the source of any failures.

The extent of reconfiguration depends on the change_policy in effect. Currently, GDMA Auto-Setup supports only non_destructive reconfiguration. That mode is something of an extra-safety measure, and it will only add new configuration settings (e.g., new services or new service instances). It will leave alone all existing configuration data that does not match the new pass of discovery results. So if you wish to remove services or service instances, or you know the setup to monitor them must change, you will need to go in and remove those objects from the configuration before running a live-action discovery. If you think this might be an issue for some host, you can always do a full dry-run discovery (last_step = "test_configuration") to check.

GroundWork may consider adding other change_policy settings in the future. For instance, a from_scratch policy would be zero-based, meaning the entire new setup would be calculated using just the latest auto-discovery results together with current Monarch host/service profile and generic-service setup. No reference would be made to the existing setup for that GDMA client. However, an important issue in reconfiguration is the extent to which any configuration changes made outside of GDMA Auto-Setup processing will be preserved or overridden during later passes of Auto-Setup. There are no easy answers here. On the one hand, we want Auto-Setup to have authority to delete monitoring configuration for resources that are no longer present, and to adjust the configuration for resources that are still present but where the instructions now say to monitor in a different manner. Operation in this way provides ease of administration, by automating simple changes. But if the discovery instructions use dynamic sensors, there is some danger that the objects they sense might happen to be only temporarily down, and we don't want such a circumstance to silently remove monitoring of resources we care about.

An alternative change_policy might be ignore_extras, which would modify existing configuration data that does not match the current discovery results, but leave in place any extra config data beyond that which is driven by the current discovery results. Logically, it would identify and preserve outside-the-box configuration changes, but details of such a policy remain to be worked out. The present code does not support that because we don't yet have a good feel for what use cases would justify retaining such exceptions, as opposed to having them be normalized by being specified in the auto-discovery instructions.

Deprovisioning

Deprovisioning can happen at two different levels.

  • Some set of resources is taken out of operation, and should no longer be monitored.
  • The entire GDMA client itself is taken out of operation.

If we supported a change_policy of from_scratch and it were in play, deprovisioning at the resource level would just be a matter of reconfiguration, and would be handled as such (see the previous subsection). As long as the configuration for all resources were under the control of Auto-Setup, and there were no worry that dynamic sensors might inadvertently skip some desired resources and end up deleting them, a reconfiguration pass of auto-setup would be initiated to recognize that some resources are no longer to be monitored, and they would be removed from the configuration for that GDMA client.

With other change_policy settings in play, it is up to the administrator to delete the appropriate services and service instances from the relevant hosts. Service instances need to be handled on a host-by-host basis, because of the detailed nature of such adjustments. On the other hand, if entire services are to be removed from hosts, the (GW7) "Configuration > Hosts > Delete host services" or (GW8) "Configuration > Nagios Monitoring > Hosts > Delete host services" screen may be of use, both for conveniently finding the relevant host services and for deleting them in bulk if you need to deal with more than one host.

With respect to GDMA Auto-Setup, we only care about resource-level deprovisioning. Complete host removal can and should be handled through other means, either by manual host deletion via the Nagios configuration UI or by scripting. (For convenience in the UI, see the [GW7] "Configuration > Hosts > Delete hosts" or [GW8] "Configuration > Nagios Monitoring > Hosts > Delete hosts" screen. For convenience in scripting, see the [GW7] /usr/local/groundwork/core/monarch/bin/monarch_delete_host or [GW8, in the monarch container] /usr/local/groundwork/monarch/bin/monarch_delete_host script. The same action can be handled at a lower level through the dassmonarch package.) Since it is the customer's responsibility to manage auto-discovery instructions and trigger files, whatever procedure or scripting the customer uses to remove a host from the Nagios configuration should also take care of removing the installed instructions and trigger files for a deleted host, along with auto-discovery result and analysis files. For those, see the autosetup remove command in Tools and Files for Auto Setup.
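
For example, a site-specific deprovisioning wrapper might look something like the sketch below (GW7-style paths shown). The argument forms for both commands are assumptions made for the example, so verify them against the respective tool documentation before relying on them.

    #!/bin/sh
    # Illustrative host-deprovisioning wrapper (GW7-style paths).
    # Argument forms are assumptions; check the tool documentation.
    host="$1"

    # Remove the host and its services from the Nagios (Monarch) configuration.
    /usr/local/groundwork/core/monarch/bin/monarch_delete_host "$host"

    # Clean up the Auto-Setup artifacts for that host:  the installed
    # instructions and trigger files, plus saved discovery results and
    # analysis files (see the autosetup remove command in Tools and Files
    # for Auto Setup).
    autosetup remove "$host"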

Customer Responsibilities

Given all the material above, here we summarize what the customer must do to take advantage of GDMA Auto-Setup.

The customer will have ongoing responsibilities for:

  • Editing the bronx.cfg file on the server to establish whatever encryption algorithm and password is desired for all NSCA-based communication from GDMA clients to the server.  Conversely, if the site is using only recent versions of GDMA, all of them are set up with Spooler_Transport = "HTTP", and you know there won't be any other kind of NSCA traffic, you need not worry about that part.
  • Editing the bronx.cfg file on the server if you do have Spooler_Transport = "HTTP" in play for some of your GDMA hosts, and some of those GDMA clients are not within the standard reserved-IP address ranges.  In that case, you will need to adjust the http_listener_allowed_clients setting (see the sketch following this list).
  • Designing, creating, and managing auto-discovery instructions files that reflect the resources the customer wants to monitor.
  • Initial provisioning of a GDMA client:
    • loading GDMA on the client, with appropriate gdma_auto.conf configuration as mediated by the installer or additional post-install editing
    • editing the send_nsca.cfg file on the client if necessary, to mirror the server settings for the chosen NSCA encryption algorithm and password
    • possibly, installing SSL certificates on the client to mediate encrypted HTTPS connections
    • constructing and installing an appropriate discovery-instructions file for this GDMA host on the server
    • running any dry-run discovery and configuration passes desired before putting the GDMA client machine into production
    • checking the results of auto-discovery dynamic sensors for completeness
    • dropping a live-action trigger file into place on the server
  • Ensuring that any local Monarch overrides are in place, if auto-discovery results are not quite sufficient.
  • Reconfiguration as desired (updating the instructions file as needed; dropping a new trigger file) when the set of services to be monitored on the GDMA client changes.
  • Deprovisioning of an entire GDMA client (deleting the host from Monarch; removing all related auto-setup data such as the auto-discovery instructions, trigger, results, and results analysis files).
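
To make the transport-related bullets above a bit more concrete, the fragments below show the general shape of the relevant settings. The value syntax for http_listener_allowed_clients is only a guess here, and other bronx.cfg settings may also be involved; consult the Bronx and GDMA configuration references for the definitive details.

    # On the server, in bronx.cfg:  allow HTTP check-result submissions from
    # GDMA clients that fall outside the standard reserved-IP address ranges.
    # (The value syntax shown is an assumption for this example.)
    http_listener_allowed_clients=192.0.2.0/24,198.51.100.17

    # On each GDMA client, in gdma_auto.conf:  spool check results over HTTP
    # instead of NSCA.
    Spooler_Transport = "HTTP"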

Current Limitations

  • There are no automated deletions as of yet; see the change_policy discussion under Reconfiguration for the rationale behind that and some of the other limitations listed here.
  • Multi-host mode is not supported; the current crop of sensors only looks at the local machine.
  • The soft_error_reporting option in a trigger file is currently ignored.  Implementation of this feature has been delayed until a later release of the server-side code.
  • The trigger file may optionally contain a change_policy directive to override the default_change_policy set in the server register_agent_by_discovery.conf file. However, only the non_destructive setting is currently supported for either the default setting or the override. Implementation of alternative change policies has been delayed until a later release of the server-side code.
  • Certain other options in the server-side configuration file are not yet functional.
  • The audit feature of the autosetup tool is not yet functional.

Known Issues

  • Certain trigger options are provided for in the design but are not yet implemented.
  • Only a change_policy of non_destructive is currently supported.
  • The change_policy cannot yet be set at the sensor level; it is available only as a system default and as a trigger-level override.
  • Automated deletion of configuration objects is not currently handled (see Current Limitations, above).
  • These limitations apply as of GDMA 2.7.1, and should be assumed to apply to later versions as well until explicitly marked as rescinded as of some specific later release.
  • If the server fails to process GDMA discovery results all the way to completion, errors are logged on the server, but no events are generated in Foundation, where they would be more likely to be seen (or could be used to generate an alert).  It is therefore up to the customer to scan through either the discovery analysis or the configuration setup to verify that all went well.

Future Directions

  • Currently, matching of a single sensor's pattern is used to activate inclusion of the associated host profile or service into the generated configuration.  It's possible that there might be some advantage to supporting logical combinations of matching or not matching multiple sensor definitions as the final activation gate for such inclusion.  We don't have any particular examples of this in mind, but if you believe it would be useful in your context, please let us know.
  • Comments are supported in auto-discovery instructions files, and can be used to track classes and versions of these files, using your own conventions. This could help in discerning which versions are currently in use on which hosts. If this idea turns out to be important to you, we can define particular global directives for a customer-specified machine-class string and a customer-specified version string, so this information could be readily extracted by our tooling and displayed in reports.

Request for Feedback

Managing instructions files is clearly customer-specific, but we would like to hear about customer experiences in that regard so we can improve our description of Best Practices on that topic.

Related Resources