Auto Setup Troubleshooting
This page provides advice for what to do when things go wrong, and for how to prevent that from happening in the first place.
Translating Observed Behavior into Likely Causes
Error Detection and Handling
Problems can crop up in any sort of configuration setup. Errors can happen at multiple levels of the processing, ranging from initial intent to communication difficulties to operational problems to results analysis and final implementation. An important question is how easily you can find out that problems have occurred, and determine the nature of those problems once you know that the results are not what you desire. Thus we must discuss how to run tests, how to force error checking early on, where diagnostics can appear, and how to interpret diagnostic messages. The basic questions are:
- Are the auto-discovery instructions correctly constructed?
- Is the auto-discovery trigger file correctly constructed?
- Do we have available the auto-discovery results from the GDMA client?
- When was the last time auto-discovery was run?
- Was the GDMA client's auto-discovery apparently successful?
- If auto-discovery apparently succeeded, did it result in the expected set of services? (This question is particularly important when dynamic sensors are in play.)
- If GDMA Auto-Setup failed at any point, what were the errors?
The autosetup tool can validate the setup files even before they are put into production (autosetup validate file ...). That can be a great help in preventing errors before they occur, whether due to manual editing of such files, mistakes in automatic generation of such files, or mismatches between the setup in those files and the Monarch objects they refer to.
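For example, a quick validation pass might look like the following sketch. Only the documented "autosetup validate file ..." form is taken from this page; the filenames are hypothetical, and whether multiple files can be validated in one invocation is not stated here, so each file is checked separately.

```bash
# Hypothetical filenames; validate each setup file before installing it on
# the server, so obvious mistakes are caught ahead of production use.
autosetup validate file my_host_instructions
autosetup validate file my_host_trigger
```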
Once the setup files are known to be valid, it is possible to initiate a dry-run discovery or discovery-plus-configuration action. A dry-run discovery-only action (last_step = "send_results" in the trigger file) tells the GDMA client to run a pass of auto-discovery, and submit the results to the server along with a flag that says the results should be stored but not otherwise analyzed. A dry-run discovery-plus-configuration action (last_step = "test_configuration" in the trigger file) tells the GDMA client to do the same, but to flag the results as being intended only for storage, analysis, and trial configuration, while not actually putting them into production.
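As a rough sketch of the difference, the two dry-run variants might be requested with trigger-file content like the following. Only the last_step directive and its two values come from this page; the local filename and the idea that nothing more is needed in a minimal trigger file are assumptions, so consult the Auto-Setup reference for the authoritative trigger-file format.

```bash
# Dry-run discovery only: results are sent to the server and stored,
# but not otherwise analyzed.
cat > trigger <<'EOF'
last_step = "send_results"
EOF

# Dry-run discovery plus trial configuration: results are stored, analyzed,
# and trial-configured, but never placed into production.
cat > trigger <<'EOF'
last_step = "test_configuration"
EOF
```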
Discovery results are stored on the server during both dry-run and live-action (place-into-production) actions. This provides some traceability in either case as to how the system is operating. Discovery results are stored only if they at least pass validation on the server as representing a valid discovery-results packet. This prevents arbitrary files from being uploaded to the server via this channel.
The autosetup tool can be used to find and display the last auto-discovery results, whether dry-run or production (see the autosetup print results and autosetup print analysis commands), and to run a local dry-run configuration action on them (see the autosetup audit command, which is not yet implemented). This provides the means to test each step of Auto-Setup processing in a convenient fashion, without endangering ongoing production monitoring.
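For instance, after a dry run has been triggered, you might inspect what the server received and how it was analyzed. The bare subcommands below are the ones named above; any additional arguments (such as selecting a particular host) are deliberately left out rather than guessed at.

```bash
# Show the most recent discovery results stored on the server.
autosetup print results

# Show the server's analysis of those results.
autosetup print analysis
```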
To look for result, analysis, and log information:
- On the GDMA client:
  - You can look at the poller logfile, found in the .../groundwork/gdma/log/ directory. (The GDMA code has been extended with automatic internal logfile rotation, so at least as of GDMA 2.7.1, you should have logging enabled, via the Enable_Local_Logging directive. See the GDMA 2.7.1 Release Notes.) See the example following this list.
  - You can specify a results_file on the discover command line, to capture into a file the discovery results that would otherwise be displayed in the terminal window.
  - Application-level error messages from the discover command are written to its standard output stream, and are currently not logged anywhere. The standard error stream is only used to report errors with the logging mechanism.
- On the GroundWork server:
  - You can look at the registration-script logfile:
    - In GW7, see /usr/local/groundwork/foundation/container/logs/register_agent_by_discovery.log
    - In GW8, see /usr/local/groundwork/monarch/var/gdma/register_agent_by_discovery.log in the monarch container. The content of this logfile is included in the container-as-a-whole log output, but there it is intermixed with Apache logging, which can be a bit confusing.

    This logfile is written by the server code that receives and processes auto-discovery results from the GDMA client. Logged information from concurrent invocations of the server-side scripting is captured and serialized as a block in this logfile, so it does not end up being interleaved and thereby rather confusing.
  - The various autosetup commands display their output in the terminal window, or as you otherwise redirect the standard output stream.
  - Application-level error messages from the autosetup command are written to its standard output stream, and are currently not logged anywhere. The standard error stream is only used to report errors with the logging mechanism.
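Here is a minimal sketch of following these logfiles, assuming a Linux client and a shell on the server. The client install prefix is left as a placeholder (it is the "..." shown above), the poller logfile name is assumed to match *.log, and the docker exec form and monarch container name for GW8 are assumptions about how your containers happen to be run.

```bash
# On the GDMA client: follow the poller logfile.
GDMA_PREFIX="/path/to"   # placeholder for the platform-specific "..." prefix
tail -f "$GDMA_PREFIX"/groundwork/gdma/log/*.log

# On a GW7 server: follow the registration-script logfile.
tail -f /usr/local/groundwork/foundation/container/logs/register_agent_by_discovery.log

# On a GW8 server: the same logfile lives inside the monarch container
# (container name and docker invocation are assumptions).
docker exec monarch tail -f /usr/local/groundwork/monarch/var/gdma/register_agent_by_discovery.log
```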
Here in great detail is what happens if an error is discovered at any point in Auto-Setup processing:
- In GDMA releases through at least 2.7.1, a failure in Auto-Setup processing is not reflected in the status of the gdma_poller service. That might change in future releases, but the issue is complicated.
  - If a pass of auto-discovery fails on a GDMA client, a question arises as to how that failure should be reported. For instance, suppose we have a host which has been previously configured and on which GDMA is operating normally, and we trigger that GDMA client to run auto-discovery, either dry or live. In this case, regardless of whatever other monitored services are to be configured via service externals, the state of the gdma_poller service on that host can be put into a WARNING state (for some form of dry-run action) or a CRITICAL state (for a live action).

    It's possible that some sites might generally drop the gdma_poller and/or gdma_spooler services from their GDMA-host configurations. If customers do keep them, there are issues with central-reporting-failure situations causing failure storms from all of the GDMA hosts, clogging up the event console and possibly not being consolidated into many fewer outgoing alarms. The possibility of alarm storms weighs heavily on our decisions on how errors are handled.
  - Conversely, suppose we have an auto-discovery failure on a new GDMA client, one which has not yet been configured in Monarch. In that case, though the GDMA client may already know about the gdma_poller service, Nagios will not know about the host and therefore will have no place to record and publish such a failure.
- If a GDMA client has difficulty fetching an instructions or trigger file, that fact is logged as an error in the poller logfile. The GDMA poller will continue to attempt to access these files on subsequent polling cycles, and therefore continue to generate local log messages for as long as the failure persists. Note that it is not an error for the client to ask the server for a file and get back a "Not found" response. (Either there is currently no file on the server to fetch, or an attempt has been made to fetch using a different form of the client's hostname, not the one actually used when the file was installed on the server.)
- If a GDMA client has difficulty interpreting a trigger file, the error-handling behavior is the same as when it has difficulty fetching the trigger file.
- Note that the GDMA client can never end up confused by downloading both a dry-run and a live-action trigger file, or multiple trigger files with otherwise conflicting values of the last_step directive. That's because there is only one name for this client's trigger file, the latest downloaded copy always overrides any previously downloaded copy, and our convention of matching the trigger-file and instructions-file timestamps assures that we have the correct instructions to operate with.
That's the general picture, but there is one subtlety to watch out for if you are not consistent in the way you install instructions and trigger files on the server. The GDMA client doesn't know whether the instructions and trigger files on the server were installed using a fully-qualified hostname or a shortname, and for that matter it doesn't know that you used the same lettercase it is expecting. So it has a little dance it goes through to decide which form to use when fetching. That dance involves analyzing several options (Forced_Hostname, Use_Long_Hostname, and Use_Lowercase_Hostname) to decide the exact form and lettercase of the hostname it will use. Also, if an initial fetch fails, it may try using a different form of the hostname. Generally speaking, when you install instructions and trigger files on the server (e.g., using the autosetup install -p command), you should be consistent in specifying whatever form of the hostname is to be used in the monitoring of that machine; that will minimize problems. If you do have issues in this regard, let us know about them so we can perhaps improve the client logic in this area.
- When a GDMA client first reads a discovery-instructions file, it analyzes the full file and looks for obvious errors before it runs any discovery actions. If such faults are found, they are logged locally. In addition, if the client has been told in the trigger file to send results to the server, a discovery-results packet is still sent to the server, but it only pinpoints the problems in the file; no configured discovery actions are run. The GDMA client accepts a response from the server for the sending of this data, and logs that response, but it does not change any of its ongoing internal configuration or behavior. In this case, the server stores the discovery results so they can be manually inspected, using the autosetup tool. In the current software releases, no notification is generated for a discovery failure.
- If a GDMA client has difficulty running auto-discovery sensors, those failures are logged locally, and no changes are made to the GDMA client's configuration. In addition, if the client has been told in the trigger file to send results to the server, the same sort of error handling occurs as just described in the previous item.
- If a GDMA client has difficulty sending auto-discovery results to the server, that failure is logged locally, and no further action is taken in this polling cycle. The next polling cycle re-runs the discovery, hoping it will be able to successfully send this time around.
- If a GDMA client gets a bad (in some way incoherent) response from the server after sending auto-discovery results, that failure is logged locally, and no further action is taken on the client in this polling cycle. Depending on what happened on the server, some failure message may appear in the registration-script logfile there. The next polling cycle re-runs the discovery, hoping it will be able to successfully send this time around.
- If a GDMA client gets a bad (failed processing) response from the server after sending auto-discovery results, that failure is logged locally, and no further action is taken on the client, either in this polling cycle or in future polling cycles. The registration-script log on the server should contain more details. No notifications are sent out, because of possible alarm storms if you had lots of GDMA clients fail in the same way all at once. (We would want to have some sort of event-consolidation mechanism in place before we enabled such a thing.)
- If the server has difficulty parsing or otherwise interpreting and validating the auto-discovery results, it logs the failure on the server and sends a failed-processing response back to the client. The auto-setup server code does not currently send a failure event to Foundation.
The server does not touch the trigger file, so the GDMA client could in theory find it again at some future time and decide that it ought to re-run auto-setup based on that trigger. However, the client keeps track of the overall state, and knowing that the server processing failed, it will not run another pass of discovery until it is explicitly re-triggered by installation of a new trigger file on the server.
If the failure was strictly in parsing the results as a valid discovery packet, those results may be discarded and not stored on the server. If that happens, for forensic diagnostic purposes, those results should still be accessible on the GDMA client, in the .../groundwork/gdma/tmp/ directory.
- If the server has difficulty storing either the discovery results or results-analysis files, it logs the failure on the server and sends a failed-processing response back to the client. The auto-setup server code does not send a failure event to Foundation. The file content involved in the failed storage attempt is discarded, but the discovery results should still be accessible on the GDMA client. Analysis results won't be there, but if the discovery results managed to get stored on the server, the analysis results can be regenerated from them for diagnostic purposes using the autosetup tool (that is, once the autosetup analyze command is implemented). There probably won't be any need to do so, since solving the storage problem takes first priority, and then you will likely want to re-trigger a new pass of discovery to make sure the results are completely up-to-date.
- If the server has difficulty registering configuration changes in the monarch database, whether dry-run or live-action, it logs the failure on the server and sends a failed-processing response back to the client. That should stop the client from pounding the database with additional failures. It is up to the site administrator to notice the failure and take action that allows configuration to proceed. Specific hosts that had been asked to run a discovery can be re-triggered once the database problem is cleared up. If the administrator is unsure of which specific hosts were so involved, it may be possible to trigger a pass of live-action auto-setup on all GDMA hosts, with if_duplicate set to optimize (or, in the worst case, force, though optimize should be tried first; a minimal sketch of such a trigger-file directive appears after this list). Before you attempt that, think through the extent to which various service changes in your infrastructure may be inadvertently included by such a run. The discovery results and the discovery-analysis results will have been saved on the server, and can therefore be inspected there (using the autosetup tool) to aid in figuring out why the configuration changes could not be stored in the database.
- If the server has difficulty building externals after registering changes in the monarch database, it logs the failure on the server and sends a failed-processing response back to the client. The fact that a failure occurred at this late stage means that it was definitely a live-action pass of auto-setup, not a dry run. The client will naturally continue to use whatever externals it may have previously downloaded from the server.

  If a problem occurs while building externals, the auto-setup server code won't necessarily know whether the failure was due to a database problem or a file-writing problem, since we don't generally distinguish those categories when reporting a failure to build externals. As before, it is up to the site administrator to notice the failure and take action that allows configuration (building externals) to finish. Specific hosts that had been asked to run a live-action discovery can be re-triggered once the underlying problem is cleared up. If the administrator is unsure of which specific hosts were so involved, it may be possible to trigger a pass of live-action auto-setup on all GDMA hosts, with if_duplicate set to optimize (or, in the worst case, force, though optimize should be tried first). Before you attempt that, think through the extent to which various service changes in your infrastructure may be inadvertently included by such a run.
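As referenced in the list above, here is a minimal sketch of the if_duplicate directive as it would appear in a trigger file used to re-trigger hosts after a server-side failure is cleared up. The rest of a live-action trigger file, including the appropriate last_step value, is not shown on this page and is deliberately omitted; the local filename is hypothetical.

```bash
# Add the if_duplicate directive to the trigger file you are about to install;
# "optimize" should be tried before "force".
cat >> trigger <<'EOF'
if_duplicate = "optimize"
EOF
```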
Generally speaking, the GDMA client looks at all the information it has available regarding local or server-side faults, and makes the best decision it can: either re-use the last instructions and trigger files it downloaded from the server for another pass of auto-discovery at the next polling cycle, or block such future runs until it receives a new trigger file. In either case, the client continues to store both files locally for forensic diagnostics. The stored copies might also be used during the fetching of new copies of these files, so the timestamps on the stored copies can be sent to the server even if the GDMA client has been bounced since the last fetching. With that data, the server can decline to send new copies and instead respond with a 304 Not Modified if nothing has changed on the server side. This is a small optimization, to avoid excessive bandwidth being used in fruitless repeated downloading of unchanged files.
To avoid some of the possible indecision points noted above ("which hosts do I need to re-trigger after fixing some problem?"), it is always best to check the results of auto-setup soon after you expect them to have completed. The autosetup tool is your friend in that regard.
Specific Error Messages
Most error messages should be self-explanatory. A few of them reflect a fairly complicated underlying condition which is better explained in detail here. The specific object names mentioned here are arbitrary and do not necessarily reflect the actual objects you may encounter in your configuration.
Message | Description
---|---
when checking the intended setup for service 'linux_load', found duplicate values of instance_suffix ('_foo') in sensor results yielding service 'linux_load' and host_profile 'gdma-linux-host' | Most likely, service
when checking the intended setup for service 'cacti', found duplicate values of instance_suffix ('_foo') in sensor results yielding service_profile 'ssh-unix' and service_profile 'service-ping' | The two named service profiles are both assigned the same
when checking the intended setup for service 'myapp', found duplicate values of instance_suffix ('_foo') in sensor results yielding service 'myapp' and service_profile 'service-ping' | Most likely, service profile
In all of those situations, the approach is the same — one side or the other needs to be gracious and step aside, to avoid the conflict.
Note that the solutions suggested above only take into account the impact on running discovery itself. There may be other implications of adjusting the content of host profiles or service profiles. Consider the bigger picture before you make changes. For instance, you might wish to avoid using sensors that specify a service_profile directive, and instead just rely on sensors that include a service directive to configure an individual service.
In the future, the conflicts noted in those error messages could potentially be ignored if in fact there were no differences in the details of the generated setup for the two matching sensors. That would take additional checking of the discovery results, which we have not yet implemented in the current discovery registration code.
Test Procedures
Speed of testing is an important consideration; everyone wants fast turnaround when changes are made to either the auto-discovery instructions or the setup in the Nagios configuration tool, to verify that they will have the intended effects. So how can you make changes and quickly trigger a re-discovery on the GDMA client, without waiting for the full client production polling cycle to come around, find another trigger file on the server, and run auto-discovery? If the GDMA machine is not yet in production, you could shorten the polling cycle there, but that's ugly: you don't want to modify the client setup just for test purposes. To solve that problem, the auto-discovery code is packaged so it can be run independently of the GDMA client poller, as a separate program running most of the same underlying code. On the GDMA client, we provide such a program, called discover, that is able to initiate auto-discovery and run it.
In contrast to the actions of the GDMA poller, the discover test tool does not reach over to the server for instructions and trigger files. Instead, it relies on you to place copies of the files you want to test directly on the GDMA client. That makes sense for the scenarios in which discover will find use, where you are quickly verifying the operation of those files and modifying them in an immediate-test situation. Once things are working to your satisfaction, you can move copies of the files you wish to preserve back to the server and into your master repository.
File locking is used in both the poller and the test tool, to block both programs from trying to run auto-discovery at the same time. This locking also blocks more than one copy of the test tool from running at the same time. Note that this locking does not prevent the poller from downloading new copies of the instructions and trigger files from the server while you are testing with the discover tool, so you will likely want to be editing and testing copies of those files which are separate from those used by the GDMA poller. (The poller stores the copies that it downloads from the server in the .../groundwork/gdma/tmp/ directory, whose absolute location depends on the GDMA platform.)
discover logs only to the standard output stream, except in truly exceptional conditions, where the standard error stream might be used if an internal program bug is sensed. So errors and notices are spilled into the terminal window by default, and there will be no logfile to look in unless you redirect the standard output to a file.
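Since discover writes everything of interest to its standard output, a simple redirection gives you a reviewable record of a test run. The output path below is arbitrary, and any discover command-line options (including how instructions files and a results_file are specified) are documented on the Tools and Files for Auto Setup page rather than guessed at here.

```bash
# Capture a local test run of auto-discovery for later review.
discover > /tmp/discover_test.out 2>&1
less /tmp/discover_test.out
```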
See Tools and Files for Auto Setup for usage of the discover tool.
Exception Cases
Client Access of Trigger and Instruction Files
When a GDMA client reaches over to the server for auto-discovery trigger and instructions files, it must use particular filenames for that purpose. Especially during initial configuration, that may turn out to be problematic. That's because the triggering of a new pass of auto-setup needs to be controllable on a per-host basis, meaning the files must be named on a per-host basis. But the obvious ways to name such a file, using either the fully-qualified hostname or the client IP address, may sometimes be unreliable.
- Some customers use complex subnetting and perhaps NAT setup that might end up with multiple machines appearing to themselves like they have the same hostnames.
- Machines can have multiple network interfaces, with no clear priority as to which one should be chosen to represent "the" IP address of the machine.
- Machines can have multiple hostnames, depending on which network interface and IP address is used to derive the hostname.
- Sometimes DNS goes down, so any mapping you might want to depend on from that source might be inaccessible just when you happen to need it.
- At some customer sites, DNS is not even available, perhaps because the GDMA client sits in some DMZ enclave where DNS is deliberately disabled.
- At some customer sites, a DNS lookup might return the wrong result. One such failure mode ends up declaring the hostname to be localhost, no matter what the actual hostname is.
If you suspect such situations might be interfering with use of Auto-Setup at your site, the first step toward diagnosis is to look in the client's poller logfile. That should list the filenames that the client used in its attempts to fetch the files.
In the general GDMA case, we address this otherwise intractable problem with the Forced_Hostname directive. The value for this directive is returned from a successful auto-registration, and is recorded in the gdma_override.conf file to be used thereafter in communications with the server. That sort of thing may or may not be appropriate in addressing the situations described above.
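For illustration only, a Forced_Hostname entry in gdma_override.conf might look like the following. The hostname is made up, the config-file location under the GDMA install prefix is an assumption, and whether you append to the file or edit it by hand is a site decision.

```bash
# Pin the hostname the client presents to the server (illustrative values).
GDMA_PREFIX="/path/to"   # placeholder for the platform-specific install prefix
cat >> "$GDMA_PREFIX"/groundwork/gdma/config/gdma_override.conf <<'EOF'
Forced_Hostname = "gdma-client-01.example.com"
EOF
```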
The best fix is to arrange the system-configuration data available to the GDMA client so that it automatically picks the right hostname. That might involve fixing your DNS to correctly reflect the GDMA client, or it might involve adding the right hostname to the client's /etc/hosts or C:\Windows\System32\drivers\etc\hosts file. If such strategies do not work, contact GroundWork Support for assistance.
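If you go the hosts-file route on a Linux client, the entry is a single line mapping an address to the hostname form you want GDMA to use; the IP address and names below are purely illustrative.

```bash
# Append an illustrative entry to the client's hosts file (run as root).
printf '192.0.2.25\tgdma-client-01.example.com\tgdma-client-01\n' >> /etc/hosts
```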