Standby Notification Server
Why use a Standby Notification Server?
If you have a requirement for redundancy that is not met by having a backup, you can use this option. Monitoring should generally be a continuous service, and should fail over automatically should the main server become unavailable. This is possible with GroundWork using the Standby Notification Server.
A second copy of the majority of your monitoring configuration is automatically synchronized to the Standby server on the fly. Any changes you make to the Nagios Monitoring configuration and GroundWork Messenger are all synchronized. Downtimes are also synchronized.
A Standby server won't send notifications through GroundWork Messenger until and unless the Primary is no longer available or healthy. Once the Primary recovers, notifications from the Standby are once again suppressed.
Standalone or as Parent servers
You can use the Primary-Standby pair in standalone mode, that is, as standalone active check servers with double monitoring. This is also possible when using GDMA, as GDMA can have multiple target servers defined, and so report to both the Primary and Standby. Standby servers don't build externals automatically, so the GDMA systems will start complaining after a few days that they can't get new configurations without the Primary online.
The main way Primary and Standby servers are deployed, however, is as Parents of Parent Managed Child or Child Managed Child servers. This guide covers setting up these deployment scenarios. You can, of course, mix these scenarios as needed to cover your particular deployment needs.
Not a HA solution
Standby Notification Server is not a full High Availability (HA) solution. Only specific tables of specific databases are replicated, and any active monitoring done by the Standby effectively duplicates that done by the Primary. There is no block-level synchronization of disks, redundant journaling, etc. Also, if you do fail over to Standby operation, you can't make changes to the configuration until you recover the Primary server - it is not an "A/B" operational model.
If your requirement is for a full HA solution, please contact firstname.lastname@example.org. We have such solutions for you, but they are not included with GroundWork and come at a premium cost.
You must have:
- Primary and Standby GroundWork hosts (servers)
- SSH (TCP/22) from Primary to Standby
- SSH (TCP/22) from Standby to Primary
- Postgres available on the Primary from the Standby (TCP/5432)
- You need to know the stable DNS name or IP address of both the Primary and Standby servers
- You may want to set a non-default password for the replication role so you should be ready to set one, or to store the randomly generated one that the installer can provide
- The GroundWork user ID must be the same on both systems, with the /gw8 install in that user's home directory. GroundWork requires a user ID on the GroundWork 8 host. Which ID you choose doesn't matter (we use the user gwos), but it has to be identical on the Primary and the Standby.
- The SSH key exchange required needs to be for this user, so it must be possible to SSH into the server as the GroundWork (gwos) user using ssh keys.
- The GroundWork user's home must contain the GroundWork 8 installation (gw8) subdirectory.
- If you need to change these parameters, you can, but the automated scripted installation will need to be adjusted. We suggest you contact GroundWork Support if these conditions don't meet your requirements.
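The connectivity and layout requirements above can be checked from a shell before you start. The following is a minimal sketch and is not part of the GroundWork scripts: the function names are hypothetical, the port check assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available, and the host name in the usage comments is a placeholder.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; these are not shipped with GroundWork.

# Succeeds if a TCP connection to host ($1) and port ($2) opens within 5 seconds.
# Relies on bash's /dev/tcp redirection, so bash must be installed.
check_port() {
    timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Succeeds if a gw8 directory sits directly under the given home directory ($1).
check_gw8_dir() {
    [ -d "$1/gw8" ]
}

# Example usage (run as the GroundWork user, e.g. gwos, on the Standby):
#   check_port primary.example.com 22   && echo "SSH to Primary reachable"
#   check_port primary.example.com 5432 && echo "Postgres on Primary reachable"
#   check_gw8_dir "$HOME"               && echo "gw8 found under $HOME"
```

Run the SSH check in both directions, since the Primary must also reach the Standby on TCP/22.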
GroundWork Monitor Enterprise
There are two versions of the Standby server, and they are specific to GroundWork Monitor Enterprise 8.1.3 and 8.2.0. Don't try to use a version of Standby that does not match your GroundWork version. This document is for the 8.2.0 version.
GroundWork Monitor Enterprise License
There's no restriction on using the license for the Primary on the Standby. You can simply apply the same license to both servers.
If you already have an existing GroundWork 8.2.0 server installed as a parent or a standalone, you can use this setup procedure to add a Standby server to it, making it a Primary. Of course you can add new GroundWork servers as well, and set them up clean in Primary-Standby pairs. Please note that the Standby server must start out as a clean install when you set up the pair, though the Primary can have a working configuration and need not be a clean install.
We provide a scripted procedure to set up a new Standby server to pair with an existing Primary server, as long as it meets the requirements. This procedure makes a backup for you, so you can revert and try again if you make a mistake or hit an error.
Throughout these setup procedures, we will assume that you have both the Primary and Standby running, set up under the gwos user. In addition, we require the gw8 subdirectory be located immediately below the gwos user home e.g., /home/gwos/gw8. This procedure will not work if you have installed GroundWork under the root user, or if you located the gw8 directory elsewhere in the file system. You can contact support if you need to verify that your servers are set up correctly.
Your choice of operational mode (Standalone or Parent) isn't relevant until and unless you decide to add Child servers. You can always switch the mode later if you prefer not to decide now, but we recommend having a complete plan before going through this procedure.
To set up the Primary server, the easiest way is to download the zip file which contains all the required scripts, plugins, and configurations. This procedure requires an offline backup, and will also bounce the GroundWork server, so be prepared for an interruption in monitoring.
- Download the Primary file. Place it in the gwos user home directory. This must be the directory immediately above the gw8 directory, and the location where you land when connecting by SSH.
Expand the file:
tar zxvf sns-primary_8.2.0.tar.gz
Change the directory:
Execute the set-replication-primary.sh script, giving the DNS name or IP of the Standby server as the argument, and optionally the password for the replication role:
./set-replication-primary.sh <Standby IP address> <password>
If you don't supply a password, it will generate one and show it to you so you can use it with the Standby. In either case, please note the password you use.
- When prompted, press Enter to continue.
When the process completes, copy the SSH key that appears on the screen to the Standby server in the gwos user's home .ssh directory, adding it to the authorized_keys file (if it exists), or creating this file if it does not. Make sure the file is restricted to the gwos user only with mode 600, as are the other files in this directory.
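Adding the key by hand can be sketched as a small shell helper. This is an illustration only: install_key is a hypothetical name, the key text in the usage comment is a placeholder for whatever the script printed, and the mode 700 on the .ssh directory is an assumption based on common OpenSSH practice rather than something the setup scripts require.

```shell
#!/bin/sh
# Hypothetical helper; the setup scripts do not provide this.
# Appends a public key line ($1) to authorized_keys in the given .ssh
# directory ($2), skipping the append if that exact line is already present.
install_key() {
    key_line="$1"
    ssh_dir="$2"
    mkdir -p "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    if ! grep -qxF -- "$key_line" "$ssh_dir/authorized_keys"; then
        printf '%s\n' "$key_line" >> "$ssh_dir/authorized_keys"
    fi
    chmod 700 "$ssh_dir"                  # assumption: usual OpenSSH directory mode
    chmod 600 "$ssh_dir/authorized_keys"  # mode 600, restricted to the owning user
}

# Example usage on the Standby, as the gwos user, with the key shown on screen:
#   install_key 'ssh-rsa AAAA... gwos@primary' "$HOME/.ssh"
```

Because the helper checks for an existing identical line first, running it twice with the same key does not create a duplicate entry.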
If you notice any error messages, STOP. Restore from backup, and report the error to support. You can't run this process twice without manually deleting the replication publications and several other changes first. If you find you need to do this, we recommend you restore the backup you took when you started and start again. You can always study the set-replication-primary.sh shell script, or contact GroundWork Support if you find you need to adjust something.
Similarly (and only AFTER you configure the Primary server as above), you can set up the Standby server. Note this must be a clean install of GroundWork 8.2.0. The graphs, events, and report history collected by the server will be unique to it, but the configuration for Nagios Monitoring, SLA reporting, downtimes, and notifications are all replaced by those from the Primary, so it makes sense to start fresh.
- Download the Standby file (under SNS). Place it in the gwos user home directory.
Expand the file:
tar zxvf sns-standby_8.2.0.tar.gz
Change the directory (depending on your location):
Execute the set-replication-standby.sh script, similarly, with the Primary server DNS name or IP address, and the password you used above:
./set-replication-standby.sh <Primary IP address> <password>
The password must match the one used on the Primary, so do not proceed until you are sure you have it.
When the process completes, copy the SSH key that appears on the screen to the Primary server in the gwos user's home .ssh directory, adding it to the authorized_keys file (if it exists), or creating this file if it does not. Make sure the file is restricted to the gwos user only with mode 600, as are the other files in this directory.
If you notice any error messages, STOP. Restore from backup, and report the error to support.
Cross-Monitoring Configuration (Required)
At this point, the Primary and Standby servers are linked, and any changes you make to Configuration > Nagios Monitoring or Configuration > Notifications are made on both servers. You will also need to set up the notification monitoring and management to avoid getting duplicate notifications, however. To do so, you will manually configure cross-monitoring as follows:
On the Primary server, access the Configuration > Nagios Monitoring > Control > Nagios Resource Macros section, and change the value of $USER18$ to the username of the GroundWork user (gwos, in our case). Update the macro value.
If you are already using $USER18$ for something else, use any other unused macro, but note it for adjusting this parameter later.
- Navigate to Nagios Monitoring > Profiles > Profile Importer > Import. Select Uploaded and import the profiles for Primary and Standby hosts and services (4 profiles in all).
- Create a new hostgroup for the pair (optional, but a good idea).
- Add the two servers using Configuration > Nagios Monitoring > Hosts > Host wizard. Apply the Primary and Standby host profiles to the respective hosts.
- Define a service dependency that keeps the dependent service from executing when the primary_health service is in an Unknown, Critical, or Warning state. We recommend adding a descriptive name like PrimaryNotWorking-suppress-turning-messenger-off.
- Define a service dependency that keeps the dependent service from executing when the primary_health service is in an OK state. We recommend the title PrimaryWorking-suppress-turning-messenger-on.
- Apply the PrimaryNotWorking dependency to the messenger_off service and the PrimaryWorking dependency to the messenger_on service on the Standby host you added, as shown (go to Configuration > Nagios Monitoring > Hosts, and drill down to the Standby host to see the services). Make sure to select the Primary server and then click Add Dependency.
- Commit the configuration. You will see a message at the bottom of the commit panel indicating commitscript.sh has run on the Standby, which confirms that the Standby now has the same configuration and cross-monitoring is active.
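The two dependencies defined in the steps above correspond to standard Nagios servicedependency objects. The sketch below prints what such objects could look like. It is an assumption that the GroundWork UI produces definitions equivalent to these, print_dependency is a hypothetical helper, and primary-host and standby-host are placeholder host names. The service names primary_health, messenger_off, and messenger_on come from the imported profiles.

```shell
#!/bin/sh
# Hypothetical generator: prints a Nagios servicedependency object.
# Arguments: primary host, standby host, dependent service, failure criteria.
print_dependency() {
    cat <<EOF
define servicedependency {
    host_name                     $1
    service_description           primary_health
    dependent_host_name           $2
    dependent_service_description $3
    execution_failure_criteria    $4
}
EOF
}

# PrimaryNotWorking: do not run messenger_off while primary_health is
# Warning (w), Unknown (u), or Critical (c).
print_dependency primary-host standby-host messenger_off w,u,c

# PrimaryWorking: do not run messenger_on while primary_health is OK (o).
print_dependency primary-host standby-host messenger_on o
```

The effect is that only one of the two Standby services can execute at any time: messenger_on while the Primary is unhealthy, and messenger_off once it is OK again.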
Configure the Standby for Optimal Operation (Optional)
It's likely that logging in to the Standby will be rare, and you will probably only do so in emergencies. When you do log in, don't make any configuration changes on the Nagios Monitoring, Downtime, or GroundWork Messenger screens. Doing so will break the replication setup, and you will need to recover it with the help of GroundWork Support.
For this reason, we recommend you disable the problematic screens. Here's how:
- Log in to the standby as an Admin role user, and select Administration > Menu Editor.
- Click Delete next to the Nagios Monitoring menu item and confirm in the dialog.
- Repeat for Downtime and Notifications.
You can always restore these options by adding the pages back with the Menu Editor if you need to. However, they should never be used on the Standby, as these options are replicated from the Primary.
Stop Monitoring and Notifications for the Gort Container
On the Standby server you will probably want to delete the Docker Cloud Hub connector. Either that, or make sure no notification rules fire when the Gort container is down. You can do this by adding an exclude rule for this host and its services to your notification rules, but this is onerous if you maintain a lot of them.
The reason for this is that if you do notify for container issues, then you will get a series of 8 service alerts and 1 host alert when the Standby kicks in, since these messages are still queued up and have not expired by the time the Gort container that powers notification starts on the Standby. You will also get an appropriate alert for the primary_health service on the primary host - this alert is special and lets you know the standby is in operation.
To turn off the Docker Cloud Hub connector:
- Go to Configuration → Cloud Hub
- Select the Docker connector that is there by default and click Delete.
If there's more than one, make sure you are deleting the one that points to the local system socket by clicking Modify, and inspecting the configuration:
Procedure: Connecting Child servers to Primary and Standby
This optional section is relevant if you use Child servers (either Parent Managed Child or Child Managed Child). If you are just using the Standby and Primary as a pair, you can skip this section.
When using Child servers, both the Primary and Standby will be in Parent mode, and you will have at least one Child server operating in Parent Managed or Child Managed mode (or more than one of each mode).
Parent Managed Child servers
In this case, you will already have configured a Parent Managed Child server to connect to the Primary as described in the Deploying Parent Child documentation. To connect each Parent Managed Child to the Standby as well:
- Add the credentialed user to the Standby under Administration > Users, just as is described in the Deploying Parent Child link above.
- Access the Primary server and browse to the Configuration > Connectors menu option.
- Click on the existing child server connector:
- Click on the GroundWork Connections tab, and click New Child Connection:
- Add an entry for the Standby server like this. The standbyhost name is the instance name of the Standby parent server:
Child Managed Child servers
- To connect a Child Managed Child server to the Standby, first add the credentialed user to the Standby under Administration > Users.
- Go to the Configuration > Connectors menu option.
- Click the Local Nagios connection to open the details page:
- Click the GroundWork Connections tab:
You probably already have a parent connection to the Primary, since this is already a Child Managed Child server, and you already set it up according to the documentation. To add a new connection to the Standby parent, just click Connect to Parent, and replace the default name of the parent with the name of the Standby:
That's it. The Nagios monitored inventory on the Child will now post results to both the Primary and Standby parents.
Exceptions and Future Enhancements
The Standby notification server is limited in its role. It is not an exact copy of the Primary. There are many reasons to think of it as one, but there are important differences and exceptions. These are listed here.
Note that most of these exceptions are mitigated in the case of using Child servers that do most of the monitoring, and that report results (over TCG) to both Primary and Standby parents.
Here's what's not duplicated:
- Cloud Hub: Cloud Hub configurations are not copied to the Standby server. As Cloud Hub is generally very easy to configure, however, this is not much of an exception. Future versions of the Standby notification server may include Cloud Hub connector configuration copies and associated double monitoring.
- Network Discovery: NeDi data is not synchronized to the Standby notification server. This includes network data capture containers. This may be added in a future version as well.
- Connectors: Other Transit Connection Generator (TCG) connectors, such as the Elastic connector, are not replicated to the Standby. They can be operated from the Standby by following the same setup steps on the Standby as on the Parent, or independently as a Linux service on a separate host.
- GDMA externals, auto-setup instructions: If you are using GDMA without Child servers, you can set the reporting of results to go to two (or more) target servers by listing them as a comma-delimited list in the Target Server directive of the GDMA agent, typically in the host external. See the GDMA documentation for more information. However, GDMA will only look for new configuration files at the first member of this list, typically the Primary server. This means that should the Primary server fail, GDMA will not accept new changes to its configuration. This is usually not an issue. The GDMA will continue operating for the time defined in the Poller_Pull_Failure_Interval value, which is 3 days by default. This is the time you have in which to recover the Primary server.
You can also configure GDMA to connect and report to Child servers, which removes the requirement to list multiple targets. However, in this case the target proxies the configuration file request to the Parent, and this is also set to the Primary, so you still need to recover it within the time allotted.
Auto-setup instructions and triggers are also treated in the same way as externals as far as GDMA is concerned. They are installed on the Primary, and placing them on the Standby will have no effect.
- Nagios Notifications: As of version 1.2.0, the standby server doesn't switch enabling and disabling notification via Nagios. If you need this functionality, please contact GroundWork support for assistance.
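For the GDMA exception above, it can help to work out how long you have to recover the Primary. The sketch below adds the allowed interval to the time the outage began. recovery_deadline is a hypothetical helper, and expressing the interval in whole days is an assumption made for illustration (the text gives the default Poller_Pull_Failure_Interval as 3 days).

```shell
#!/bin/sh
# Hypothetical helper: prints the epoch second by which the Primary must be
# back online, given the epoch second the outage began ($1) and the allowed
# interval in days ($2).
recovery_deadline() {
    echo $(( $1 + $2 * 86400 ))
}

# Example: the outage began now, and the default interval of 3 days applies.
#   recovery_deadline "$(date +%s)" 3
```

With the 3-day default, the deadline is 259200 seconds after the Primary became unavailable.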