Sometimes it's the little things

If you run an illumos distribution that uses beadm (e.g. OmniOS) in production, you may have run into illumos bug #5943.

As you might imagine, we at OmniTI run a lot of OmniOS systems. Recently, a machine that was being upgraded misbehaved on reboot because of that issue. That got me thinking about prevention.

We monitor our systems using, among other things, resmon feeding into Circonus. We can set up alerts in Circonus when certain metrics cross thresholds. Until #5943 gets fixed (probably by the integration of #5061), I decided that it makes sense for us to monitor how many BEs exist on each global zone.

The first step was adding a module to resmon that counts how many BEs there are on the system.
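
To illustrate the idea (this is not the actual resmon module), here is a minimal Python sketch of counting BEs. It assumes that `beadm list -H` prints one machine-parsable line per boot environment, which is how I understand the illumos beadm to behave. The function names are made up for this example.

```python
import subprocess


def count_boot_environments(beadm_output: str) -> int:
    """Count BEs in `beadm list -H` style output.

    Assumption: each non-empty line describes exactly one boot environment,
    so the fields themselves never need to be parsed.
    """
    return sum(1 for line in beadm_output.splitlines() if line.strip())


def current_be_count() -> int:
    """Run beadm and return the number of BEs on this system (hypothetical helper)."""
    result = subprocess.run(
        ["beadm", "list", "-H"],
        capture_output=True,
        text=True,
        check=True,  # raise if beadm exits non-zero rather than report a bogus count
    )
    return count_boot_environments(result.stdout)
```

Calling `current_be_count()` in the global zone would give the number to report as a metric. Keeping the parsing separate from the `beadm` invocation makes the counting logic easy to test on a machine without beadm.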

We're now updating our configuration management code to enable that check and metric in Circonus, and to trigger a non-critical alert when that number gets moderately high. I'm hopeful this will prevent stressful situations where machines fail to boot correctly going forward.