At TrueAccord, we take our service availability very seriously. To ensure our service is always up and running, we are tracking hundreds of system metrics (for example, how much heap is used by each web server), as well as many business metrics (how many payment plans have been charged in the past hour).
We set up monitors for each of these metrics on Datadog, that when triggered, will page an on call engineer. The trigger is usually based on some threshold for that metric.
As our team grew and more alerts were added we noticed three problems with Datadog:
- Any member of our team can edit or delete alerts in Datadog’s UI. The changes may be intentional or accidental, though our team prefers to review changes before they hit production. In Datadog, the review stage is missing.
- Due to the previous problem, sometimes an engineer would add a new alert with uncalibrated thresholds to datadog to get some initial monitoring for a newly written component. As Murphy’s law would have it, the new alert would fire at 3am waking up the on call engineer, and it may not even indicate a real production issue, but a miscalibrated threshold. A review system could better enforce best practices for new alerts.
- Datadog also does not expose a way to indicate that an alert should only be sent during business hours. For example, for some of our batch jobs, it is okay if they fail during the night, but we want an engineer to address it first thing in the morning.
To solve these problems, we made DogPush. It lets you manage your alerts as YAML files that you can check in your source control. So you can use your existing code review system to review them, and once they’re approved they get automatically pushed to DataDog — Voila! In addition, it’s straightforward to setup a cron job (or a Jenkins job) to automatically mute the relevant alerts outside business hours.
DogPush is completely free and open source – check it out here.