Help Docs ~ Server Monitor

Overview

Enterprise-grade server monitoring without the bloat.

Why Pingdom Server Monitor?

Pingdom Server Monitor (PSM) is server monitoring for the modern tech team: a powerful set of features for the DevOps team and an intuitive UI for the app development folks.

With Pingdom Server Monitor, you get everything you need to monitor your infrastructure in a tight package: flexible dashboards and alerting, deployment in five minutes or less, 80+ monitoring plugins, and no ugly configuration syntax to memorize (do it all via our web UI).

Here's an overview of PSM's key parts:

The agent

The only piece of software you'll install is our monitoring agent, Scoutd. The agent reports metrics every minute over SSL to our servers at server.pingdom.com.

Auto-discovery

Out of the box, our agent reports key system performance metrics like disk, memory, and CPU usage, as well as the resource usage of processes running on the server.

Plugins

There's no need to SSH onto your servers to configure monitoring scripts. With PSM, you'll set up plugins through our web interface.

From Apache to Zookeeper, we likely have you covered with our 80+ plugins, all open source and available on GitHub.

Alerting

Set triggers on any metric reported to PSM (including those from your own custom plugins). You can choose to have PSM notify you of alerts in a variety of ways (including PagerDuty).

Dashboards

PSM's dashboards UI is remarkably powerful yet easy to use. It is custom built from the ground up, so you can put all of your key metrics onto a single, auto-updating page.

Roles

Roles make it easy to keep your monitoring configuration in sync across many servers. Define roles in our UI (examples: load balancer, database, application), then configure plugins and alert triggers.

Servers assigned to a role get all of the role's plugins and alert settings. As you tweak your roles, the servers get those updates as well. Everything stays in sync.

Support for configuration management tools

PSM plays well with your configuration management tool of choice: check out our Chef Cookbook, Puppet Module, or Ansible Playbook.

How we monitor our servers

Surprise! At Pingdom, we use PSM to monitor our own servers. We know the ins and outs of Pingdom Server Monitor better than anyone. Here's how we've set up our monitoring to give us that rock-solid feeling.

Monitor all of the things. Alert only on performance and stability.

We monitor each part of our software stack at PSM with plugins: from MySQL Replication to HAProxy to DelayedJob. We are liberal when tracking metrics and conservative on our alert volume.

Communication is key to any relationship, including your critical relationship with your monitoring software. If your monitoring setup cries wolf too frequently, you'll start to tune it out. Here are our rules for setting up actionable alerting:

Only alert if it impacts performance and/or stability

We don't configure alerts on metrics like web traffic throughput, MySQL query rates, and network traffic. These metrics can fluctuate without making our applications slow or unstable. Instead, we configure alerts on metrics that do cause performance and stability issues (high disk utilization on a database server, high memory usage, MySQL replication failing, etc.). When we get an alert, we then use dashboards to correlate the alert with other metrics that could lead to the root cause.

The SMS rule: is it worth waking up for?

Before configuring a trigger that alerts you via SMS, ask yourself: if this goes off in the middle of the night, will you act on it? If not, it's probably not appropriate for SMS. When you are on-call and your phone goes off, the adrenaline flows. It's difficult to go back to sleep.

Use plateau triggers to filter out non-critical spikes

A brief one-minute disk utilization spike likely won't cause a crippling problem; a ten-minute spike will. Use plateau triggers: these fire only when a metric stays at or beyond a threshold for a sustained period of time (e.g., swap memory usage greater than 50% for ten minutes).
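
To make the difference concrete, here is a minimal Python sketch of a plateau check. It is not PSM's implementation: the function names, the list-of-samples input, and the choice of >= as the comparison are assumptions made for this example. It relies only on the fact that the agent reports once a minute, so a ten-minute window is ten samples.

```python
from typing import Sequence


def peak_breached(samples: Sequence[float], threshold: float) -> bool:
    """Fire as soon as the latest sample reaches the threshold."""
    return bool(samples) and samples[-1] >= threshold


def plateau_breached(samples: Sequence[float], threshold: float, minutes: int) -> bool:
    """Fire only if every one of the last `minutes` samples reaches the threshold.

    Assumes one sample per minute, oldest first (hypothetical input shape).
    """
    if minutes <= 0 or len(samples) < minutes:
        return False  # not enough history to call it sustained
    return all(value >= threshold for value in samples[-minutes:])


# Swap usage (%) over the last ten minutes, one value per minute.
brief_spike = [10, 12, 11, 10, 9, 10, 11, 10, 12, 97]
sustained = [55, 60, 62, 58, 61, 63, 59, 60, 64, 66]

print(peak_breached(brief_spike, 50))         # the spike trips a simple threshold
print(plateau_breached(brief_spike, 50, 10))  # but not a ten-minute plateau
print(plateau_breached(sustained, 50, 10))    # sustained usage does
```

The plateau check trades a few minutes of detection delay for far fewer false alarms, which is usually the right trade for anything that pages a person.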

Use roles to stay in sync

We have groups of servers that each perform similar roles:

  • application servers
  • database servers
  • metric storage servers
  • load balancers
  • utility servers

Each type of server should have the same monitoring configuration. Rather than configuring monitoring on each server, we define a role for each type of server, add plugins and triggers to those roles, and assign those roles to the applicable servers.

This makes maintaining our monitoring setup stupid simple and immediately sets up monitoring on new servers:

  • When you modify a role, all servers with that role get the updated configuration.
  • When a new server reports, it receives the monitoring configuration of the assigned roles.

Roles can be assigned via the Pingdom Server Monitor UI or via the --role flag with the agent. Our configuration management scripts all support setting roles as well.
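
The way roles keep servers in sync can be pictured with a small Python model. This is only an illustrative sketch, not how the PSM server stores configuration: the Role and Server classes and the effective_config method are hypothetical names. The key idea is that a server holds references to its roles rather than a copy of their settings, so its configuration is always derived from the current state of the roles.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Role:
    """A named bundle of plugins and alert triggers (hypothetical model)."""
    name: str
    plugins: Set[str] = field(default_factory=set)
    triggers: Set[str] = field(default_factory=set)


@dataclass
class Server:
    """A server refers to its roles; it never copies their settings."""
    hostname: str
    roles: List[Role] = field(default_factory=list)

    def effective_config(self) -> Dict[str, Set[str]]:
        # Union the settings of every assigned role at lookup time.
        plugins: Set[str] = set()
        triggers: Set[str] = set()
        for role in self.roles:
            plugins |= role.plugins
            triggers |= role.triggers
        return {"plugins": plugins, "triggers": triggers}


all_servers = Role("All Servers", triggers={"Disk Capacity >= 80%"})
database = Role("database", plugins={"MySQL Replication"})

db1 = Server("db1", roles=[all_servers, database])

# Modifying a role is immediately visible on every server that has it.
database.triggers.add("Replication Running <= 0")
print(db1.effective_config())

# A new server picks up the full configuration as soon as roles are assigned.
db2 = Server("db2", roles=[all_servers, database])
print(db2.effective_config() == db1.effective_config())
```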

Try these default triggers

These are triggers we'd recommend for just about any setup. We have these set up on our "All Servers" role:

  • Disk Capacity >= 80%
  • CPU Load >= 20 for 10 minutes
  • % Memory Used >= 80%
  • % Swap Used >= 50%
  • Disk Utilization % >= 95% for 10 minutes

Our MySQL servers have the following triggers:

  • Slow Query Rate increases 30% vs. the previous hour with a minimum of 5 slow queries.
  • Replication Running <= 0 (zero means replication isn't running)
  • Replication Seconds Behind Master >= 60 seconds
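
The slow query trigger above combines a relative increase with a minimum floor, so that going from one slow query to two (a 100% increase) does not raise an alert. A minimal Python sketch of that logic follows. The function name, the parameter names, and the handling of a zero baseline are assumptions for illustration, not PSM's actual evaluation rules.

```python
def trend_breached(
    current: float,
    previous: float,
    min_increase_pct: float = 30.0,
    min_value: float = 5.0,
) -> bool:
    """Fire when `current` is at least `min_increase_pct` percent above
    `previous` AND `current` is at least `min_value`.

    `previous` is the value for the comparison window (here, the previous hour).
    """
    if current < min_value:
        return False  # below the floor: too few events to be meaningful
    if previous <= 0:
        # Assumption: with no baseline, reaching the floor counts as an increase.
        return True
    increase_pct = (current - previous) / previous * 100.0
    return increase_pct >= min_increase_pct


print(trend_breached(current=2, previous=1))    # doubled, but under the floor
print(trend_breached(current=12, previous=10))  # only a 20% increase
print(trend_breached(current=14, previous=10))  # 40% increase and above the floor
```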

Load Balancers (using HAProxy and Apache):

  • Active Servers <= 3 servers
  • Apache 503 errors (Service Unavailable) exceed 500 for 10 minutes

Utility Servers (run DelayedJob):

  • Oldest Waiting Job >= 30 minutes

Three Notification Groups

We have three notification groups:

  • Our default notification group sends emails to our team.
  • Our urgent notification group sends emails and SMS messages to our team.
  • Our empty notification group has no members. It's the /dev/null of notifications.

We never use SMS messages for our staging environment (more on this below).

Use environments to filter out misbehaving staging servers

Things can get a bit wonky in our staging environment. We make frequent deploys to staging and occasionally things go bad.

To filter out staging noise, we do the following:

  • Create a "staging" environment in the PSM UI
  • Assign our staging servers to the staging PSM environment in our Puppet module
  • When configuring triggers, we never assign the "urgent" notification group to the staging environment. We either assign the default or the empty notification group to our triggers for the staging environment.
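
The staging rule above is a policy we apply by hand when configuring triggers. As an illustration only, the same policy could be written as a small guard in Python. Everything here is hypothetical: the channels_for function, the dictionary of groups, and the "production" environment name are assumptions for the example and not part of PSM.

```python
from typing import Dict, List

# The three notification groups described earlier, mapped to their channels.
NOTIFICATION_GROUPS: Dict[str, List[str]] = {
    "default": ["email"],
    "urgent": ["email", "sms"],
    "empty": [],  # the /dev/null of notifications
}


def channels_for(environment: str, group: str) -> List[str]:
    """Return the channels a trigger would notify, enforcing the staging rule."""
    if group not in NOTIFICATION_GROUPS:
        raise KeyError(f"unknown notification group: {group}")
    if environment == "staging" and group == "urgent":
        # Staging noise should never reach anyone's phone.
        raise ValueError("the urgent group must not be assigned to staging triggers")
    return list(NOTIFICATION_GROUPS[group])


print(channels_for("production", "urgent"))  # email and SMS
print(channels_for("staging", "default"))    # email only
print(channels_for("staging", "empty"))      # nobody is notified
```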

Third-Party tools

We have a laser-focus on agent-based monitoring. This means PSM has great vision inside your servers, but doesn't have vision outside your servers.

These are the other critical monitoring tools for our stack:

  • Sentry for application exception tracking
  • Pingdom for outside service checks
  • Deadman's Snitch to ensure our scheduled jobs are running

Open source components

The Pingdom Server Monitor agent is completely open source. The agent is a normal Ruby gem, open for development, available on GitHub, and distributed under the MIT and/or Ruby License (whichever you prefer).

We share a collection of open-source PSM Plugins surrounded and fostered by a community that encourages branching, fixes, and general openness. PSM plugins can be accessed via GitHub.

The PSM Server -- which handles the data collection, analysis, trending, and notifications -- is not open-source. We maintain the server, and keep all your data safe and sound.

All of our open source projects can be found under our GitHub account.
