aiScaler implements automated monitoring and alerting

Posted by Max Robbins on October 25th, 2010

aiScaler reports a rich set of statistics and instrumentation information via CLI, SNMP and Web interfaces.
While we suggest the use of these facilities as part of your regular monitoring and troubleshooting arsenal, we feel SNMP is ideally suited for charting and reporting functions. This includes historical (trend) analysis if your software supports it.

To simplify matters and to let you be ready on day one, aiScaler provides an automated alerting capability. A number of critical parameters can be set for monitoring. When aiCache detects values that are outside the configured boundaries (“out-of-spec”), alert emails are generated and sent to your chosen email addresses.

Alerting is done in a fashion that prevents alert floods, a common mishap known to overflow mailboxes and cause apathy, loss of attention and sleep deprivation among IT staff.

No more than one alert is generated per minute. When multiple error conditions are detected, they are aggregated into a single alert. No “back-to-normal” alerts are generated; an “OK” condition is indicated by the absence of alerts.
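The throttling and aggregation behaviour described above can be pictured with a small Python sketch. This is only an illustration of the idea, not aiScaler's actual implementation; the class name `AlertThrottle`, its methods and the injectable clock are all hypothetical.

```python
import time


class AlertThrottle:
    """Illustrative only: collect out-of-spec conditions and release
    at most one combined alert per interval (60 seconds by default)."""

    def __init__(self, min_interval_sec=60.0, clock=time.monotonic):
        self.min_interval_sec = min_interval_sec
        self.clock = clock        # injectable so the logic can be tested
        self.pending = []         # conditions seen since the last alert
        self.last_sent = None     # time of the last alert, None if never sent

    def report(self, condition):
        """Record one out-of-spec condition; repeats are kept only once."""
        if condition not in self.pending:
            self.pending.append(condition)

    def flush(self):
        """Return one aggregated alert body, or None if nothing is due."""
        if not self.pending:
            return None           # all clear: no "OK" message is produced
        now = self.clock()
        if self.last_sent is not None and now - self.last_sent < self.min_interval_sec:
            return None           # too soon; conditions wait for the next alert
        body = "\n".join(self.pending)
        self.pending = []
        self.last_sent = now
        return body
```

Conditions reported while the throttle is closed are not lost in this sketch; they are carried into the next alert, which is one plausible reading of "aggregated".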

Alerts can be global or website-specific. Each type can have its own alert email address defined. Global alerts contain global out-of-spec conditions and also include (are a superset of) all website alerts that were generated at the same time.

Let’s explain this with an example. Assuming aiScaler accelerates news.acme.com, video.acme.com and boards.acme.com, global alerts are sent out whenever a problem is detected with any of these sites. Each of the websites can have its own alert email address defined, which is used when that specific website is in an out-of-spec condition.
You can alert the helpdesk when any of the sites is affected, while problems with boards.acme.com are reported to the boards team, problems with news.acme.com to the news team, and so on.

This way you can make sure you don’t alert the wrong teams or, in a shared hosting setup, expose one customer’s information to other customers.
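As a rough sketch of this routing, the Python function below sends each website's conditions to that website's own address and sends everything to the global address. The function name, argument shapes and email addresses are made up for illustration and do not reflect aiScaler internals or its configuration syntax.

```python
def route_alerts(global_conditions, website_conditions, global_email, website_emails):
    """Illustrative only: map each recipient address to the conditions it receives.

    global_conditions:  list of condition strings not tied to one website
    website_conditions: dict of hostname -> list of condition strings
    website_emails:     dict of hostname -> alert address (may omit a site)
    """
    outbox = {}
    combined = list(global_conditions)
    for host, conditions in website_conditions.items():
        if not conditions:
            continue                      # healthy site, nothing to send
        tagged = [f"{host}: {c}" for c in conditions]
        combined.extend(tagged)           # global alert is a superset
        address = website_emails.get(host)
        if address:
            outbox.setdefault(address, []).extend(tagged)
    if combined and global_email:
        outbox.setdefault(global_email, []).extend(combined)
    return outbox
```

With a problem on boards.acme.com only, the hypothetical boards address and the helpdesk address each receive that condition, and the news and video teams receive nothing.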

Here’s the list of global conditions that you can alert on, along with example values:

． alert_bad_req_sec 20
． alert_max_cache_entries 10000
． alert_client_conn_max 30000
． alert_client_conn_min 5
． alert_os_conn_max 20
． alert_os_fails_sec 2
． alert_req_sec_max 2500
． alert_req_sec_min 10
． alert_os_rt 200

Here’s the list of website conditions that you can alert on, along with example values:

． alert_max_cache_entries 10000
． alert_os_fails_sec 2
． alert_req_sec_max 2500
． alert_req_sec_min 10
． alert_os_rt 200

In addition to these, global and website alerts are issued whenever a disabled origin server is detected. As you recall, origin servers are disabled by aiScaler when they fail to pass a health check. If no health checking is configured for a website, this condition is not reported, so make sure you set up and enable health check monitoring if you want to be alerted on the state of origin servers.
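The threshold-style conditions in the lists above can be read as simple comparisons against a configured limit. The Python sketch below assumes that the two `_min` directives are lower bounds and that every other directive is an upper bound; that reading, the function name and the idea of keying measurements by directive name are assumptions made for illustration.

```python
# Assumed lower bounds; every other alert_* directive is treated as an upper bound.
LOWER_BOUNDS = {"alert_client_conn_min", "alert_req_sec_min"}


def out_of_spec(thresholds, readings):
    """Illustrative only: describe readings that violate their threshold.

    thresholds: dict of directive name -> configured limit
    readings:   dict of directive name -> currently measured value
    Directives without a reading are skipped.
    """
    problems = []
    for name, limit in thresholds.items():
        if name not in readings:
            continue
        value = readings[name]
        if name in LOWER_BOUNDS:
            if value < limit:
                problems.append(f"{name}: {value} is below {limit}")
        elif value > limit:
            problems.append(f"{name}: {value} is above {limit}")
    return problems
```

Using the example values from the lists, a reading of 3 for `alert_req_sec_min` (limit 10) and 450 for `alert_os_rt` (limit 200) would both be flagged.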

The names of condition settings are self-explanatory, but here are some quick suggestions:

Monitor origin server response time by configuring alert_os_rt. This way, if the origin servers ever slow down, you are alerted.

Monitor the number of cached responses by configuring alert_max_cache_entries, which alerts when the number of cached responses grows past the set limit. This might indicate that you need to enable parameter busting or to ignore the query string in its entirety for some URLs.

Set up monitoring for origin server failures by configuring alert_os_fails_sec. An abnormally high number normally indicates serious problems with origin servers, App or DB servers, or other backend infrastructure. Set alert_os_conn_max to monitor for a growing number of origin server connections, which typically indicates a similar problem.

Monitor the number of requests per second by configuring alert_req_sec_max and alert_req_sec_min. Abnormally low or high numbers might indicate a problem or an attention-worthy condition (e.g. where did all the users go? problems with uplink connectivity? content going viral?).

Monitor the number of bad requests per second by configuring alert_bad_req_sec. High numbers might mean you’re under a DoS attack and are being bombarded with malformed requests and/or bogus connections.

It is difficult to define defaults that work for all sites, so we recommend you take time to decide what constitutes an out-of-spec condition for your particular setup.

Look Ma, I can now sleep at night: as a special gift to the overworked and under-appreciated IT crowd, you can disable alerting between the hours of midnight and 7am local time by specifying the alert_humane directive in the global and/or website sections.

Using this also makes sense if traffic to the affected sites drops to very low numbers during these hours, causing monitors to fire on things like low req/sec or low client connections.
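The quiet window itself is easy to picture. The Python sketch below shows the check in its simplest form, assuming the window starts at midnight and ends just before 7:00 local time; whether 7:00 sharp falls inside the window is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, time

QUIET_START = time(0, 0)   # midnight, local time
QUIET_END = time(7, 0)     # 7am, local time (assumed exclusive)


def alerts_suppressed(now, humane_enabled):
    """Illustrative only: True when alerting would be muted at local time `now`."""
    if not humane_enabled:
        return False
    return QUIET_START <= now.time() < QUIET_END
```

For example, `alerts_suppressed(datetime(2010, 10, 25, 3, 30), True)` is true, while the same call at 23:59, or with the flag off, is false.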
