This repository was archived by the owner on Dec 18, 2019. It is now read-only.

Conversation

@earthgecko

Added new alerter: syslog
alert_syslog writes anomalous metrics to syslog on the LOG_LOCAL4 facility at
LOG_WARN priority. This records them in the local syslog and ships them to any
remote syslog as well, so they can be used further down a data pipeline in
Elasticsearch, Riemann, etc.
Modified:
readme.md
src/settings.py.example
src/analyzer/alerters.py
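
The diff itself is not shown here, but a minimal sketch of what such an alerter might look like, assuming it receives the anomalous datapoint and metric name like the other alerters in src/analyzer/alerters.py (the function signature and message format are assumptions):

```python
import syslog


def alert_syslog(alert, metric):
    """
    Write an anomalous metric to syslog on the LOG_LOCAL4 facility at
    LOG_WARNING priority, so the local syslog daemon records it and can
    ship it on to any configured remote syslog target.
    """
    # metric is assumed to be a (datapoint, metric_name) tuple, like the
    # tuples the other alerters receive
    message = 'skyline anomaly :: %s :: %s' % (str(metric[1]), str(metric[0]))
    syslog.openlog('skyline', syslog.LOG_PID, syslog.LOG_LOCAL4)
    syslog.syslog(syslog.LOG_WARNING, message)
    syslog.closelog()
```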

@earthgecko earthgecko closed this Jun 4, 2014
@earthgecko earthgecko deleted the alert_syslog branch June 4, 2014 18:23
@earthgecko
Author

Hi Abe

I deleted it because the Travis build failed (a typo). Added #88, corrected.

But you already know that now :) (I imagine...)

And I was surprised to see skyline is not sending skyline.analyzer.total_anomalies.

And #88 alert_syslog probably needs a rate filter on it so it does not
vomit thousands of anomalies into syslog and start some bad negative
feedback cycle... however, there are no last-run counts to check against.

I am thinking about adding last-run counts to Redis, e.g.:

last seconds to run
last total metrics
last total analyzed
last total anomalies
last exception stats
last anomaly breakdown

Then, in the context of the alert_syslog function at least, rate limits
could be determined, e.g. if current_total_anomalies is greater than 15%
of the last total analyzed, exit alert_syslog gracefully, etc. (a rough
sketch follows).
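
A minimal sketch of such a check, assuming a hypothetical skyline.analyzer.last_total_analyzed key in Redis (skyline does not write last-run counts today):

```python
import redis

REDIS_CONN = redis.StrictRedis(host='localhost', port=6379, db=0)


def syslog_rate_limit_ok(current_total_anomalies):
    """
    Return False when the current run's anomaly count exceeds 15% of the
    number of metrics analyzed in the previous run, so alert_syslog can
    exit gracefully instead of flooding syslog.
    """
    # Hypothetical key - skyline does not currently store last-run counts
    last_total_analyzed = REDIS_CONN.get('skyline.analyzer.last_total_analyzed')
    if last_total_analyzed is None:
        # No previous run recorded yet, allow alerting
        return True
    return current_total_anomalies <= 0.15 * float(last_total_analyzed)
```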

So skyline is just awesome on top of awesome; however, it is fairly hard
to get a grip on alerting as it is super transient. Most of us already
have a data pipeline that we are feeding and analysing with skyline, but
skyline is just not feeding back into that pipeline so we can see it and
manage it in the same way. I know about Oculus, not going there :)

I just see the ability for skyline to feed anomalies back into the
pipeline, where we can rate the events/anomalies, index them, and I
think even timeseries the anomalies themselves:
/opt/graphite/storage/whispers/stats/hostname/memory/cache.wsp
/opt/graphite/storage/whispers/stats/hostname/memory/cache_anomalies.wsp
or
/opt/graphite/storage/whispers/stats/hostname/memory/cache/anomalies.wsp

stats.hostname.memory.cache_anomalies:1|c
or
stats.hostname.memory.cache.anomalies:1|c

Which would be better? :)

That is how to double your whisper storage usage overnight; however, it
could be limited to selected stats I guess, keyword/metric targeted.

Am I mad?

skyline's alerting is hard, yes? We have logstash, Riemann, ES and
graphite in the mix already - feeding some skyline stuff back into the
pipeline should make it better/easier to alert on (aggregated rates,
etc). Dealing with peaky ad_impressions from lots of publishers makes
it seem that way (very peaky). However, all this train of thought is
only at POC at the moment so I could be very wrong. "I once thought I
was wrong, but I was mistaken", as the saying goes :)

Anyway... some thoughts. For the moment, however, I can surface anomalies
in ES via Kibana and search them on keyword targets too, so I can now
rate that to Riemann and on to graphite.

Probably going to make the next pull request:

send_statsd_metric

It seems like a missing one ;)
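
A minimal sketch of what a send_statsd_metric helper might look like, using a plain UDP socket; the helper name and the host/port settings are assumptions, not an existing skyline API:

```python
import socket

# Assumed statsd location - adjust to your environment
STATSD_HOST = '127.0.0.1'
STATSD_PORT = 8125


def send_statsd_metric(payload):
    """
    Fire a raw statsd payload at the statsd UDP port, e.g.
    send_statsd_metric('skyline.analyzer.total_anomalies:1|c')
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode('utf-8'), (STATSD_HOST, STATSD_PORT))
    finally:
        sock.close()
```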

I think if our FULL_DURATION were longer than 86400 (say 9 days), then
our peaky publishers would be more normalised and would not seem so
anomalous... however, a 9-day FULL_DURATION, I know, is not possible ==
'kill machine/s'. That said, would it not be possible to surface the
&rawData=true metrics for certain keyword-targeted anomalies over an
8-day period? Cache the result and analyse cached_surfaced.key + key,
for example? That way, for metrics you knew were really seasonal
(weekly, monthly, yearly), skyline could analyse over the relevant
timeseries, augmenting FULL_DURATION.
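
A rough sketch of that surface-and-cache idea, assuming a graphite-web render endpoint and a Redis cache for the cached_surfaced key; the function name, host and cache TTL are illustrative only:

```python
import redis
import requests

GRAPHITE_URL = 'http://graphite'  # assumed graphite-web host
REDIS_CONN = redis.StrictRedis(host='localhost', port=6379, db=0)


def surface_rawdata(metric, days=8, cache_seconds=3600):
    """
    Pull `days` of rawData for a metric from the graphite render API and
    cache it in Redis under a cached_surfaced.<metric> key, so a longer
    than FULL_DURATION series can be analysed for known seasonal metrics.
    """
    cache_key = 'cached_surfaced.%s' % metric
    cached = REDIS_CONN.get(cache_key)
    if cached is not None:
        return cached
    url = '%s/render?target=%s&from=-%ddays&rawData=true' % (
        GRAPHITE_URL, metric, days)
    response = requests.get(url)
    REDIS_CONN.setex(cache_key, cache_seconds, response.text)
    return response.text
```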

Another thing that occurs to me is that you can shard/segment skyline
across multiple instances with the SKIP_LIST, but that would mean
skyline.hostname.analyzer metrics would be needed. I am just thinking in
terms of separating normal collective operational metrics from
application and business-critical metrics, where it would maybe be
possible to run FULL_DURATION at 6 months on just the business-critical
metrics (an example below). As I said, thoughts.
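
For illustration only, a hedged example of how SKIP_LIST in settings.py could be used to segment metrics across two skyline instances (the namespaces are made up):

```python
# settings.py on the "operational" skyline instance: skip the business
# critical namespaces so they are analysed elsewhere (made-up namespaces)
SKIP_LIST = ['stats.business.', 'stats.ad_impressions.']

# settings.py on the "business critical" skyline instance: skip the
# normal operational namespaces instead, e.g.
# SKIP_LIST = ['stats.hostname.', 'stats.timers.', 'carbon.']
```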

Regards
Gary

PS - Nice to meet you Abe, thanks to you guys for skyline (and statsd)
and all the talks and sharing. Always inspiring.

On 04/06/14 19:34, Abe Stanway wrote:

> why did you delete that?



Gary Wilson
The Wizard of Of

of
gary.wilson@of-networks.co.uk
+44 (0) 117 270 9574
+32671845899
of-networks.co.uk

@astanway
Contributor

That is a lot of thoughts :) Maybe the mailing list would be more appropriate. But thank you!

As for analyzing the anomalies themselves, I have started a bit of work on that: https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py#L232. It's hard, though, and it'll require a more fleshed out design. But you're right, a low hanging fruit would be to track total_anomalies and alert if the number of anomalies spikes.
