This repository was archived by the owner on Dec 18, 2019. It is now read-only.

Conversation

@earthgecko

Added new alerter: syslog
alert_syslog writes anomalous metrics to syslog on the LOG_LOCAL4 facility at
LOG_WARN priority. This records them in the local syslog and ships them to any
remote syslog as well, so they can be used further down a data pipeline in
Elasticsearch, Riemann, etc.
Modified:
readme.md
src/settings.py.example
src/analyzer/alerters.py
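
The diff itself is not shown here, but a minimal sketch of what such an alerter might look like, assuming it receives the anomalous datapoint and metric name like the other alerters in src/analyzer/alerters.py (the function signature and message format are assumptions):

```python
import syslog


def alert_syslog(alert, metric):
    """
    Write an anomalous metric to syslog on the LOG_LOCAL4 facility at
    LOG_WARNING priority, so the local syslog daemon records it and can
    ship it on to any configured remote syslog target.
    """
    # metric is assumed to be a (datapoint, metric_name) tuple, like the
    # tuples the other alerters receive
    message = 'skyline anomaly :: %s :: %s' % (str(metric[1]), str(metric[0]))
    syslog.openlog('skyline', syslog.LOG_PID, syslog.LOG_LOCAL4)
    syslog.syslog(syslog.LOG_WARNING, message)
    syslog.closelog()
```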

@earthgecko earthgecko closed this Jun 4, 2014
@earthgecko earthgecko deleted the alert_syslog branch June 4, 2014 18:23
@earthgecko
Author

Hi Abe

I deleted it because the Travis build failed (a typo). Added #88, corrected.

But you already know that now :) (I imagine...)

And I was surprised to see skyline is not sending skyline.analyzer.total_anomalies.

And #88 alert_syslog probably needs a rate filter on it so it does not
vomit thousands of anomalies into syslog and start some bad negative
feedback cycle... however, there are no last-run counts to check against.

I am thinking about adding last-run counts to Redis, e.g.:

last seconds to run
last total metrics
last total analyzed
last total anomalies
last exception stats
last anomaly breakdown

Then, in the context of the alert_syslog function at least, rate limits
could be determined, e.g. if current_total_anomalies is greater than 15%
of the last total analyzed, exit alert_syslog gracefully, etc. (a rough
sketch follows).
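
A minimal sketch of such a check, assuming a hypothetical skyline.analyzer.last_total_analyzed key in Redis (skyline does not write last-run counts today):

```python
import redis

REDIS_CONN = redis.StrictRedis(host='localhost', port=6379, db=0)


def syslog_rate_limit_ok(current_total_anomalies):
    """
    Return False when the current run's anomaly count exceeds 15% of the
    number of metrics analyzed in the previous run, so alert_syslog can
    exit gracefully instead of flooding syslog.
    """
    # Hypothetical key - skyline does not currently store last-run counts
    last_total_analyzed = REDIS_CONN.get('skyline.analyzer.last_total_analyzed')
    if last_total_analyzed is None:
        # No previous run recorded yet, allow alerting
        return True
    return current_total_anomalies <= 0.15 * float(last_total_analyzed)
```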

So skyline is just awesome on top of awesome; however, it is fairly hard
to get a grip on alerting as it is super transient. Most of us already
have a data pipeline that we are feeding and analysing with skyline, but
skyline is just not feeding back into that pipeline so we can see it and
manage it in the same way. I know about Oculus, not going there :)

I just see the ability for skyline to feed anomalies back into the
pipeline, where we can rate the events/anomalies, index them, and I
think even timeseries the anomalies themselves:
/opt/graphite/storage/whispers/stats/hostname/memory/cache.wsp
/opt/graphite/storage/whispers/stats/hostname/memory/cache_anomalies.wsp
or
/opt/graphite/storage/whispers/stats/hostname/memory/cache/anomalies.wsp

stats.hostname.memory.cache_anomalies:1|c
or
stats.hostname.memory.cache.anomalies:1|c

Which would be better? :)

That is how to double your whisper storage usage overnight; however, it
could be limited to selected stats I guess, keyword/metric targeted.

Am I mad?

skyline's alerting is hard, yes? We have logstash, Riemann, ES and
graphite in the mix already - feeding some skyline stuff back into the
pipeline should make it better/easier to alert on (aggregated rates,
etc). Dealing with peaky ad_impressions from lots of publishers makes
it seem that way (very peaky). However, all this train of thought is
only at POC at the moment so I could be very wrong. "I once thought I
was wrong, but I was mistaken", as the saying goes :)

Anyway... some thoughts. For the moment, however, I can surface anomalies
in ES via Kibana and search them on keyword targets too, so I can now
rate that to Riemann and on to graphite.

Probably going to make the next pull request:

send_statsd_metric

It seems like a missing one ;)
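
A minimal sketch of what a send_statsd_metric helper might look like, using a plain UDP socket; the helper name and the host/port settings are assumptions, not an existing skyline API:

```python
import socket

# Assumed statsd location - adjust to your environment
STATSD_HOST = '127.0.0.1'
STATSD_PORT = 8125


def send_statsd_metric(payload):
    """
    Fire a raw statsd payload at the statsd UDP port, e.g.
    send_statsd_metric('skyline.analyzer.total_anomalies:1|c')
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode('utf-8'), (STATSD_HOST, STATSD_PORT))
    finally:
        sock.close()
```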

I think if our FULL_DURATION were longer than 86400 (say 9 days), then
our peaky publishers would be more normalised and would not seem so
anomalous... however, a 9-day FULL_DURATION, I know, is not possible ==
'kill machine/s'. That said, would it not be possible to surface the
&rawData=true metrics for certain keyword-targeted anomalies over an
8-day period? Cache the result and analyse cached_surfaced.key + key,
for example? That way, for metrics you knew were really seasonal
(weekly, monthly, yearly), skyline could analyse over the relevant
timeseries, augmenting FULL_DURATION.
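
A rough sketch of that surface-and-cache idea, assuming a graphite-web render endpoint and a Redis cache for the cached_surfaced key; the function name, host and cache TTL are illustrative only:

```python
import redis
import requests

GRAPHITE_URL = 'http://graphite'  # assumed graphite-web host
REDIS_CONN = redis.StrictRedis(host='localhost', port=6379, db=0)


def surface_rawdata(metric, days=8, cache_seconds=3600):
    """
    Pull `days` of rawData for a metric from the graphite render API and
    cache it in Redis under a cached_surfaced.<metric> key, so a longer
    than FULL_DURATION series can be analysed for known seasonal metrics.
    """
    cache_key = 'cached_surfaced.%s' % metric
    cached = REDIS_CONN.get(cache_key)
    if cached is not None:
        return cached
    url = '%s/render?target=%s&from=-%ddays&rawData=true' % (
        GRAPHITE_URL, metric, days)
    response = requests.get(url)
    REDIS_CONN.setex(cache_key, cache_seconds, response.text)
    return response.text
```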

Another thing that occurs to me is that you can shard/segment skyline
across multiple instances with the SKIP_LIST, but that would mean
skyline.hostname.analyzer metrics would be needed. I am just thinking in
terms of separating normal collective operational metrics from
application and business-critical metrics, where it would maybe be
possible to run FULL_DURATION at 6 months on just the business-critical
metrics (an example below). As I said, thoughts.
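
For illustration only, a hedged example of how SKIP_LIST in settings.py could be used to segment metrics across two skyline instances (the namespaces are made up):

```python
# settings.py on the "operational" skyline instance: skip the business
# critical namespaces so they are analysed elsewhere (made-up namespaces)
SKIP_LIST = ['stats.business.', 'stats.ad_impressions.']

# settings.py on the "business critical" skyline instance: skip the
# normal operational namespaces instead, e.g.
# SKIP_LIST = ['stats.hostname.', 'stats.timers.', 'carbon.']
```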

Regards
Gary

PS - Nice to meet you Abe, thanks to you guys for skyline (and statsd)
and all the talks and sharing. Always inspiring.

On 04/06/14 19:34, Abe Stanway wrote:

> why did you delete that?



Gary Wilson
The Wizard of Of

of
gary.wilson@of-networks.co.uk
+44 (0) 117 270 9574
+32671845899
of-networks.co.uk

@astanway
Contributor

That is a lot of thoughts :) Maybe the mailing list would be more appropriate. But thank you!

As for analyzing the anomalies themselves, I have started a bit of work on that: https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py#L232. It's hard, though, and it'll require a more fleshed out design. But you're right, a low hanging fruit would be to track total_anomalies and alert if the number of anomalies spikes.
