occam

"The fewest assumptions" - A simple matching / alerting service for JSON messages.

Overview

Occam is a simple event matching service that lets you apply field matching and alerting logic to a stream of JSON messages, using a simple, declarative Python syntax (stored in checks.py) that is automatically parallelized under the hood. Messages are read from a Redis list (using the list name 'messages'), populated by any means of your choice. More robust queuing systems will be added in future updates.
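Any producer that can push JSON strings onto that list will work. As a minimal sketch, assuming the redis-py client is available (the client library and event fields here are illustrative, not part of Occam):

# Producer sketch using redis-py (assumed client library; not part of Occam).
import json
import redis

r = redis.StrictRedis(host="127.0.0.1", port=6379)
event = {"somefield": "someval"}
# Occam pops messages from the Redis list named 'messages'.
r.lpush("messages", json.dumps(event))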

Example use cases:

  • Collecting package versions from every node in your infrastructure and triggering a HipChat message if specific versions of a given package are reported
  • Firing off a PagerDuty alert if a specific service has triggered a warning n times in x seconds across your whole fleet

The following in checks.py would check if any incoming messages included the field 'somefield' with the value 'someval', sending the output to console upon match:

if inMatch(msg, "somefield", "someval"): outConsole(msg)

A message pushed into the reference Redis list:

% redis-cli lpush messages '{ "somefield": "someval" }'

Starting Occam, yielding the matched message according to the configured check:

% ./occam.py
2015-01-21 11:20:21,920 | INFO | Waiting for Blacklist Rules sync
2015-01-21 11:20:21,922 | INFO | Connected to Redis at 127.0.0.1:6379
2015-01-21 11:20:21,922 | INFO | API - Listening at 0.0.0.0:8080
2015-01-21 11:20:27,127 | INFO | Redis Reader Task Started
2015-01-21 11:20:28,235 | INFO | Event Match: { "somefield": "someval" }

Matching syntax can be nested and chained to require additional conditions:

if inMatch(msg, "@type", "ssh-log") and inMatch(msg, "failed-attempt", "true"):
  outConsole(msg)

The above check would trigger a log to console, given a single message where the fields '@type' and 'failed-attempt' held the values 'ssh-log' and 'true', respectively.

Additional input / output actions exist that allow for more complex logic, such as sub-sampling or different output actions at each depth of matched conditions. A real-life configuration might look like:

# A block of checks for all '@type' 'service-health' messages. 
if inMatch(msg, "@type", "service-health"):
  # If level is critical, alert via PagerDuty immediately.
  if inMatch(msg, "level", "critical"): outPd(msg)
  # If a single host reports 5 warning levels in 60s,
  if inMatch(msg, "level", "warning") and inRateKeyed(msg, "hostname", 5, 60):
    # And if within this warning threshold, 10 times in 30s it's due to the newly released
    # 'new-service', then also send us a PagerDuty message:
    if inMatch(msg, "service", "new-service") and inRate(10, 30):
      outPd(msg)
  # Otherwise, just notify the ops HipChat room:
  else:
    outHc(msg, "ops-room")

Typically, you'd place a top-level 'inMatch' for a whole class of message types (such as all messages where the field '@type' is 'apache'), followed by all of the matching logic respective to that message class. The above block would be all of the matching / alerting logic we care about solely with regard to 'service-health' messages.

Inputs

inMatch

A basic equality check. With the input JSON 'msg', check if 'somefield' = 'somevalue'.

if inMatch(msg, "somefield", "somevalue")

inRegex

Python regex (re) matching. With the input JSON 'msg', check pattern '.*' against the value of 'somefield'.

if inRegex(msg, "somefield", ".*")
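For intuition, the two checks above behave roughly like the following standalone functions (a sketch of the idea, not Occam's actual implementation):

import re

# Rough equivalents of inMatch and inRegex, for intuition only.
def in_match(msg, field, value):
    # True when the field exists and equals the expected value.
    return msg.get(field) == value

def in_regex(msg, field, pattern):
    # True when the field exists and its value matches the pattern.
    return field in msg and re.search(pattern, str(msg[field])) is not None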

inRate

Time window rate check. An anchor function placed within a series of conditionals; it requires that all preceding conditions have been met '5' times within a '30' second window, otherwise the chain of conditions is short-circuited.

inRate(5, 30)

Example:

if inMatch(msg, "somefield", "somevalue") and inRate(5, 30): outConsole(msg)
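Conceptually, 'inRate' maintains a sliding window of hit timestamps and only returns True once the window holds the configured count. A minimal sketch of that idea (illustrative only, not Occam's actual implementation):

import time
from collections import deque

# Sliding-window rate check, illustrative only.
_hits = deque()

def in_rate(count, window):
    now = time.time()
    _hits.append(now)
    # Drop timestamps that have aged out of the window.
    while _hits and now - _hits[0] > window:
        _hits.popleft()
    return len(_hits) >= count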

inRateKeyed

Time window rate check that dynamically generates separate rate checks based on the value of a given message field.

inRateKeyed(msg, "somefield", 5, 30)

Given the following checks logic, we can use a generic rate-checking syntax that only triggers if the rate threshold is met by a single host:

if inMatch(msg, "error-level", "warning") and inRateKeyed(msg, "hostname", 5, 30): outConsole(msg)

If this same check were configured using the basic 'inRate' check as follows:

if inMatch(msg, "error-level", "warning") and inRate(5, 30): outConsole(msg)

The following message stream (within a 30s window) would trigger a match even though no single host exceeded the rate threshold:

'{ "error-level": "warning", "hostname": "host-1" }'
'{ "error-level": "warning", "hostname": "host-1" }'
'{ "error-level": "warning", "hostname": "host-2" }'
'{ "error-level": "warning", "hostname": "host-2" }'
'{ "error-level": "warning", "hostname": "host-2" }'

Outputs

outConsole

Writes 'msg' JSON to stdout upon match.

outConsole(msg)

outPd

Triggers a PagerDuty alert to the specified service_key alias (see config file - multiple service_keys by alias are supported) via the PagerDuty generic API, appending the whole 'msg' JSON output as the PagerDuty alert 'details' body. An incident_key and PagerDuty alert description are automatically generated unless specified as an additional parameter:

outPd(msg, "service-alias", "web01-alerts")

It's also valid to use a portion of the message body to dynamically generate an incident key:

outPd(msg, "service-alias", msg['hostname'])

As well as a combination of a fixed string and unique message data:

outPd(msg, "service-alias", msg['somefield'] + " High Load")

Yielding:

2015-01-10 09:44:31,611 | INFO | Event Match: {'somefield': 'somevalue', '@type': 'type'}
2015-01-10 09:44:31,622 | INFO | Starting new HTTPS connection (1): events.pagerduty.com
2015-01-10 09:44:32,617 | INFO | Message sent to PagerDuty: {"status":"success","message":"Event processed","incident_key":"somevalue High Load"}
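The log above reflects a single POST to PagerDuty's generic events endpoint. A hedged sketch of such a trigger call using the requests library (the helper name and arguments are assumptions, not Occam internals):

import json
import requests

# Sketch of a trigger event against PagerDuty's generic (v1) events API.
def send_pagerduty(msg, service_key, incident_key, description):
    payload = {
        "service_key": service_key,
        "event_type": "trigger",
        "incident_key": incident_key,
        "description": description,
        "details": msg,
    }
    resp = requests.post(
        "https://events.pagerduty.com/generic/2010-04-15/create_event.json",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"})
    return resp.json()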

outHc

Sends a room notification to HipChat through the v2 REST API using a room ID and token. The room ID and token pair is referenced by an alias defined in the config file, with an underscore-delimited <id>_<token> value:

[hipchat]
test-room: 000000_00000000000000000000

An alert output (in checks.py) configured to send a message to the corresponding HipChat room configuration:

if inMatch(msg, "somefield", "somevalue"): outHc(msg, "test-room")
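The '<id>_<token>' config value maps onto HipChat's v2 room-notification endpoint roughly as follows (a sketch; the helper name is an assumption, not Occam's actual code):

import json
import requests

# Sketch of a HipChat v2 room notification; illustrative only.
def send_hipchat(msg, room_config):
    # Config value is "<room id>_<auth token>", e.g. "000000_00000000000000000000".
    room_id, token = room_config.split("_", 1)
    url = "https://api.hipchat.com/v2/room/%s/notification" % room_id
    resp = requests.post(url,
                         params={"auth_token": token},
                         json={"message": json.dumps(msg), "message_format": "text"})
    return resp.status_code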

Outage API

Note: work in progress.

Add Outage

curl localhost:8080/ -XPOST -d '{"outage": "field:value:hours"}'

Get Outages

curl localhost:8080/

Remove Outage

curl localhost:8080/ -XDELETE -d '{"outage": "field:value"}'

The Outage API allows you to maintain a global map of key-value 'blacklist' data; every inbound message is checked against it and immediately dropped upon match. Blacklisting works as follows:

% curl localhost:8080/ -XPOST -d '{"outage": "somefield:somevalue:2"}'
Request Received: {"outage": "somefield:somevalue:2"}
% curl localhost:8080/ -XPOST -d '{"outage": "somefield:anothervalue:6"}'
Request Received: {"outage": "somefield:anothervalue:6"}

Occam propagating the rules to all workers running checks:

% ./occam.py
2015-01-19 15:51:11,905 | INFO | API - Listening at 0.0.0.0:8080
2015-01-19 15:51:21,906 | INFO | Connected to Redis at 127.0.0.1:6379
2015-01-19 15:51:28,609 | INFO | API - Outage Request: where 'somefield' == 'somevalue' for 2 hour(s)
2015-01-19 15:51:33,907 | INFO | Worker-0 - Blacklist Rules Updated: {"somefield": ["somevalue"]}
2015-01-19 15:51:33,909 | INFO | Worker-1 - Blacklist Rules Updated: {"somefield": ["somevalue"]}
2015-01-19 15:51:50,821 | INFO | API - Outage Request: where 'somefield' == 'anothervalue' for 6 hour(s)
2015-01-19 15:51:54,918 | INFO | Worker-1 - Blacklist Rules Updated: {"somefield": ["somevalue", "anothervalue"]}
2015-01-19 15:51:54,919 | INFO | Worker-0 - Blacklist Rules Updated: {"somefield": ["somevalue", "anothervalue"]}

Every inbound message where 'somefield' equals 'somevalue' will be dropped for 2 hours, and for 6 hours where 'somefield' equals 'anothervalue'. Any number of fields and field-values can be specified, each combination with a separate outage duration.
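Conceptually, each worker consults the propagated blacklist map before running any checks, e.g. (illustrative sketch, not the actual implementation):

# Blacklist short-circuit, illustrative only.
def is_blacklisted(msg, blacklist):
    # blacklist looks like {"somefield": ["somevalue", "anothervalue"]}.
    for field, values in blacklist.items():
        if msg.get(field) in values:
            return True
    return False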

Current outages can be fetched via:

% curl localhost:8080/
{
  "Current Outages Scheduled": {
    "somefield": [
      "anothervalue",
      "somevalue"
    ]
  },
  "Occam Start Time": "2015-01-19 09:45:47"
}

Outage data is persisted in a Redis set, populated with a unique hash ID for each field-value pair. Each hash ID references a TTL'd Redis key holding the field-value data. The set is polled every 5 seconds, translated into a blacklist map, then propagated to the worker processes (which log any updates to the blacklist map).

Blacklist data can thus survive Occam service restarts as well as automatically propagate across a fleet of multiple Occam nodes processing a single stream of messages. The Outage API's hashing and storage method allows blacklist data to be written or updated from any node without collision, essentially allowing masterless read/write access to shared data.
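A rough sketch of that storage pattern (the key names 'outages' and 'outage:<hash>' are assumptions for illustration, not Occam's actual schema):

import hashlib
import json
import redis

r = redis.StrictRedis(host="127.0.0.1", port=6379)

# Write side: one hash ID per field-value pair, plus a TTL'd key holding the rule.
def add_outage(field, value, hours):
    rule = json.dumps({"field": field, "value": value})
    rule_id = hashlib.md5(rule.encode()).hexdigest()
    r.sadd("outages", rule_id)
    r.setex("outage:" + rule_id, int(hours) * 3600, rule)

# Poll side: expired TTL keys simply drop out of the rebuilt blacklist map.
def build_blacklist():
    blacklist = {}
    for rule_id in r.smembers("outages"):
        data = r.get(b"outage:" + rule_id)
        if data is None:
            continue  # TTL expired; the outage is over.
        rule = json.loads(data)
        blacklist.setdefault(rule["field"], []).append(rule["value"])
    return blacklist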

Performance

  • Occam uses Redis as a local queue and is built on Python, which is inherently not a very performant language. It's strongly recommended to ensure hiredis is installed.
  • All checks in checks.py are parallelized 'n' ways if 2 or more hardware threads are available, where 'n' = max(multiprocessing.cpu_count()-1, 2) (see the sketch after this list). CPU load depends on the complexity / size of the checks applied.
  • 'inRegex' is significantly more computationally expensive than basic 'inMatch' checks. It's recommended to pre-filter inbound messages to 'inRegex' as much as possible with basic matches.
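The worker-count formula above translates into a standard multiprocessing pool; a minimal sketch (the pool setup is illustrative, not Occam's actual code):

import multiprocessing

# 'n' workers per the formula above: cpu_count minus one, with a floor of 2.
n_workers = max(multiprocessing.cpu_count() - 1, 2)
pool = multiprocessing.Pool(processes=n_workers)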

Misc.

Occam attempts not to drop messages popped from Redis; the reader loop halts on shutdown and workers allow in-flight messages to complete:

^C2015-01-09 10:36:49,211 | INFO | Stopping Reader Threads
2015-01-09 10:36:49,211 | INFO | Waiting for in-flight messages
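A minimal sketch of that drain-on-shutdown pattern (illustrative only; the flag and handler names are assumptions):

import signal

shutting_down = False

def handle_sigint(signum, frame):
    # The reader loop checks this flag and stops popping from Redis;
    # workers then finish whatever messages are already in flight.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGINT, handle_sigint)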

Pending

Lots more inputs/outputs; see open issues.
