Skip to content
Jim Pivarski edited this page Mar 15, 2015 · 1 revision
  1. summary Tutorial with examples
  2. labels Featured

Introduction

Cassius is a plotting toolkit, intended for interactive use. As such, the commands that users type have been simplified as much as possible. To start Cassius (see [Installation] if necessary), start an interactive Python session and load the user-level objects:

linux> python
Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from cassius import *
>>> import random, math

The from `module import *` idiom is normally discouraged in Python as it loads objects into global scope, rather than keeping them in separate namespaces. However, it is inconvenient to type “`cassius.`” in front of everything (or even “`c.`“) when you’re working interactively. A complete list of objects loaded by this statement is presented [http://174.129.21.118/cassius/html/organization here].

Histogram

The two basic plot containers in Cassius are `Histogram` and `Scatter`. Histograms represent distributions, including bar charts and (someday) pie charts, in a single interface. Scatters represent general X-Y data, including time-series, parametric lines, and data with error bars. Here are some examples:

>>> h = Histogram(20, 0., 100.)
>>> h
<Histogram 20 0 100 at 0x1d0edd0>
>>> print h
bin                            value
========================================
[0, 5)                         0
[5, 10)                        0
[10, 15)                       0
[15, 20)                       0
[20, 25)                       0
[25, 30)                       0
[30, 35)                       0
[35, 40)                       0
[40, 45)                       0
[45, 50)                       0
[50, 55)                       0
[55, 60)                       0
[60, 65)                       0
[65, 70)                       0
[70, 75)                       0
[75, 80)                       0
[80, 85)                       0
[85, 90)                       0
[90, 95)                       0
[95, 100)                      0

>>> for i in xrange(10000):
...     h.fill(random.gauss(50., 10.))
...
>>> print h
bin                            value
========================================
[0, 5)                         0
[5, 10)                        0
[10, 15)                       2
[15, 20)                       6
[20, 25)                       41
[25, 30)                       169
[30, 35)                       421
[35, 40)                       876
[40, 45)                       1515
[45, 50)                       1937
[50, 55)                       1920
[55, 60)                       1501
[60, 65)                       932
[65, 70)                       447
[70, 75)                       168
[75, 80)                       52
[80, 85)                       12
[85, 90)                       1
[90, 95)                       0
[95, 100)                      0
>>> view(h)

If you have the optional viewer installed, a window will pop up containing a plot that looks like the one below. If not, you will get an error message telling you to `draw` the plot to a file and look at it with an external viewer:

>>> draw(h, fileName="/tmp/tmp.svg")
linux> inkview tmp.svg

http://cassius.googlecode.com/files/Tutorial_10.png

In this tutorial, I will use `view`, but you can always substitute `draw`. Note that you can also `drawpdf`, which works the same way (see the [http://174.129.21.118/cassius/html/reference_svgdraw svgdraw module]).

Now, one must always label one’s axes...

>>> h.xlabel = "x"
>>> h.ylabel = "counts per bin"
>>> view(h)

http://cassius.googlecode.com/files/Tutorial_20.png

In the above, we attached an `xlabel` and a `ylabel` to the histogram. Nearly all options can be passed as constructor arguments

>>> h2 = Histogram(20, 0., 100., xlabel="x", ylabel="counts per bin")

or set as data members (`h2.xlabel`, `h2.ylabel`).

Arguments specifying the coordinate frame (frame arguments) can also be specified directly in the `view` or `draw` command, overriding the object but not “sticking” to the object (see [http://174.129.21.118/cassius/html/percolation Percolation] for details). This allows for quicker experimentation.

>>> view(h, xlabel="I like this title better")

http://cassius.googlecode.com/files/Tutorial_30.png

Histograms can, of course, be colored:

>>> h.fillcolor = "lightblue"
>>> view(h)

http://cassius.googlecode.com/files/Tutorial_40.png

See [http://174.129.21.118/cassius/html/reference_histograms#histogram-1-d-distribution-with-uniform-binning Histogram] and [http://174.129.21.118/cassius/html/reference_color Manipulating color] for details.

We can make some dramatic changes to the tick-marks when we need to:

>>> h.xticks = {0.: "zero",       10.: None,
...             20.: "twenty",    30.: None,
...             40.: "forty",     50.: None,
...             60.: "sixty",     70.: None,
...             80.: "eighty",    90.: None,
...             100.: "hundred"}
...
>>> view(h)

http://cassius.googlecode.com/files/Tutorial_50.png

and logarithmic axes have been provided for, but not yet debugged:

>>> view(h, ylog=True)
ERROR

(Logarithmic axes will be supported in a future version of Cassius.)

Rebinning histograms after they have been filled is usually limited to reducing the number of bins by a multiple of an integer. That’s not convenient. If you would like to arbitrarily rebin histograms or change their endpoints, you can give the histogram a “store,” which is a copy of the input data associated to the histogram. (The `storelimit`, the maximum number of saved entries, can be set to None for no limit.)

>>> h2 = Histogram(20, 0., 100., storelimit=None)
>>> for i in xrange(10000):
...     h2.fill(random.gauss(80., 10.))
...
>>> view(h2)

http://cassius.googlecode.com/files/Tutorial_52.png

Now we can rebin or change the endpoints however we like.

>>> h2.reshape(300, 0., 200.)
>>> view(h2)

http://cassius.googlecode.com/files/Tutorial_54.png

The store allows us to make a distinction between estimators derived from the histogram’s bins and estimators derived from the raw data:

>>> h2.reshape(20, 0., 100.)
>>> h2.mean(), mean(h2.store())
(79.53923778301403, 80.1229660042749)

Sometimes we want to study distributions in a categorical space. For that, there is `HistogramCategorical`, which shares some code with `Histogram` (they both derive from `HistogramAbstract`).

>>> hcat = HistogramCategorical(bins=["one", "two", "three", "four", "five",
...                                   "six", "seven", "eight", "nine", "ten"],
...                             fillcolor="yellow")
...

>>> for i in xrange(10000):
...     x = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five", 6: "six",
...          7: "seven", 8: "eight", 9: "nine", 10: "ten"}[random.randint(1, 10)]
...     hcat.fill(x)
...

>>> print hcat
bin                            value
========================================
"one"                          1013
"two"                          1015
"three"                        1037
"four"                         965
"five"                         1022
"six"                          972
"seven"                        1016
"eight"                        943
"nine"                         1004
"ten"                          1013
>>> view(hcat)

http://cassius.googlecode.com/files/Tutorial_60.png

Some would call the above a bar chart.

In categorical datasets dominated by a few values (not this case), it is often useful to only plot the top N categories:

>>> view(hcat.top(5))

http://cassius.googlecode.com/files/Tutorial_70.png

A pie chart has the same numerical content, but is drawn as wedges of a circle. Therefore, pie charts should be an alternate drawing mode for `HistogramCategorical` (not yet implemented).

Scatter

Scatter-plots are produced with the `Scatter` class:

>>> xlist = [random.uniform(0., 100.) for i in xrange(100000)]
>>> ylist = [random.uniform(0., 100.) for i in xrange(100000)]
>>> scat = Scatter(x=xlist, y=ylist, limit=100)
>>> view(scat)

http://cassius.googlecode.com/files/Tutorial_80.png

The `limit` parameter protected our SVG renderer from having to draw 100000 data points (which would be solid black). When a `limit` is specified (by default it’s `None`), only a representative subsample is drawn. (The subsample is uniformly selected, it’s not the beginning or the end of the list!) The `limit` applies to how many points are shown in the graphic, so if you zoom in, you’ll always see the same level of detail.

There are two ways to load data into a scatter-plot: via two separate lists (as above) or by interpreting one matrix:

>>> bothlist = [[random.uniform(0., 50.), random.uniform(0., 100.)] for i in xrange(100)]
>>> scat = Scatter(bothlist, sig=["x", "y"])
>>> view(scat, xmin=0., xmax=150., ymin=0., ymax=150.)

http://cassius.googlecode.com/files/Tutorial_100.png

>>> scat = Scatter(bothlist, sig=["y", "x"])
>>> view(scat, xmin=0., xmax=150., ymin=0., ymax=150.)

http://cassius.googlecode.com/files/Tutorial_90.png

Rather than having a separate container for polygon data, we can make polygons from a `Scatter` just by turning off the markers and turning on a line connecting the points:

>>> scat.connector = "unsorted"
>>> view(scat)

http://cassius.googlecode.com/files/Tutorial_110.png

>>> scat.marker = None
>>> view(scat)

http://cassius.googlecode.com/files/Tutorial_120.png

It’s like a time-series when the lines connect points from left to right:

>>> scat.connector = "xsort"
>>> view(scat)

http://cassius.googlecode.com/files/Tutorial_130.png

It’s even more like a timeseries when the x axis is interpreted from and/or to time strings:

>>> minutes = totimestring([1e9 + 60*i for i in xrange(1000)], "%d/%m/%Y at %H:%M")
>>> minutes[:4]
['09/09/2001 at 01:46', '09/09/2001 at 01:47', '09/09/2001 at 01:48', '09/09/2001 at 01:49']

>>> data = [math.sin(2.*math.pi*i/1000) + random.gauss(5., 1.) for i in xrange(1000)]

>>> ts = TimeSeries(x=minutes, y=data, informat="%d/%m/%Y at %H:%M", outformat="%I %p")
>>> view(ts)

http://cassius.googlecode.com/files/Tutorial_135.png

The `TimeSeries` class derives from `Scatter`, and is only a thin wrapper around it.

Plots with error bars work the same way: they’re Scatter objects with a bit more data:

>>> dataplot = Scatter(x=[1., 2., 3., 4., 5.],
...                    y=[3., 4., 1., 5., 2.],
...                    ey=[0.1, 0.15, 0.2, 0.1, 0.5])
...
>>> view(dataplot)

http://cassius.googlecode.com/files/Tutorial_140.png

There are also x error bars and asymmetric error bars.

Layouts and Overlays

Now that we have made different kinds of plots, we want to put them side-by-side or on top of each other to see relationships between them by eye. Side-by-side plots are produced with a `Layout`:

>>> view(Layout(2, 1, h, h2))

http://cassius.googlecode.com/files/Tutorial_150.png

The structure of this command makes it easy to rearrange the plots quickly, and `Layout` objects can be nested to produce more complex arrangements (see [http://174.129.21.118/cassius/html/reference_organization#layout-a-grid-of-plots Layout]).

On-top-of-each-other plots are produced with an Overlay:

>>> view(Overlay(h, h2))

http://cassius.googlecode.com/files/Tutorial_160.png

Swapping the top and bottom plot requires very few keystrokes:

>>> view(Overlay(h2, h))

http://cassius.googlecode.com/files/Tutorial_170.png

We can also ask the `Overlay` to use the histograms’ own coordinate frame, not one built around the union of all the data (see [http://174.129.21.118/cassius/html/reference_organization#overlay-multiple-plots-in-the-same-coordinate-frame Overlay] for more):

>>> view(Overlay(h, h2, frame=0))

http://cassius.googlecode.com/files/Tutorial_180.png

For the special case of histograms, we can also `Stack`:

>>> h2.fillcolor = "pink"
>>> view(Layout(2, 1, Stack(h, h2), Stack(h2, h)))

http://cassius.googlecode.com/files/Tutorial_185.png

Again, the order determines which is on top.

Gridlines, legends, and such

Plot annotations are objects in the same sense as `Histograms` and `Scatters`, so they are simply overlaid:

>>> scat.connector = None
>>> scat.marker = "circle"

>>> grid = Grid(regular(5.), regular(5.))
>>> view(Overlay(grid, scat))

http://cassius.googlecode.com/files/Tutorial_190.png

which gives you the choice of lines-on-bottom (graph paper) or lines-on-top (jail).

Legends work the same way:

>>> legend = Legend([[scat, "data!"]], justify="c")
>>> view(Overlay(grid, scat, legend))

http://cassius.googlecode.com/files/Tutorial_200.png

(The legend is free-form: you can put whatever you want in it. See [http://174.129.21.118/cassius/html/reference_legend Legend] for more.)

There are some limited 2-D plots, but they’re in development.

The `Curve` class provides mathematical curves (see [http://174.129.21.118/cassius/html/reference_curves Mathematical curve]). This is the part of Cassius that provides curve-fitting as well, though that is in developement (actually, just porting from Minuit to SciPy).

Realistic case: analyzing a large dataset

Now that we can plot random numbers, let’s look at some real data. If you haven’t already, download [http://code.google.com/p/augustus/downloads/detail?name=Augustus-0.4.3.1.tar.gz Augustus 0.4.3.1.tar.gz] and unpack it:

linux> tar -xzvf Augustus-0.4.3.1.tar.gz

The file

Augustus-0.4.3.1/examples/highway/real/data/scoring.csv

contains 25 MB of motor vehicle speeds measured by traffic monitoring devices in the Chicago area.

We load it into Python with `inspect` (`inspect` only works if Augustus <= 0.5.2 is installed):

>>> from cassius import *
>>> import math, itertools
>>> highway = inspect("Augustus-0.4.3.1/examples/highway/real/data/scoring.csv")

and take a look at the number of records (rows)

>>> len(highway)
188529

the names of the features (column headers)

>>> highway.keys()
[‘fielddeviceid’, ‘locationtimestamp’, ‘devicestatus’, ‘datastatus’, ‘locstatus’, ‘lastupdatetime’, ‘volume’, ‘speed’, ‘occupancy’]

and the first few entries:

>>> highway.scan()
fielddevicei locationtime devicestatus   datastatus    locstatus lastupdateti       volume        speed    occupancy
====================================================================================================================
'IL-TESTTSC-   1174330073 'NON_OPERATI 'FIELD_DATA_ 'LOCATION_RE 1175058600000            0            0            0
'IL-TESTTSC-   1174330073 'NON_OPERATI 'FIELD_DATA_ 'LOCATION_RE 1175058600000            0            0            0
'IL-TESTTSC-   1174330073 'NON_OPERATI 'FIELD_DATA_ 'LOCATION_RE 1175058600000            0            0            0
'IL-TESTTSC-   1174330073 'NON_OPERATI 'FIELD_DATA_ 'LOCATION_RE 1175058600000            0            0            0
'IL-TESTTSC-   1174330073 'NON_OPERATI 'FIELD_DATA_ 'LOCATION_RE 1175058600000            0            0            0
'IL-TESTTSC-   1174330073 'OPERATIONAL 'FIELD_DATA_ 'LOCATION_RE 1175058600000           78      22.2683      2.75029
'IL-TESTTSC-   1174330073 'OPERATIONAL 'FIELD_DATA_ 'LOCATION_RE 1175058600000          219       24.431      7.25412
'IL-TESTTSC-   1174330073 'OPERATIONAL 'FIELD_DATA_ 'LOCATION_RE 1175058600000           87      17.0116       3.7582
'IL-TESTTSC-   1174330073 'OPERATIONAL 'FIELD_DATA_ 'LOCATION_RE 1175058600000           61      18.4633      2.71703
'IL-TESTTSC-   1174330073 'OPERATIONAL 'FIELD_DATA_ 'LOCATION_RE 1175058600000           99      16.9542      4.30594

The “speed” field is probably pretty interesting; let’s take a look:

>>> #                      what to plot   cuts to apply
>>> view(highway.histogram("speed",       "speed > 0."))

http://cassius.googlecode.com/files/Tutorial2_10.png

(The `speed > 0.` requirement eliminates `NaNs`.)

Presumably, people drive on different streets at different speeds. The “fielddeviceid” column indicates the location where the speed was measured. How many of them are there?

>>> len(highway.unique("fielddeviceid"))
807

Too many to work with my hand— our analysis will need to be automated.

>>> # simplify the table to just arrays, selected with speed > 0.
>>> fielddeviceid, speed = highway("fielddeviceid, speed", "speed > 0.")

>>> # make a histogram for each "fielddeviceid" (segmentation)
>>> histograms = {}
>>> for f, s in itertools.izip(fielddeviceid, speed):
...     if f not in histograms:
...         histograms[f] = Histogram(70, 0., 70.)
...         histograms[f].name = f     # name is something I just made up
...     histograms[f].fill(s)
...

>>> # sort a list of these histograms by average speed
>>> histlist = histograms.values()
>>> histlist.sort(lambda a, b: cmp(a.mean(), b.mean()))

>>> # plot the average speeds of each histogram (with error bars)
>>> byspeed = Scatter(x=range(len(histlist)),
...                   y=[h.mean() for h in histlist],
...                   ey=[h.stdev()/math.sqrt(h.entries) for h in histlist])
...
>>> lines = Grid(horiz=[20., 30.])
>>> view(Overlay(lines, byspeed), xlabel="sorted 

http://cassius.googlecode.com/files/Tutorial2_20.png

There are three rough categories: slow roads (average is less than 20 MPH), fast roads (average is greater than 30 MPH), and the rest of them, which constitutes about 80%.

>>> # just look at the fast ones
>>> fast_ones = [h for h in histlist if h.mean() > 30.]

>>> # make a stack of histograms and give each one a different color
>>> colors = lightseries(len(fast_ones))
>>> for h, c in zip(fast_ones, colors):
...     h.fillcolor = c
...
>>> stack = Stack(*fast_ones, xlabel="speed")

>>> # make a legend linking colors to street names
>>> legend1 = Legend(names[:len(names)/2], justify="cl", textscale=0.5, x=0., anchor="tl", colwid=[0.2, 0.8])
>>> legend2 = Legend(names[len(names)/2:], justify="cl", textscale=0.5, x=1., anchor="tr", colwid=[0.2, 0.8])

>>> view(Overlay(legend1, legend2, stack, frame=-1))

http://cassius.googlecode.com/files/Tutorial2_30.png

That one at 60 MPH certainly stands out: it’s the Edens, which runs along Lake Michigan. I’ll just pick out a slow, a fast, and two weirdos.

>>> one = [h for h in fast_ones if h.name == "IL-TESTTSC-KENNEDY-W-2128"][0]
>>> two = [h for h in fast_ones if h.name == "IL-TESTTSC-ROUTE_53-N-11109"][0]
>>> three = [h for h in fast_ones if h.name == "IL-TESTTSC-I_57-N-9017"][0]
>>> four = [h for h in fast_ones if h.name == "IL-TESTTSC-EDENS-S-1037"][0]

>>> # format them nicely for viewing
>>> for h in one, two, three, four:
...     h.xlabel = h.name
...     h.bottommargin = 0.2
...     h.xlabeloffset = 0.2
...
>>> view(Layout(2, 2, one, two, three, four))

http://cassius.googlecode.com/files/Tutorial2_40.png

That bimodal distribution in the Kennedy, I wonder if it might be related to time.

>>> speedvstime = highway.timeseries("lastupdatetime, speed",
...                                  "fielddeviceid == 'IL-TESTTSC-KENNEDY-W-2128'",
...                                  outformat="%b %y")
>>> view(speedvstime)

http://cassius.googlecode.com/files/Tutorial2_50.png

Yup. It is. That’s probably a data-quality issue— cars don’t oscillate sharply between 15 MPH and 40 MPH on a yearly basis.

We could go on and on; there are a lot of interesting features in this dataset. However, we have seen how Cassius provides a flexible, interactive way to do data analysis. Much more detail on each class and function is provided in the [http://174.129.21.118/cassius/html/ full documentation].

Clone this wiki locally