Merge pull request #5 from CAIDA/improve-docs
Improve docs and licensing
salcock authored Apr 22, 2021
2 parents 9839641 + 59d79a3 commit b166c09
Showing 12 changed files with 418 additions and 5 deletions.
33 changes: 33 additions & 0 deletions LICENSE
@@ -0,0 +1,33 @@
# This software is Copyright © 2021 The Regents of the University of
# California. All Rights Reserved. Permission to copy, modify, and distribute
# this software and its documentation for educational, research and non-profit
# purposes, without fee, and without a written agreement is hereby granted,
# provided that the above copyright notice, this paragraph and the following
# three paragraphs appear in all copies. Permission to make commercial use of
# this software may be obtained by contacting:
#
# Office of Innovation and Commercialization
# 9500 Gilman Drive, Mail Code 0910
# University of California
# La Jolla, CA 92093-0910
# (858) 534-5815
# [email protected]
#
# This software program and documentation are copyrighted by The Regents of the
# University of California. The software program and documentation are supplied
# "as is", without any accompanying services from The Regents. The Regents does
# not warrant that the operation of the program will be uninterrupted or
# error-free. The end-user understands that the program was developed for
# research purposes and is advised not to rely exclusively on the program for
# any reason.
#
# IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION,
# EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE. THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED
# HEREUNDER IS ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO
# OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
# MODIFICATIONS.
76 changes: 73 additions & 3 deletions README.md
@@ -4,17 +4,87 @@ This package provides an interface for fast processing of the STARDUST avro
data files using python.

Data formats that are currently supported are:
- * flowtuple data
- * RSDOS attack data --- coming soon
+ * flowtuple v3 data
+ * flowtuple v4 data
+ * RSDOS attack data

### Installation

Dependencies:
* pywandio -- https://github.com/CAIDA/pywandio (note: STARDUST users should
already have the python[3]-pywandio package installed on their VM)
* cython


```
- make install
+ make && make install
```

or

```
USE_CYTHON=1 pip install --user .
```

### Examples
Simple example programs that demonstrate the API for each of the
supported formats can be found in the `examples/` directory.


### General Usage

I strongly recommend having the code for one of the examples available
when you read this section, as it should help clarify much of what is
being explained here.

---

Step 1: Create an instance of the reader for the data format that you wish to
read, passing in a valid wandio path to the input file (a swift URI or a
path to a file on disk) as a parameter.

Examples of valid reader classes are: `AvroFlowtuple3Reader`,
`AvroFlowtuple4Reader`, and `AvroRsdosReader`.

Step 2: Invoke the `start()` method for the reader instance.

Step 3: Define a callback method that you wish to be invoked for each Avro
record that has been read from your input file.

The method must take two arguments: the record itself and a `userarg`
parameter. The `userarg` parameter lets you pass additional state (for
example, a dictionary of counters) into the callback from outside the
callback's own scope.

There are some common methods that are available for any Avro record object:
* `asDict()` -- returns all fields in the Avro record as a python dictionary
(key = field name, value = field value).
* `getNumeric(attributeId)` -- returns the value for a specific field that
has a numeric value (e.g. IP address, port, counter, timestamp, etc.)
* `getString(attributeId)` -- returns the value for a specific field that has
a string value (e.g. a geo-location tag)
* `getNumericArray(attributeId)` -- returns a list of values that have been
stored in the record as an array of numbers

The attributeIds for each data format are listed on the pyavro-stardust wiki
at https://github.com/CAIDA/pyavro-stardust/wiki/Supported-Data-Formats

In terms of efficiency, I would recommend using `asDict()` if your callback
function needs to access more than 3 different fields in the record, as the
function call overhead of calling methods like `getNumeric()` multiple times
will quickly add up to exceed the cost of calling `asDict()` once and having
every value available.
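
As a rough illustration of the two access patterns, the sketch below uses a
hypothetical stand-in record class (`FakeRecord`) so that it runs without
pyavro-stardust installed. The field names, and the use of plain strings as
attribute IDs, are assumptions made for this example only; the real
attributeIds are the ones listed on the wiki page above.

```python
# Hypothetical stand-in for an Avro record object, so that this sketch runs
# without pyavro-stardust. It mimics only the asDict() and getNumeric()
# methods described above; the field names are made up for illustration.
class FakeRecord:
    def __init__(self, fields):
        self._fields = dict(fields)

    def asDict(self):
        # A single call returns every field at once.
        return dict(self._fields)

    def getNumeric(self, attribute_id):
        # One call per field.
        return self._fields[attribute_id]


def few_fields(rec, userarg):
    # Only two fields are needed, so individual getters are enough.
    userarg["packets"] += rec.getNumeric("packet_cnt")
    userarg["last_ts"] = rec.getNumeric("timestamp")


def many_fields(rec, userarg):
    # More than three fields are needed, so fetch the dictionary once.
    d = rec.asDict()
    key = (d["src_ip"], d["dst_ip"], d["src_port"], d["dst_port"])
    userarg[key] = userarg.get(key, 0) + d["packet_cnt"]


rec = FakeRecord({"src_ip": 1, "dst_ip": 2, "src_port": 80,
                  "dst_port": 443, "packet_cnt": 3,
                  "timestamp": 1619049600})

totals = {"packets": 0, "last_ts": None}
few_fields(rec, totals)

flows = {}
many_fields(rec, flows)
```

Whether three fields is the right crossover point for a given workload is
best checked by timing both variants against real data.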

Step 4: Invoke the `perAvroRecord()` method on your reader instance, passing
in your callback function as the first argument. If your callback is
going to make use of the `userarg` parameter, then the intended value for
`userarg` should be passed in as the optional second argument.

This function call will only complete once your callback has been applied to
every individual Avro record in the input file.

Step 5: Invoke the `close()` method on your reader instance.

Step 6: Do any final post-processing or output writing that your analysis
requires.
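
The six steps above can be sketched end to end as follows. To keep the
sketch runnable without pyavro-stardust or a STARDUST data file, it uses
hypothetical stand-ins (`FakeReader`, `FakeRecord`) that mimic only the
methods named in this section; a real program would instead construct one of
the reader classes listed in Step 1 with a wandio path. The `packet_count`
field name is taken from the RSDOS example, and the `try`/`finally` around
the read is a design choice of this sketch, not something the library
requires.

```python
# Hypothetical stand-ins so the sketch runs without pyavro-stardust.
class FakeRecord:
    def __init__(self, fields):
        self._fields = dict(fields)

    def asDict(self):
        return dict(self._fields)


class FakeReader:
    def __init__(self, records):
        self._records = records
        self.is_open = False

    def start(self):
        self.is_open = True

    def perAvroRecord(self, callback, userarg=None):
        # Returns only after the callback has seen every record.
        for fields in self._records:
            callback(FakeRecord(fields), userarg)

    def close(self):
        self.is_open = False


# Step 3: the callback receives the record and the userarg value.
def per_record(rec, stats):
    d = rec.asDict()
    stats["records"] += 1
    stats["packets"] += d["packet_count"]


def run(reader):
    # State is carried in userarg rather than in global variables.
    stats = {"records": 0, "packets": 0}
    reader.start()                                # Step 2
    try:
        reader.perAvroRecord(per_record, stats)   # Step 4
    finally:
        reader.close()                            # Step 5
    return stats


# Step 1 (stand-in): two made-up records instead of a real input file.
reader = FakeReader([{"packet_count": 10}, {"packet_count": 5}])
result = run(reader)

# Step 6: final output.
print("Attacks:", result["records"], "Packets:", result["packets"])
```

Passing a dictionary as `userarg` avoids the `global` statements used in the
bundled examples, which can make the callback easier to reuse and test.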

5 changes: 4 additions & 1 deletion examples/flowtuple4-example.py
@@ -1,4 +1,4 @@
- # Example code that uses the AvroFlowtuple3Reader extension class to
+ # Example code that uses the AvroFlowtuple4Reader extension class to
# count flowtuples via a perFlowtuple callback method

import sys
@@ -17,6 +17,9 @@
# Incredibly simple callback that simply increments a global counter for
# each flowtuple, as well as tracking the number of packets for each
# IP protocol
#
# We also report some stats on the most common TTLs, packet sizes and TCP flag
# combinations that our flowtuple records contain
def perFlowtupleCallback(ft, userarg):
    global counter, protocols
    counter += 1
35 changes: 35 additions & 0 deletions examples/rsdos-example.py
@@ -0,0 +1,35 @@
# Example code that uses the AvroRsdosReader extension class to count
# DOS attacks via a perDos callback method

import sys
from pyavro_stardust.rsdos import AvroRsdosReader, RsdosAttribute, \
        AvroRsdos

count = 0
attack_pkts = 0

def perDosCallback(rsdos, userarg):
    global count, attack_pkts

    count += 1
    dos = rsdos.asDict()
    attack_pkts += dos['packet_count']

    # Ideally, we'd do things with the other fields in 'dos' as well,
    # but this is just intended to be a very simple example

def run():
    # sys.argv[1] must be a valid wandio path -- e.g. a swift URL or
    # a path to a file on disk
    reader = AvroRsdosReader(sys.argv[1])
    reader.start()

    # This will read all of the attack records and call `perDosCallback` on
    # each one
    reader.perAvroRecord(perDosCallback)
    reader.close()

    # Display our final results
    print("Attacks", count, " Packets:", attack_pkts)

run()
34 changes: 34 additions & 0 deletions src/pyavro_stardust/baseavro.pxd
@@ -1,3 +1,37 @@
# This software is Copyright (C) 2021 The Regents of the University of
# California. All Rights Reserved. Permission to copy, modify, and distribute
# this software and its documentation for educational, research and non-profit
# purposes, without fee, and without a written agreement is hereby granted,
# provided that the above copyright notice, this paragraph and the following
# three paragraphs appear in all copies. Permission to make commercial use of
# this software may be obtained by contacting:
#
# Office of Innovation and Commercialization
# 9500 Gilman Drive, Mail Code 0910
# University of California
# La Jolla, CA 92093-0910
# (858) 534-5815
# [email protected]
#
# This software program and documentation are copyrighted by The Regents of the
# University of California. The software program and documentation are supplied
# "as is", without any accompanying services from The Regents. The Regents does
# not warrant that the operation of the program will be uninterrupted or
# error-free. The end-user understands that the program was developed for
# research purposes and is advised not to rely exclusively on the program for
# any reason.
#
# IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION,
# EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE. THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED
# HEREUNDER IS ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO
# OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
# MODIFICATIONS.

from libcpp.vector cimport vector
from cpython cimport array
import array
34 changes: 34 additions & 0 deletions src/pyavro_stardust/baseavro.pyx
@@ -1,3 +1,37 @@
# This software is Copyright (C) 2021 The Regents of the University of
# California. All Rights Reserved. Permission to copy, modify, and distribute
# this software and its documentation for educational, research and non-profit
# purposes, without fee, and without a written agreement is hereby granted,
# provided that the above copyright notice, this paragraph and the following
# three paragraphs appear in all copies. Permission to make commercial use of
# this software may be obtained by contacting:
#
# Office of Innovation and Commercialization
# 9500 Gilman Drive, Mail Code 0910
# University of California
# La Jolla, CA 92093-0910
# (858) 534-5815
# [email protected]
#
# This software program and documentation are copyrighted by The Regents of the
# University of California. The software program and documentation are supplied
# "as is", without any accompanying services from The Regents. The Regents does
# not warrant that the operation of the program will be uninterrupted or
# error-free. The end-user understands that the program was developed for
# research purposes and is advised not to rely exclusively on the program for
# any reason.
#
# IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION,
# EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE. THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED
# HEREUNDER IS ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO
# OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
# MODIFICATIONS.

# cython: language_level=3
from libc.string cimport memcpy
from libcpp.vector cimport vector
34 changes: 34 additions & 0 deletions src/pyavro_stardust/flowtuple3.pxd
@@ -1,3 +1,37 @@
# This software is Copyright (C) 2021 The Regents of the University of
# California. All Rights Reserved. Permission to copy, modify, and distribute
# this software and its documentation for educational, research and non-profit
# purposes, without fee, and without a written agreement is hereby granted,
# provided that the above copyright notice, this paragraph and the following
# three paragraphs appear in all copies. Permission to make commercial use of
# this software may be obtained by contacting:
#
# Office of Innovation and Commercialization
# 9500 Gilman Drive, Mail Code 0910
# University of California
# La Jolla, CA 92093-0910
# (858) 534-5815
# [email protected]
#
# This software program and documentation are copyrighted by The Regents of the
# University of California. The software program and documentation are supplied
# "as is", without any accompanying services from The Regents. The Regents does
# not warrant that the operation of the program will be uninterrupted or
# error-free. The end-user understands that the program was developed for
# research purposes and is advised not to rely exclusively on the program for
# any reason.
#
# IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION,
# EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE. THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED
# HEREUNDER IS ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO
# OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
# MODIFICATIONS.

import cython
from pyavro_stardust.baseavro cimport AvroRecord, AvroReader

34 changes: 34 additions & 0 deletions src/pyavro_stardust/flowtuple3.pyx
@@ -1,3 +1,37 @@
# This software is Copyright (C) 2021 The Regents of the University of
# California. All Rights Reserved. Permission to copy, modify, and distribute
# this software and its documentation for educational, research and non-profit
# purposes, without fee, and without a written agreement is hereby granted,
# provided that the above copyright notice, this paragraph and the following
# three paragraphs appear in all copies. Permission to make commercial use of
# this software may be obtained by contacting:
#
# Office of Innovation and Commercialization
# 9500 Gilman Drive, Mail Code 0910
# University of California
# La Jolla, CA 92093-0910
# (858) 534-5815
# [email protected]
#
# This software program and documentation are copyrighted by The Regents of the
# University of California. The software program and documentation are supplied
# "as is", without any accompanying services from The Regents. The Regents does
# not warrant that the operation of the program will be uninterrupted or
# error-free. The end-user understands that the program was developed for
# research purposes and is advised not to rely exclusively on the program for
# any reason.
#
# IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION,
# EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE. THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED
# HEREUNDER IS ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO
# OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
# MODIFICATIONS.

# cython: language_level=3
cimport cython
from pyavro_stardust.baseavro cimport AvroRecord, read_long, read_string, \
34 changes: 34 additions & 0 deletions src/pyavro_stardust/flowtuple4.pxd
@@ -1,3 +1,37 @@
# This software is Copyright (C) 2021 The Regents of the University of
# California. All Rights Reserved. Permission to copy, modify, and distribute
# this software and its documentation for educational, research and non-profit
# purposes, without fee, and without a written agreement is hereby granted,
# provided that the above copyright notice, this paragraph and the following
# three paragraphs appear in all copies. Permission to make commercial use of
# this software may be obtained by contacting:
#
# Office of Innovation and Commercialization
# 9500 Gilman Drive, Mail Code 0910
# University of California
# La Jolla, CA 92093-0910
# (858) 534-5815
# [email protected]
#
# This software program and documentation are copyrighted by The Regents of the
# University of California. The software program and documentation are supplied
# "as is", without any accompanying services from The Regents. The Regents does
# not warrant that the operation of the program will be uninterrupted or
# error-free. The end-user understands that the program was developed for
# research purposes and is advised not to rely exclusively on the program for
# any reason.
#
# IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION,
# EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE. THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED
# HEREUNDER IS ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO
# OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
# MODIFICATIONS.

import cython
from pyavro_stardust.baseavro cimport AvroRecord, AvroReader
