Improve README

CAIDA · Apr 22, 2021 · 092765b
1 parent 439f952
commit 092765b
Showing 1 changed file with 73 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -4,17 +4,87 @@ This package provides an interface for fast processing of the STARDUST avro
 data files using python.
 
 Data formats that are currently supported are:
- * flowtuple data
- * RSDOS attack data --- coming soon
+ * flowtuple v3 data
+ * flowtuple v4 data
+ * RSDOS attack data
 
 ### Installation
 
+Dependencies:
+  * pywandio -- https://github.com/CAIDA/pywandio (note: STARDUST users should
+    already have the python[3]-pywandio package installed on their VM)
+  * cython
+
+
 ```
-make install
+make && make install
 ```
 
 or
 
 ```
 USE_CYTHON=1 pip install --user .
 ```
+
+### Examples
+Simple example programs that demonstrate the API for each of the
+supported formats can be found in the `examples/` directory.
+
+
+### General Usage
+
+I strongly recommend having the code for one of the examples available
+when you read this section, as it should help clarify much of what is
+being explained here.
+
+---
+
+Step 1: Create an instance of a reader for the data format that you wish to
+read, passing in a valid wandio path to the input file (either a swift URI
+or a path to a file) as a parameter.
+
+Examples of valid reader classes are: `AvroFlowtuple3Reader`,
+`AvroFlowtuple4Reader`, and `AvroRsdosReader`.
+
+Step 2: Invoke the `start()` method for the reader instance.
+
+Step 3: Define a callback method that you wish to be invoked for each Avro
+record that has been read from your input file.
+
+The method must take two arguments: the record itself and a `userarg`
+parameter. The `userarg` parameter provides a way for you to pass in
+additional arguments to the callback method from outside of the scope
+of the callback method.
+
+There are some common methods that are available for any Avro record object:
+ * `asDict()` -- returns all fields in the Avro record as a python dictionary
+   (key = field name, value = field value).
+ * `getNumeric(attributeId)` -- returns the value for a specific field that
+   has a numeric value (e.g. IP address, port, counter, timestamp, etc.)
+ * `getString(attributeId)` -- returns the value for a specific field that has
+   a string value (e.g. a geo-location tag)
+ * `getNumericArray(attributeId)` -- returns a list of values that have been
+   stored in the record as an array of numbers
+
+The attributeIds for each data format are listed on the pyavro-stardust wiki
+at https://github.com/CAIDA/pyavro-stardust/wiki/Supported-Data-Formats
+
+In terms of efficiency, I would recommend using `asDict()` if your callback
+function needs to access more than 3 different fields in the record, as the
+function call overhead of calling methods like `getNumeric()` multiple times
+will quickly add up to exceed the cost of calling `asDict()` once and having
+every value available.
+
+Step 4: Invoke the `perAvroRecord()` method on your reader instance, passing
+in your callback function name as the first argument. If your callback is
+going to make use of the `userarg` parameter, then the intended value for
+`userarg` should be passed in as the optional second argument.
+
+This function call will only complete once your callback has been applied to
+every individual Avro record in the input file.
+
+Step 5: Invoke the `close()` method on your reader instance.
+
+Step 6: Do any final post-processing or output writing that your analysis
+requires.
+
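For readers of this diff, the six steps in the new "General Usage" section can be sketched end to end as follows. This is a minimal sketch, not code from the repository: `StubReader` and `StubRecord` are hypothetical stand-ins that only mimic the `start()`, `perAvroRecord()`, `close()` and `asDict()` methods named in the README, so the flow can be run without pyavro-stardust or an Avro file. The field names `dst_port` and `packet_cnt`, the sample values, and the path `example.avro` are likewise made up for illustration. With the real package, a reader such as `AvroFlowtuple3Reader` would presumably take the place of `StubReader`.

```python
# Hypothetical stand-ins mimicking the interface described in the README.
# They are NOT part of pyavro-stardust; they exist so this sketch is runnable.


class StubRecord:
    """Mimics an Avro record object exposing asDict()."""

    def __init__(self, fields):
        self._fields = fields

    def asDict(self):
        # Return every field as a dictionary (key = field name).
        return dict(self._fields)


class StubReader:
    """Mimics a reader exposing start(), perAvroRecord() and close()."""

    def __init__(self, path):
        self.path = path
        self._records = []

    def start(self):
        # A real reader would open `path`; here we load made-up records.
        self._records = [
            StubRecord({"dst_port": 23, "packet_cnt": 4}),
            StubRecord({"dst_port": 80, "packet_cnt": 1}),
            StubRecord({"dst_port": 23, "packet_cnt": 3}),
        ]

    def perAvroRecord(self, callback, userarg=None):
        # Apply the callback to every record before returning.
        for record in self._records:
            callback(record, userarg)

    def close(self):
        self._records = []


def count_packets_per_port(record, userarg):
    """Step 3: callback taking the record and the caller-supplied userarg."""
    fields = record.asDict()
    port = fields["dst_port"]
    # userarg is a dict owned by the caller; accumulate results into it.
    userarg[port] = userarg.get(port, 0) + fields["packet_cnt"]


def run(path):
    totals = {}
    reader = StubReader(path)                             # Step 1
    reader.start()                                        # Step 2
    reader.perAvroRecord(count_packets_per_port, totals)  # Step 4
    reader.close()                                        # Step 5
    return totals                                         # Step 6: post-process


if __name__ == "__main__":
    for port, packets in sorted(run("example.avro").items()):
        print(port, packets)
```

Passing a mutable object such as a dictionary as `userarg` lets the callback accumulate results that are still available to the caller once `perAvroRecord()` has returned.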