
Commit 5fddaf4

Add tutorial infrastructure, README

1 parent 045b97c

36 files changed: +161055 -31 lines

Dockerfile

+7-16
@@ -32,32 +32,21 @@ RUN sed 's/md5/trust/g' test.out > test2.out
 RUN mv test2.out /etc/postgresql/9.5/main/pg_hba.conf
 RUN rm test.out
 
-
+# install ground
 RUN wget https://github.com/ground-context/ground/releases/download/v0.1.2/ground-0.1.2.zip
 RUN unzip ground-0.1.2.zip
 RUN rm ground-0.1.2.zip
 RUN service postgresql start && sudo su -c "createuser ground -d -s" -s /bin/sh postgres && sudo su -c "createdb ground" -s /bin/sh postgres && sudo su -c "createuser root -d -s" -s /bin/sh postgres && sudo su -c "createuser $NB_USER -d -s" -s /bin/sh postgres
 RUN service postgresql start && cd ground-0.1.2/db && python2.7 postgres_setup.py ground ground
 
 # miscellaneous installs
-RUN apt-get install -y python3-pip
-RUN pip3 install pandas
-RUN pip3 install numpy
-RUN pip3 install requests
-
-RUN apt-get install -y python-pip
-RUN pip2 install psycopg2
-RUN pip2 install requests
-RUN pip2 install numpy
-RUN pip2 install pandas
-RUN pip2 install tweet_preprocessor
-RUN pip2 install scipy
+RUN apt-get install -y python3-pip python-pip
+RUN pip3 install pandas numpy requests
+RUN pip2 install psycopg2 requests numpy pandas tweet_preprocessor scipy HTMLParser
 RUN pip2 install -U scikit-learn
-RUN pip2 install HTMLParser
 
 # install git & tmux
-RUN apt-get install -y git
-RUN apt-get install -y tmux
+RUN apt-get install -y git tmux
 
 RUN git clone https://github.com/ground-context/client
 RUN cd client/python && python setup.py install
@@ -83,3 +72,5 @@ RUN chown -R $NB_USER /home/$NB_USER/ground-0.1.2/db
 RUN chown -R $NB_USER /home/$NB_USER/risecamp/
 
 CMD cd /home/$NB_USER && ./ground_start.sh
+
+ENV NB_GROUND_HOME /home/$NB_USER

notebooks/Ground-00.ipynb

+2 -2

@@ -11,13 +11,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"For background, please see the slides from this morning's talk on Ground. You can find them [here](). (***TODO: Add link to slides.***) This Jupyter notebook is running in a Docker container that already has a Ground instance as well as a Postgres server up and running. There isn't any more set up for us to do, so let's jump right in!\n",
+"For background, please see the slides from this morning's talk on Ground. You can find them [here](https://www.dropbox.com/s/auw9y2p8o0kdjis/%5B2017-09-08%5D%20Ground%20RISE%20Camp.key?dl=1). This Jupyter notebook is running in a Docker container that already has a Ground instance as well as a Postgres server up and running. There isn't any more set up for us to do, so let's jump right in!\n",
 "\n",
 "In this tutorial, we will first introduce the basic concepts of Ground by walking through an instrumented analytics scenario. We will use Ground to track git commits and some simple data. We will run the code in the git repo on the data and automatically publish some lineage information into Ground. We will use the information automatically sent to Ground to inspect the lineage and make sure everything happened as we expected.\n",
 "\n",
 "Next, we will look at managing machine learning models with Ground as a specific case study and explore how one might use Ground to debug unexpected problems efficiently and simply.\n",
 "\n",
-"Lastly, we will look at how to use the Ground Python client to build a simple Aboveground application that takes in a directory and automatically publishes data context about the files in that directory to Ground."
+"Last, we will look at how to use the Ground Python client to build a simple Aboveground application that takes in a directory and automatically publishes data context about the files in that directory to Ground."
 ]
 }
 ],

notebooks/Ground-01.ipynb

+19 -5

@@ -22,15 +22,17 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![The 2 Layers in Exercise 1](images/2-layer.png)"
+"<img src=\"images/2-layer.png\" width=400 alt=\"The 2 Layers in Exercise 1\">"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "\n",
-"To get started with Ground, we will use some of the \"Aboveground\" services that we have already developed. Aboveground services are tools that users use to interface with Ground at a higher semantic level than the simple node-and-edge-based API. We will first add the commit history of a git repository into Ground. Then, we'll tell Ground about some of the data that's contained in that repository. Finally, we will run the code contained in the repository. At the end of this section, we will see that Ground kept track of the code and the data; it will also track the execution of the code and understand the lineage between the original data and the cleaned data.\n",
+"To get started with Ground, we will use some \"Aboveground\" applications that were written for this tutorial. Aboveground applications allow users to interface with Ground at a higher semantic level than the general-purpose node-and-edge-based API.\n",
+"\n",
+"We will first add the commit history of a git repository into Ground. Then, we'll tell Ground about some of the data that's contained in that repository. Finally, we will run the code contained in the repository. At the end of this section, we will see that Ground kept track of the code and the data; it will also track the execution of the code and understand the lineage between the original data and the cleaned data.\n",
 "\n",
 "The cell below contains a call to the `ground_git_client`, which is an Aboveground app that interfaces with a Github repo. Run the cell below to capture and publish git information into Ground."
 ]
@@ -74,7 +76,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now that we have some code that Ground is aware of, we are going to want to do something with code. The particular repository that we populated has a simple **Python transformation script** that is \"Ground-aware\"\\*, as well a small amount of data for us to analyze in the form of a CSV file. You can find the repository online [here](https://github.com/ground-context/risecamp).\n",
+"Now that we have some code that Ground is aware of, we are going to want to do something with that code. The particular repository that we populated has a simple **Python transformation script** that is \"Ground-aware\"\\*, as well as a small amount of data for us to analyze in the form of a CSV file. You can find the repository online [here](https://github.com/ground-context/risecamp).\n",
 "\n",
 "Next, we need to make sure that Ground knows about the base dataset that we are going to use. Using another Aboveground tool that we have already developed, we can automatically tell Ground about this new dataset. This tool will populate Ground with some useful information about the file, such as the file type, the size of the file, and the path to the file.\n",
 "\n",
@@ -99,9 +101,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Great! Now Ground knows about our base dataset. We see here all the information that Ground has about `data.txt`. Notice the ID that was assigned by Ground to this dataset. We're going to use it again in a minute.\n",
+"Great! Now Ground knows about our base dataset. In the output of the previous cell we see all the information that Ground has about `data.txt`. Notice the ID that was assigned by Ground to this dataset. We're going to use it again in a minute.\n",
 "\n",
-"Finally, we're going to run the transformation script in the repository. Since the script that we are using is Ground-aware, it is going to generate lineage automatically information in Ground as a part of transforming the data. It will tell Ground that it has created a new dataset based on the old input dataset, and it will associate this lineage information with the latest version of the source code that was used for the transformation."
+"Finally, we're going to run the transformation script in the repository. Since the script that we are using is Ground-aware, it is going to generate lineage information automatically in Ground as a part of transforming the data. It will tell Ground that it has created a new dataset based on the old input dataset, and it will associate this lineage information with the latest version of the source code that was used for the transformation."
 ]
 },
 {
@@ -177,6 +179,18 @@
 "\n",
 "At this point you have seen how we can model application context in Ground as nodes and edges, and how behavioral context is captured through separate lineage edges."
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"# if you run into any problems, run this cell to reset Ground\n",
+"!bash ../reset_ground.sh >> /dev/null"
+]
 }
 ],
 "metadata": {

notebooks/Ground-02.ipynb

+21 -7

@@ -20,7 +20,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![The Full 3-Layer Data Model](images/3-layer.png)"
+"<img src=\"images/3-layer.png\" width=400 alt=\"The Full 3-Layer Data Model\"/>"
 ]
 },
 {
@@ -36,7 +36,7 @@
 "source": [
 "Stepping into the specifics of our exercise, we provide some Aboveground tools that call into the Ground API, and allow us to easily interact with and debug ML models. Throughout this exercise, we will be using these higher-level tools instead of interacting directly with Ground.\n",
 "\n",
-"We're first going to look at the working version of our Twitter modeling pipeline and understand the context that is captured in Ground. Next, we'll see that the unexpected change in the feed format that causes a significant degradation in prediction quality. We will then use Ground to understand what changed & why and fix the bug."
+"We're first going to look at the working version of our Twitter modeling pipeline and understand the context that is captured in Ground. Second, we'll see an unexpected change in the feed format that causes a significant degradation in prediction quality. We will then use Ground to understand what changed & why and fix the bug."
 ]
 },
 {
@@ -97,7 +97,7 @@
 "* `show_all_model_versions`: return the information about all the models Ground knows about.\n",
 "* `show_model_dependencies`: return the code & data dependencies for a particular model; if no **Ground version** is passed in, the default is assumed to be the most recent version of the model.\n",
 "* `show_data_schema`: prints out the schema of the dataset the model is trained on; if no **Ground version** is specified, prints out the most recent version of the schema.\n",
-"* `diff_deta_schemas`: takes in the **Ground versions** of two data schemas and prints out the differences between them."
+"* `diff_data_schemas`: takes in the **Ground versions** of two data schemas and prints out the differences between them."
 ]
 },
 {
@@ -144,7 +144,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Okay, so we have a baseline model, and in fact it does pretty well: 60% accuracy. To see why that's pretty good, let's compare to what a *random guess* model would achieve. There are about 170 different countries that we have tweets from in this dataset. If we were to guess _totally_ at random, we would expect $\frac{1}{170} = 0.6\%$ accuracy. Even without machine learning, we can do better. As you might have guessed, the plurality of tweets in the world comes from the United States. Thus, we can say that the default case would be to always guess the United States, which would give us about 35% accuracy. (That is the proportion of tweets that comes from the United States.)\n",
+"At this point we have a baseline model, and in fact it does pretty well: 60% accuracy. To see why that's pretty good, let's compare to what a *random guess* model would achieve. There are about 170 different countries that we have tweets from in this dataset. If we were to guess _totally_ at random, we would expect $\frac{1}{170} = 0.6\%$ accuracy. Even without machine learning, we can do better. As you might have guessed, the plurality of tweets in the world comes from the United States. Thus, we can say that the default case would be to always guess the United States, which would give us about 35% accuracy. (That is the proportion of tweets that comes from the United States.)\n",
 "\n",
 "Next, we're going to inspect the data context -- in particular, the lineage information -- to understand what exactly Ground has learned here. Follow the steps below."
 ]
@@ -157,7 +157,7 @@
 },
 "outputs": [],
 "source": [
-"# first, let's retrieve the most recent model version to see what Ground knows\n",
+"# first, let's retrieve the most recent model version since that's the source of our score\n",
 "# we should see that there's a model number associated with each new version of the model"
 ]
 },
@@ -232,9 +232,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Unfortunately, it seems to be true. Something has changed pretty significantly. We've gone from 60% accuracy down to 35%. Remember from above that this is no better than our intelligent random guess. It's clear that something external to our own pipeline has changed, causing the sudden drop from fairly accurate to no-better-than-random.\n",
+"Unfortunately, it seems to be true. Something has changed pretty significantly. We've gone from 60% accuracy down to 35%. Remember from above that this is no better than our intelligent random guess. We haven't changed anything in our pipeline, so something external must have changed and caused the sudden drop from fairly accurate to no-better-than-random.\n",
 "\n",
-"The question we have to answer next is what changed that caused our prediction quality to degrade. We can imagine a long list of things that might have changed. If you're stuck, we've written a description below that will help walk you through the investigative steps. \n",
+"The question we have to answer next is what changed that caused our prediction quality to degrade. We can imagine a long list of things that might have changed, but the context stored in Ground will make it fairly easy to narrow our focus down to the real culprit.\n",
+"\n",
+"Your task is to use the Aboveground helper functions listed above to identify the root causes of the degraded prediction quality and remedy it (them). If you're stuck, we've written a description below that will help walk you through the investigative steps.\n",
 "\n",
 "**HINTS**: \n",
 "\n",
@@ -346,6 +348,18 @@
 "source": [
 "The full solution is provided [here](https://github.com/ground-context/tutorial/blob/master/solutions/Ground-02.ipynb)."
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"# if you run into any problems, run this cell to reset Ground\n",
+"!bash ../reset_ground.sh >> /dev/null"
+]
 }
 ],
 "metadata": {

notebooks/Ground-03.ipynb

+1 -1

@@ -36,7 +36,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Before we start writing our own Aboveground tool, let's first dig deeper into the workings of the `ground_file_client` from Exercise 1. Let's begin by opening the [`ground_file_client.py`](http://localhost:8888/edit/aboveground/ground_file_client.py) file in another tab. After walking through the comments there, return here to continue with the exercises."
+"Before we start writing our own Aboveground tool, let's first dig deeper into the workings of the `ground_file_client` from Exercise 1. Let's begin by opening the [`ground_file_client.py`](http://localhost:8888/edit/ground/aboveground/ground_file_client.py) file in another tab. After walking through the comments there, return here to continue with the exercises."
 ]
 },
 {

README.md

+5

@@ -0,0 +1,5 @@
+# Ground Tutorials
+
+This repository contains the infrastructure and Dockerfile for the introductory Ground tutorials. The first tutorial walks users through a simple instrumented analytics scenario; the second tutorial walks users through debugging an unexpected change in a machine learning pipeline. The last exercise shows users how to build their own Aboveground application.
+
+A compiled version of the Docker image is on Docker Hub. You can pull it by running `docker pull groundcontext/tutorial`. To start the image, run `docker run --rm -it -p 8888:8888 groundcontext/tutorial`.

aboveground/__init__.py

Whitespace-only changes.

aboveground/ground_file_client.py

+61

@@ -0,0 +1,61 @@
+from ground.client import GroundClient
+import os
+
+gc = GroundClient()
+
+def add_file(file_path):
+    # get file system information
+    stat = os.stat(file_path)
+
+    # get the file size
+    size = stat[6]
+
+    # get the file creation time
+    ctime = stat[-1]
+
+    # add the relevant information to the tags as JSON
+    tags = {
+        'size': {
+            'key': 'size',
+            'value': size,
+            'type': 'integer'
+        },
+        'ctime': {
+            'key': 'ctime',
+            'value': ctime,
+            'type': 'integer'
+        },
+        'path': {
+            'key': 'path',
+            'value': file_path,
+            'type': 'string'
+        }
+    }
+
+    # either retrieve an existing structure or create a new one
+    sv_id = create_structure()
+
+    # get the name of the file
+    file_path = file_path.split('/')[-1]
+
+    # create a new node
+    node_id = gc.createNode(file_path, file_path, {})['id']
+
+    # create and return the node version that records this file's metadata
+    node_version = gc.createNodeVersion(node_id, tags=tags, structure_version_id=sv_id)
+    return node_version
+
+
+def create_structure():
+    # attempt to retrieve the structure
+    struct = gc.getStructure("dataset")
+
+    if struct is None:
+        # if it does not exist, create a new structure and structure version
+        structure_id = gc.createStructure("dataset", "dataset", {})['id']
+        return gc.createStructureVersion(structure_id, {"size": "integer", "ctime": "integer", "path": "string"})['id']
+    else:
+        # if it already exists, return the most recent version of it
+        sv_id = gc.getStructureLatestVersions("dataset")[0]
+        return gc.getStructureVersion(sv_id)
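The `add_file` helper above reads file metadata from `os.stat` by positional index: `stat[6]` is `st_size` and `stat[-1]` is `st_ctime` (which on Unix is the inode change time, not a true creation time, despite the comment). A small self-contained sketch of the same tag construction using the named `os.stat_result` attributes, with no Ground client involved:

```python
import os
import tempfile

def file_tags(file_path):
    # build the same size/ctime/path tags that add_file publishes,
    # using named attributes instead of positional indexing
    stat = os.stat(file_path)
    return {
        'size': {'key': 'size', 'value': stat.st_size, 'type': 'integer'},
        'ctime': {'key': 'ctime', 'value': int(stat.st_ctime), 'type': 'integer'},
        'path': {'key': 'path', 'value': file_path, 'type': 'string'},
    }

# demonstrate on a throwaway 5-byte file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'hello')
    path = f.name

tags = file_tags(path)
os.remove(path)
```

The resulting `tags` dict is the shape that `add_file` passes to `createNodeVersion`.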
