
Commit 5fddaf4

Add tutorial infrastructure, README

1 parent 045b97c

36 files changed: +161055 -31 lines

Dockerfile

+7-16
@@ -32,32 +32,21 @@ RUN sed 's/md5/trust/g' test.out > test2.out
 RUN mv test2.out /etc/postgresql/9.5/main/pg_hba.conf
 RUN rm test.out
 
-
+# install ground
 RUN wget https://github.com/ground-context/ground/releases/download/v0.1.2/ground-0.1.2.zip
 RUN unzip ground-0.1.2.zip
 RUN rm ground-0.1.2.zip
 RUN service postgresql start && sudo su -c "createuser ground -d -s" -s /bin/sh postgres && sudo su -c "createdb ground" -s /bin/sh postgres && sudo su -c "createuser root -d -s" -s /bin/sh postgres && sudo su -c "createuser $NB_USER -d -s" -s /bin/sh postgres
 RUN service postgresql start && cd ground-0.1.2/db && python2.7 postgres_setup.py ground ground
 
 # miscellaneous installs
-RUN apt-get install -y python3-pip
-RUN pip3 install pandas
-RUN pip3 install numpy
-RUN pip3 install requests
-
-RUN apt-get install -y python-pip
-RUN pip2 install psycopg2
-RUN pip2 install requests
-RUN pip2 install numpy
-RUN pip2 install pandas
-RUN pip2 install tweet_preprocessor
-RUN pip2 install scipy
+RUN apt-get install -y python3-pip python-pip
+RUN pip3 install pandas numpy requests
+RUN pip2 install psycopg2 requests numpy pandas tweet_preprocessor scipy HTMLParser
 RUN pip2 install -U scikit-learn
-RUN pip2 install HTMLParser
 
 # install git & tmux
-RUN apt-get install -y git
-RUN apt-get install -y tmux
+RUN apt-get install -y git tmux
 
 RUN git clone https://github.com/ground-context/client
 RUN cd client/python && python setup.py install
@@ -83,3 +72,5 @@ RUN chown -R $NB_USER /home/$NB_USER/ground-0.1.2/db
 RUN chown -R $NB_USER /home/$NB_USER/risecamp/
 
 CMD cd /home/$NB_USER && ./ground_start.sh
+
+ENV NB_GROUND_HOME /home/$NB_USER

notebooks/Ground-00.ipynb

+2 -2

@@ -11,13 +11,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"For background, please see the slides from this morning's talk on Ground. You can find them [here](). (***TODO: Add link to slides.***) This Jupyter notebook is running in a Docker container that already has a Ground instance as well as a Postgres server up and running. There isn't any more set up for us to do, so let's jump right in!\n",
+"For background, please see the slides from this morning's talk on Ground. You can find them [here](https://www.dropbox.com/s/auw9y2p8o0kdjis/%5B2017-09-08%5D%20Ground%20RISE%20Camp.key?dl=1). This Jupyter notebook is running in a Docker container that already has a Ground instance as well as a Postgres server up and running. There isn't any more set up for us to do, so let's jump right in!\n",
 "\n",
 "In this tutorial, we will first introduce the basic concepts of Ground by walking through an instrumented analytics scenario. We will use Ground to track git commits and some simple data. We will run the code in the git repo on the data and automatically publish some lineage information into Ground. We will use the information automatically sent to Ground to inspect the lineage and make sure everything happened as we expected.\n",
 "\n",
 "Next, we will look at managing machine learning models with Ground as a specific case study and explore how one might use Ground to debug unexpected problems efficiently and simply.\n",
 "\n",
-"Lastly, we will look at how to use the Ground Python client to build a simple Aboveground application that takes in a directory and automatically publishes data context about the files in that directory to Ground."
+"Last, we will look at how to use the Ground Python client to build a simple Aboveground application that takes in a directory and automatically publishes data context about the files in that directory to Ground."
 ]
 }
 ],

notebooks/Ground-01.ipynb

+19 -5

@@ -22,15 +22,17 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![The 2 Layers in Exercise 1](images/2-layer.png)"
+"<img src=\"images/2-layer.png\" width=400 alt=\"The 2 Layers in Exercise 1\">"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "\n",
-"To get started with Ground, we will use some of the \"Aboveground\" services that we have already developed. Aboveground services are tools that users use to interface with Ground at a higher semantic level than the simple node-and-edge-based API. We will first add the commit history of a git repository into Ground. Then, we'll tell Ground about some of the data that's contained in that repository. Finally, we will run the code contained in the repository. At the end of this section, we will see that Ground kept track of the code and the data; it will also track the execution of the code and understand the lineage between the original data and the cleaned data.\n",
+"To get started with Ground, we will use some \"Aboveground\" applications that were written for this tutorial. Aboveground applications allow users to interface with Ground at a higher semantic level than the general-purpose node-and-edge-based API.\n",
+"\n",
+"We will first add the commit history of a git repository into Ground. Then, we'll tell Ground about some of the data that's contained in that repository. Finally, we will run the code contained in the repository. At the end of this section, we will see that Ground kept track of the code and the data; it will also track the execution of the code and understand the lineage between the original data and the cleaned data.\n",
 "\n",
 "The cell below contains a call to the `ground_git_client`, which is an Aboveground app that interfaces with a Github repo. Run the cell below to capture and publish git information into Ground."
 ]
@@ -74,7 +76,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now that we have some code that Ground is aware of, we are going to want to do something with code. The particular repository that we populated has a simple **Python transformation script** that is \"Ground-aware\"\\*, as well a small amount of data for us to analyze in the form of a CSV file. You can find the repository online [here](https://github.com/ground-context/risecamp).\n",
+"Now that we have some code that Ground is aware of, we are going to want to do something with that code. The particular repository that we populated has a simple **Python transformation script** that is \"Ground-aware\"\\*, as well as a small amount of data for us to analyze in the form of a CSV file. You can find the repository online [here](https://github.com/ground-context/risecamp).\n",
 "\n",
 "Next, we need to make sure that Ground knows about the base dataset that we are going to use. Using another Aboveground tool that we have already developed, we can automatically tell Ground about this new dataset. This tool will populate Ground with some useful information about the file, such as the file type, the size of the file, and the path to the file.\n",
 "\n",
@@ -99,9 +101,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Great! Now Ground knows about our base dataset. We see here all the information that Ground has about `data.txt`. Notice the ID that was assigned by Ground to this dataset. We're going to use it again in a minute.\n",
+"Great! Now Ground knows about our base dataset. In the output of the previous cell we see all the information that Ground has about `data.txt`. Notice the ID that was assigned by Ground to this dataset. We're going to use it again in a minute.\n",
 "\n",
-"Finally, we're going to run the transformation script in the repository. Since the script that we are using is Ground-aware, it is going to generate lineage automatically information in Ground as a part of transforming the data. It will tell Ground that it has created a new dataset based on the old input dataset, and it will associate this lineage information with the latest version of the source code that was used for the transformation."
+"Finally, we're going to run the transformation script in the repository. Since the script that we are using is Ground-aware, it is going to generate lineage information automatically in Ground as a part of transforming the data. It will tell Ground that it has created a new dataset based on the old input dataset, and it will associate this lineage information with the latest version of the source code that was used for the transformation."
 ]
 },
 {
@@ -177,6 +179,18 @@
 "\n",
 "At this point you have seen how we can model application context in Ground as nodes and edges, and how behavioral context is captured through separate lineage edges."
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"# if you run into any problems, run this cell to reset Ground\n",
+"!bash ../reset_ground.sh >> /dev/null"
+]
 }
 ],
 "metadata": {

notebooks/Ground-02.ipynb

+21 -7

@@ -20,7 +20,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![The Full 3-Layer Data Model](images/3-layer.png)"
+"<img src=\"images/3-layer.png\" width=400 alt=\"The Full 3-Layer Data Model\"/>"
 ]
 },
 {
@@ -36,7 +36,7 @@
 "source": [
 "Stepping into the specifics of our exercise, we provide some Aboveground tools that call into the Ground API, and allow us to easily interact with and debug ML models. Throughout this exercise, we will be using these higher-level tools instead of interacting directly with Ground.\n",
 "\n",
-"We're first going to look at the working version of our Twitter modeling pipeline and understand the context that is captured in Ground. Next, we'll see that the unexpected change in the feed format that causes a significant degradation in prediction quality. We will then use Ground to understand what changed & why and fix the bug."
+"We're first going to look at the working version of our Twitter modeling pipeline and understand the context that is captured in Ground. Second, we'll see an unexpected change in the feed format that causes a significant degradation in prediction quality. We will then use Ground to understand what changed & why and fix the bug."
 ]
 },
 {
@@ -97,7 +97,7 @@
 "* `show_all_model_versions`: return the information about all the models Ground knows about.\n",
 "* `show_model_dependencies`: return the code & data dependencies for a particular model; if no **Ground version** is passed in, the default is assumed to be the most recent version of the model.\n",
 "* `show_data_schema`: prints out the schema of the dataset the model is trained on; if no **Ground version** is specified, prints out the most recent version of the schema.\n",
-"* `diff_deta_schemas`: takes in the **Ground versions** of two data schemas and prints out the differences between them."
+"* `diff_data_schemas`: takes in the **Ground versions** of two data schemas and prints out the differences between them."
 ]
 },
 {
@@ -144,7 +144,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Okay, so we have a baseline model, and in fact it does pretty well: 60% accuracy. To see why that's pretty good, let's compare to what a *random guess* model would achieve. There are about 170 different countries that we have tweets from in this dataset. If we were to guess _totally_ at random, we would expect $\frac{1}{170} = 0.6\%$ accuracy. Even without machine learning, we can do better. As you might have guessed, the plurality of tweets in the world comes from the United States. Thus, we can say that the default case would be to always guess the United States, which would give us about 35% accuracy. (That is the proportion of tweets that comes from the United States.)\n",
+"At this point we have a baseline model, and in fact it does pretty well: 60% accuracy. To see why that's pretty good, let's compare to what a *random guess* model would achieve. There are about 170 different countries that we have tweets from in this dataset. If we were to guess _totally_ at random, we would expect $\frac{1}{170} = 0.6\%$ accuracy. Even without machine learning, we can do better. As you might have guessed, the plurality of tweets in the world comes from the United States. Thus, we can say that the default case would be to always guess the United States, which would give us about 35% accuracy. (That is the proportion of tweets that comes from the United States.)\n",
 "\n",
 "Next, we're going to inspect the data context -- in particular, the lineage information -- to understand what exactly Ground has learned here. Follow the steps below."
 ]
@@ -157,7 +157,7 @@
 },
 "outputs": [],
 "source": [
-"# first, let's retrieve the most recent model version to see what Ground knows\n",
+"# first, let's retrieve the most recent model version since that's the source of our score\n",
 "# we should see that there's a model number associated with each new version of the model"
 ]
 },
@@ -232,9 +232,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Unfortunately, it seems to be true. Something has changed pretty significantly. We've gone from 60% accuracy down to 35%. Remember from above that this is no better than our intelligent random guess. It's clear that something external to our own pipeline has changed, causing the sudden drop from fairly accurate to no-better-than-random.\n",
+"Unfortunately, it seems to be true. Something has changed pretty significantly. We've gone from 60% accuracy down to 35%. Remember from above that this is no better than our intelligent random guess. We haven't changed anything in our pipeline, so something external must have changed and caused the sudden drop from fairly accurate to no-better-than-random.\n",
 "\n",
-"The question we have to answer next is what changed that caused our prediction quality to degrade. We can imagine a long list of things that might have changed. If you're stuck, we've written a description below that will help walk you through the investigative steps. \n",
+"The question we have to answer next is what changed that caused our prediction quality to degrade. We can imagine a long list of things that might have changed, but the context stored in Ground will make it fairly easy to narrow our focus down to the real culprit.\n",
+"\n",
+"Your task is to use the Aboveground helper functions listed above to identify the root causes of the degraded prediction quality and remedy it (them). If you're stuck, we've written a description below that will help walk you through the investigative steps.\n",
 "\n",
 "**HINTS**: \n",
 "\n",
@@ -346,6 +348,18 @@
 "source": [
 "The full solution is provided [here](https://github.com/ground-context/tutorial/blob/master/solutions/Ground-02.ipynb)."
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"# if you run into any problems, run this cell to reset Ground\n",
+"!bash ../reset_ground.sh >> /dev/null"
+]
 }
 ],
 "metadata": {

notebooks/Ground-03.ipynb

+1 -1

@@ -36,7 +36,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Before we start writing our own Aboveground tool, let's first dig deeper into the workings of the `ground_file_client` from Exercise 1. Let's begin by opening the [`ground_file_client.py`](http://localhost:8888/edit/aboveground/ground_file_client.py) file in another tab. After walking through the comments there, return here to continue with the exercises."
+"Before we start writing our own Aboveground tool, let's first dig deeper into the workings of the `ground_file_client` from Exercise 1. Let's begin by opening the [`ground_file_client.py`](http://localhost:8888/edit/ground/aboveground/ground_file_client.py) file in another tab. After walking through the comments there, return here to continue with the exercises."
 ]
 },
 {

README.md

+5

@@ -0,0 +1,5 @@
+# Ground Tutorials
+
+This repository contains the infrastructure and Dockerfile for the introductory Ground tutorials. The first tutorial walks users through a simple instrumented analytics scenario; the second tutorial walks users through debugging an unexpected change in a machine learning pipeline. The last exercise shows users how to build their own Aboveground application.
+
+A compiled version of the Docker image is on Docker Hub. You can pull it by running `docker pull groundcontext/tutorial`. To start the image, run `docker run --rm -it -p 8888:8888 groundcontext/tutorial`.

aboveground/__init__.py

Whitespace-only changes.

aboveground/ground_file_client.py

+61

@@ -0,0 +1,61 @@
+from ground.client import GroundClient
+import os
+
+gc = GroundClient()
+
+def add_file(file_path):
+    # get file system information
+    stat = os.stat(file_path)
+
+    # get the file size
+    size = stat[6]
+
+    # get the file creation time
+    ctime = stat[-1]
+
+    # add the relevant information to the tags as JSON
+    tags = {
+        'size': {
+            'key': 'size',
+            'value': size,
+            'type': 'integer'
+        },
+        'ctime': {
+            'key': 'ctime',
+            'value': ctime,
+            'type': 'integer'
+        },
+        'path': {
+            'key': 'path',
+            'value': file_path,
+            'type': 'string'
+        }
+    }
+
+    # either retrieve an existing structure or create a new one
+    sv_id = create_structure()
+
+    # get the name of the file
+    file_path = file_path.split('/')[-1]
+
+    # create a new node
+    node_id = gc.createNode(file_path, file_path, {})['id']
+
+    # create and return the node version that records this file's metadata
+    node_version = gc.createNodeVersion(node_id, tags=tags, structure_version_id=sv_id)
+    return node_version
+
+
+def create_structure():
+    # attempt to retrieve the structure
+    struct = gc.getStructure("dataset")
+
+    if struct is None:
+        # if it does not exist, create a new structure and structure version
+        structure_id = gc.createStructure("dataset", "dataset", {})['id']
+        return gc.createStructureVersion(structure_id, {"size": "integer", "ctime": "integer", "path": "string"})['id']
+    else:
+        # if it already exists, return the most recent version of it
+        sv_id = gc.getStructureLatestVersions("dataset")[0]
+        return gc.getStructureVersion(sv_id)
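The `add_file` helper above reads file metadata from `os.stat` by positional index: `stat[6]` is `st_size` and `stat[-1]` is `st_ctime` (which on Unix is the inode change time, not a true creation time, despite the comment). A small self-contained sketch of the same tag construction using the named `os.stat_result` attributes, with no Ground client involved:

```python
import os
import tempfile

def file_tags(file_path):
    # build the same size/ctime/path tags that add_file publishes,
    # using named attributes instead of positional indexing
    stat = os.stat(file_path)
    return {
        'size': {'key': 'size', 'value': stat.st_size, 'type': 'integer'},
        'ctime': {'key': 'ctime', 'value': int(stat.st_ctime), 'type': 'integer'},
        'path': {'key': 'path', 'value': file_path, 'type': 'string'},
    }

# demonstrate on a throwaway 5-byte file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'hello')
    path = f.name

tags = file_tags(path)
os.remove(path)
```

The resulting `tags` dict is the shape that `add_file` passes to `createNodeVersion`.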
