<!DOCTYPE html>
<html lang="en">
<head>
        <title>ipython and pandas are whoa!</title>
        <meta charset="utf-8" />
        <link rel="stylesheet" href="./theme/css/main.css" type="text/css" />
        <link href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" rel="stylesheet">


        <!--[if IE]>
                <script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]-->

        <!--[if lte IE 7]>
                <link rel="stylesheet" type="text/css" media="all" href="./css/ie.css"/>
                <script src="./js/IE8.js" type="text/javascript"></script><![endif]-->

        <!--[if lt IE 7]>
                <link rel="stylesheet" type="text/css" media="all" href="./css/ie6.css"/><![endif]-->

</head>

<body id="index" class="home">
        <header id="banner" class="body">
                <h1><a href=".">Amy Hanlon </a></h1>
                <nav><ul>
                    <li><a href=".">Home</a></li>
                    <li><a href="./pages/about.html">About</a></li>
                    <li><a href="./pages/contact.html">Contact</a></li>
                    <li><a href="./pages/projects.html">Projects</a></li>
                    <li><a href="./pages/talks.html">Talks</a></li>
                </ul></nav>

        </header><!-- /#banner -->
<section id="content" class="body">
  <article>
    <header>
      <h1 class="entry-title">
        <a href="ipython-and-pandas-are-whoa.html" rel="bookmark"
           title="Permalink to ipython and pandas are whoa!">ipython and pandas are whoa!</a></h1>
    </header>

    <div class="entry-content">
<footer class="post-info">
        <abbr class="published" title="2014-02-14T15:52:00">
                February 14 2014
        </abbr>

<p><!-- In <a href="./category/projects.html">Projects</a>  --></p>


</footer><!-- /.post-info -->      <p>In my first five months of using Python, I would create .py files and
then execute them in the terminal, like the loyal follower of <a href="http://learnpythonthehardway.org/book/">LPTHW</a>
that I am. While it's certainly necessary to be able to write Python
scripts and execute them from the terminal, I've learned that this
workflow isn't the best way to explore data in Python.</p>
<p>I initially converted to Python from R, and I have since missed the
interactivity of working within RStudio (you don't have to re-run your
entire script every time you make a change and want to view the results)
and how quickly R gets you to your data with data frames
built straight from text files. Enter: <a href="http://ipython.org/">ipython</a> and <a href="http://pandas.pydata.org/">pandas</a>. (Go
to their respective sites for installation instructions.)</p>
<p>Julia Evans has a fantastic <a href="https://github.com/jvns/pandas-cookbook">tutorial</a> on how to use pandas in an
ipython notebook. Here, I'd like to explore the differences between
doing data analysis using ipython notebook and pandas versus writing .py
scripts using mainly standard Python packages and executing via the
terminal.</p>
<p>Let's use the classic iris dataset for some simple analysis. First we
need to do some setup. Create a directory for our project (mine lives
in the Projects directory inside my home directory).
In the terminal:</p>
<div class="highlight"><pre><span class="nv">$ </span>mkdir ~/Projects/ipython-pandas-whoa
<span class="nv">$ </span><span class="nb">cd</span> ~/Projects/ipython-pandas-whoa
</pre></div>


<p>Then create a Python script and open it with a text editor (I use
Sublime Text 2, which I've <a href="http://www.sublimetext.com/docs/2/osx_command_line.html">configured</a> to open files with the command
<code>subl</code>).</p>
<div class="highlight"><pre><span class="nv">$ </span>touch barbaric-script.py
<span class="nv">$ </span>subl barbaric-script.py
</pre></div>


<p>This should open up a blank .py file in Sublime. If you don't have
<code>subl</code> set up, you can simply open up the blank barbaric-script.py file
we created with the <code>touch</code> command in your text editor of choice.</p>
<p>In the barbaric-script.py file, let's import the necessary packages.
We'll be using a couple of standard Python packages plus matplotlib.pyplot.
If you don't have matplotlib installed, you can find installation
instructions <a href="http://matplotlib.org/users/installing.html">here</a>.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">csv</span>
<span class="kn">import</span> <span class="nn">pprint</span>
<span class="kn">import</span> <span class="nn">urllib2</span>

<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
</pre></div>


<p>To get the iris data, we'll use Python's built-in urllib2 and csv
packages and then do some basic formatting. Let's process the dataset as
a list of dictionaries.</p>
<div class="highlight"><pre><span class="n">url</span> <span class="o">=</span> <span class="s">&quot;http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.data&quot;</span>

<span class="c"># Open data from URL as a file-like object</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>

<span class="c"># Create an empty list to store our data</span>
<span class="n">parsed_data</span> <span class="o">=</span> <span class="p">[]</span>

<span class="c"># Read file</span>
<span class="n">raw_data</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">reader</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

<span class="c"># Define our headers since the file doesn&#39;t contain explicit headers</span>
<span class="c"># I found these headers from looking at the documentation at</span>
<span class="c"># http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.names</span>
<span class="n">headers</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;Sepal Length&#39;</span><span class="p">,</span> <span class="s">&#39;Sepal Width&#39;</span><span class="p">,</span> <span class="s">&#39;Petal Length&#39;</span><span class="p">,</span> <span class="s">&#39;Petal Width&#39;</span><span class="p">,</span> <span class="s">&#39;Class&#39;</span>
    <span class="p">]</span>

<span class="c"># Iterate through the rows in the file, and create a dictionary for each row</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">raw_data</span><span class="p">:</span>

    <span class="c"># Each dictionary maps header -&gt; value for this row</span>
    <span class="n">parsed_data</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">headers</span><span class="p">,</span> <span class="n">row</span><span class="p">)))</span>

<span class="c"># Delete the last row of parsed_data because it&#39;s blank</span>
<span class="n">parsed_data</span> <span class="o">=</span> <span class="n">parsed_data</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>

<span class="c"># Let&#39;s see what parsed_data looks like</span>
<span class="n">pprint</span><span class="o">.</span><span class="n">pprint</span><span class="p">(</span><span class="n">parsed_data</span><span class="p">[:</span><span class="mi">3</span><span class="p">])</span>
</pre></div>


<p>What you should see when you execute barbaric-script.py:</p>
<div class="highlight"><pre><span class="gp">$</span> python barbaric-script.py
<span class="go">[{&#39;Class&#39;: &#39;Iris-setosa&#39;,</span>
<span class="go">  &#39;Petal Length&#39;: &#39;1.4&#39;,</span>
<span class="go">  &#39;Petal Width&#39;: &#39;0.2&#39;,</span>
<span class="go">  &#39;Sepal Length&#39;: &#39;5.1&#39;,</span>
<span class="go">  &#39;Sepal Width&#39;: &#39;3.5&#39;},</span>
<span class="go"> {&#39;Class&#39;: &#39;Iris-setosa&#39;,</span>
<span class="go">  &#39;Petal Length&#39;: &#39;1.4&#39;,</span>
<span class="go">  &#39;Petal Width&#39;: &#39;0.2&#39;,</span>
<span class="go">  &#39;Sepal Length&#39;: &#39;4.9&#39;,</span>
<span class="go">  &#39;Sepal Width&#39;: &#39;3.0&#39;},</span>
<span class="go"> {&#39;Class&#39;: &#39;Iris-setosa&#39;,</span>
<span class="go">  &#39;Petal Length&#39;: &#39;1.3&#39;,</span>
<span class="go">  &#39;Petal Width&#39;: &#39;0.2&#39;,</span>
<span class="go">  &#39;Sepal Length&#39;: &#39;4.7&#39;,</span>
<span class="go">  &#39;Sepal Width&#39;: &#39;3.2&#39;}]</span>
</pre></div>


<p>Great! So we have our data formatted nicely. We have a list with
elements corresponding to each row in the file, and the elements in the
list are dictionaries. The keys in the dictionaries are the column
headers, and the values are the values associated with the respective
column header for that particular row. For example, the first
row/observation (printed as the first dictionary out of the three listed
above) is in the class <code>Iris-setosa</code>, has a Petal Length of <code>1.4</code>, a Petal
Width of <code>0.2</code>, a Sepal Length of <code>5.1</code>, and a Sepal Width of <code>3.5</code>. Let's do
some plotting to explore our data. We can make a histogram! Add the
following lines to your barbaric-script.py file:</p>
<div class="highlight"><pre><span class="c"># Let&#39;s create a list of the Sepal Lengths</span>
<span class="c"># I&#39;m calling float on the entries because otherwise they&#39;re stored as strings</span>
<span class="n">sepal_lengths</span> <span class="o">=</span> <span class="p">[</span><span class="nb">float</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">&#39;Sepal Length&#39;</span><span class="p">])</span>
        <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">parsed_data</span><span class="p">]</span>

<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">sepal_lengths</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">&#39;#348ABD&#39;</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">&#39;none&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">&#39;Sepal Length&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">&#39;Count&#39;</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>


<p>Let's try executing our script again.</p>
<div class="highlight"><pre><span class="nv">$ </span>python barbaric-script.py
</pre></div>


<p>A histogram should pop up in a separate window:</p>
<p><img alt="alt text" src="https://raw2.github.com/amygdalama/amygdalama.github.io/master/images/barbaric.png" /></p>
<p>That's a lot of code (18 lines)! And we didn't even do anything
complicated - just parsed our data and plotted a histogram. Also, notice
that each time you tried to fix a bug or add a feature (like plotting
the histogram), you had to execute the entire script again, rather than
just the piece where you fixed the bug. That doesn't matter too much
since our data is fairly small and the script doesn't require much time
to execute, but it would be pretty annoying if our script took longer to
run.</p>
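<p>As an aside for readers on Python 3, where <code>urllib2</code> no longer
exists (its functionality moved to <code>urllib.request</code>): the parsing
and conversion steps of the script above can be sketched as follows. This is
a minimal sketch, not the script from this post. It assumes three sample rows
copied from the output shown earlier instead of downloading the file, and the
names <code>SAMPLE</code>, <code>HEADERS</code> and <code>parse_iris</code>
are made up for illustration.</p>

```python
import csv
import io

# Three sample rows copied from the output shown earlier in this post,
# plus a trailing blank line like the one the script has to discard.
# Assumption: in a real Python 3 script this text would be downloaded,
# for example with urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
    "\n"
)

HEADERS = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']


def parse_iris(text):
    """Return a list of dictionaries, one per non-blank row of CSV text."""
    rows = csv.reader(io.StringIO(text))
    # csv.reader yields an empty list for a blank line, so "if row" drops it
    return [dict(zip(HEADERS, row)) for row in rows if row]


parsed_data = parse_iris(SAMPLE)

# The csv module returns every field as a string, so convert before plotting
sepal_lengths = [float(row['Sepal Length']) for row in parsed_data]

print(len(parsed_data))
print(sepal_lengths)
```

<p>Filtering with <code>if row</code> skips the blank line at the end of the
data, so there is no need to slice off the last element afterwards.</p>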
<p>Now let's try doing the same thing, but using pandas in an ipython
notebook. Use the terminal to run ipython notebook:</p>
<div class="highlight"><pre><span class="nv">$ </span><span class="nb">cd</span> ~/Projects/ipython-pandas-whoa
<span class="nv">$ </span>ipython notebook
</pre></div>


<p>This will start a local server on your computer, which serves the
ipython notebooks in your working directory (right now you
don't have any) at http://127.0.0.1:8888. This should automatically open
up in your browser.</p>
<p>In the tab that opens in your browser, click "New Notebook". A new
ipython notebook will open up in another tab. A new cell will also open
up. Cells are places where you can write and execute code. This is what
a cell looks like:</p>
<p><img alt="alt text" src="https://raw2.github.com/amygdalama/amygdalama.github.io/master/images/ipythoncell.png" /></p>
<p>To allow matplotlib to display plots within your notebook, rather than
opening up a separate window (like when we executed our .py script),
type the following into the cell:</p>
<div class="highlight"><pre><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</pre></div>


<p>Then to execute the cell, press shift+enter (which will execute the cell
your cursor is currently in and then automatically either move to the
next cell, if one exists, or open up a new cell). Once you press
shift+enter, your notebook should look like:</p>
<p><img alt="alt text" src="https://raw2.github.com/amygdalama/amygdalama.github.io/master/images/ipythoncell.png" /></p>
<p>In the next cell, import the packages we'll be using, namely
matplotlib.pyplot and pandas:</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
</pre></div>


<p>Press shift+enter again to execute the code. (I'm going to stop telling
you to do this; just assume that after each block of code you should
press shift+enter. If you get confused about what your notebook should
look like, look at my <a href="http://nbviewer.ipython.org/github/amygdalama/ipython-pandas-whoa/blob/master/ipython-pandas-whoa.ipynb">example notebook</a>.)</p>
<p>To process the iris data, we can use pandas' built-in <code>read_csv</code>
parser. We can get to an easy-to-use data frame within three lines of
code!</p>
<div class="highlight"><pre><span class="n">url</span> <span class="o">=</span> <span class="s">&quot;http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.data&quot;</span>

<span class="c"># Define our headers since the url doesn&#39;t contain explicit headers</span>
<span class="c"># I found these headers from looking at the documentation at</span>
<span class="c"># http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.names</span>
<span class="n">headers</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;Sepal Length&#39;</span><span class="p">,</span> <span class="s">&#39;Sepal Width&#39;</span><span class="p">,</span> <span class="s">&#39;Petal Length&#39;</span><span class="p">,</span> <span class="s">&#39;Petal Width&#39;</span><span class="p">,</span> <span class="s">&#39;Class&#39;</span><span class="p">]</span>
<span class="n">iris</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">headers</span><span class="p">)</span>
</pre></div>


<p>Let's see what the data looks like.</p>
<div class="highlight"><pre><span class="n">iris</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span>
</pre></div>


<p>You should see a beautifully formatted table! This is a pandas data
frame, which is similar to a data frame in R.</p>
<p><img alt="alt text" src="https://raw2.github.com/amygdalama/amygdalama.github.io/master/images/table.png" /></p>
<p>Now let's plot a histogram for the Sepal Length column.</p>
<div class="highlight"><pre><span class="c"># I use two brackets around &#39;Sepal Length&#39; to force pandas to make this</span>
<span class="c"># a data frame rather than just a series, which is like a numpy array.</span>
<span class="c"># The extra brackets aren&#39;t necessary, but they make printing sepal_lengths</span>
<span class="c"># prettier and make it easier for us to combine sepal_lengths with other</span>
<span class="c"># data.</span>
<span class="n">sepal_lengths</span> <span class="o">=</span> <span class="n">iris</span><span class="p">[[</span><span class="s">&#39;Sepal Length&#39;</span><span class="p">]]</span>

<span class="c"># Make the plot pretty!</span>
<span class="n">pd</span><span class="o">.</span><span class="n">set_option</span><span class="p">(</span><span class="s">&#39;display.mpl_style&#39;</span><span class="p">,</span> <span class="s">&#39;default&#39;</span><span class="p">)</span>

<span class="n">sepal_lengths</span><span class="o">.</span><span class="n">hist</span><span class="p">()</span>
</pre></div>


<p><img alt="alt text" src="https://raw2.github.com/amygdalama/amygdalama.github.io/master/images/quite_elegant.png" /></p>
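<p>To make the single-bracket versus double-bracket distinction from the
comments above concrete, here is a small sketch. It assumes pandas is
installed and runs under Python 3, and it reads three sample rows (copied
from the script output earlier in this post) from an in-memory string rather
than from the URL. The names <code>sample</code>, <code>as_series</code> and
<code>as_frame</code> are made up for illustration.</p>

```python
import io

import pandas as pd

# Three sample rows copied from the output shown earlier in this post
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

headers = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']

# header=None says the data has no header row; names supplies the column labels
iris = pd.read_csv(io.StringIO(sample), header=None, names=headers)

# One pair of brackets selects a single column as a Series
as_series = iris['Sepal Length']

# Two pairs of brackets select a list of columns, which gives a DataFrame
as_frame = iris[['Sepal Length']]

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

<p>Note that <code>read_csv</code> has already parsed the measurements as
floating point numbers here, so there is no separate <code>float</code>
conversion step as in the plain Python script.</p>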
<p>By this point it should be fairly obvious that ipython and pandas are
both awesome for data analysis! They make Python a much more serious
contender as a tool for data analysis by giving you quick and easy
access to your data. In this simple example, barbaric-script.py was
18 lines of code (not counting comments and blank lines), while the
ipython notebook with pandas was only 10!</p>
<p>For further reading, I definitely encourage you to check out Julia's
<a href="https://github.com/jvns/pandas-cookbook">pandas cookbook</a> (presented via ipython notebooks), on which
incidentally I'll be collaborating with her next week (eeeep)!</p>
<p>My code for this post can be found on my <a href="https://github.com/amygdalama/ipython-pandas-whoa">GitHub</a>, and the ipython
notebook can be easily viewed on <a href="http://nbviewer.ipython.org/github/amygdalama/ipython-pandas-whoa/blob/master/ipython-pandas-whoa.ipynb">NBViewer</a>.</p>
      <p>
    <!-- <p> -->
        tags:
           <a href="./tag/data-analysis.html">data analysis</a>
           <a href="./tag/hacker-school.html">hacker school</a>
           <a href="./tag/ipython.html">ipython</a>
           <a href="./tag/matplotlib.html">matplotlib</a>
           <a href="./tag/pandas.html">pandas</a>
           <a href="./tag/python.html">python</a>
    <!-- </p> -->

      </p>
    </div><!-- /.entry-content -->
    <br>
    <div class="comments">
      <h2 class = "other-articles">Comments</h2>
      <div id="disqus_thread"></div>
      <script type="text/javascript">
        var disqus_identifier = "ipython-and-pandas-are-whoa.html";
        (function() {
        var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
        dsq.src = 'http://mathamy.disqus.com/embed.js';
        (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
        })();
      </script>
    </div>

  </article>
</section>
        <section id="extras" class="body">
        </section><!-- /#extras -->

        <footer id="contentinfo" class="body">

        </footer><!-- /#contentinfo -->

    <script type="text/javascript">
    var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
    document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
    </script>
    <script type="text/javascript">
    try {
        var pageTracker = _gat._getTracker("UA-48330831-1");
    pageTracker._trackPageview();
    } catch(err) {}</script>
<script type="text/javascript">
    var disqus_shortname = 'mathamy';
    (function () {
        var s = document.createElement('script'); s.async = true;
        s.type = 'text/javascript';
        s.src = 'http://' + disqus_shortname + '.disqus.com/count.js';
        (document.getElementsByTagName('HEAD')[0] || document.getElementsByTagName('BODY')[0]).appendChild(s);
    }());
</script>
</body>
</html>