Skip to content

Files

Latest commit

8771815 · Dec 9, 2018

History

History
155 lines (111 loc) · 3.81 KB

31-ComparisonWithSQL.md

File metadata and controls

155 lines (111 loc) · 3.81 KB
Python Pandas - Comparison with SQL



Since many potential Pandas users have some familiarity with SQL, this page is meant to provide some examples of how various SQL operations can be performed using pandas.
import pandas as pd
url = 'https://raw.github.com/pandasdev/pandas/master/pandas/tests/data/tips.csv'
tips=pd.read_csv(url)
print tips.head()
Its output is as followstotal_bill   tip      sex  smoker  day     time  size
0        16.99  1.01   Female      No  Sun  Dinner      2
1        10.34  1.66     Male      No  Sun  Dinner      3
2        21.01  3.50     Male      No  Sun  Dinner      3
3        23.68  3.31     Male      No  Sun  Dinner      2
4        24.59  3.61   Female      No  Sun  Dinner      4
SELECT
In SQL, selection is done using a comma-separated list of columns that you select (or a * to select all columns) −
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
With Pandas, column selection is done by passing a list of column names to your DataFrametips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Lets check the full program
import pandas as pd

url = 'https://raw.github.com/pandasdev/
pandas/master/pandas/tests/data/tips.csv'
 
tips=pd.read_csv(url)
print tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Its output is as followstotal_bill   tip  smoker     time
0       16.99  1.01      No   Dinner
1       10.34  1.66      No   Dinner
2       21.01  3.50      No   Dinner
3       23.68  3.31      No   Dinner
4       24.59  3.61      No   Dinner
Calling the DataFrame without the list of column names will display all columns (akin to SQLs *).
WHERE
Filtering in SQL is done via a WHERE clause.
  SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using Boolean indexing.
  tips[tips['time'] == 'Dinner'].head(5)
Lets check the full programimport pandas as pd

url = 'https://raw.github.com/pandasdev/
pandas/master/pandas/tests/data/tips.csv'

tips=pd.read_csv(url)
print tips[tips['time'] == 'Dinner'].head(5)
Its output is as followstotal_bill   tip      sex  smoker  day    time  size
0       16.99  1.01   Female     No   Sun  Dinner    2
1       10.34  1.66     Male     No   Sun  Dinner    3
2       21.01  3.50     Male     No   Sun  Dinner    3
3       23.68  3.31     Male     No   Sun  Dinner    2
4       24.59  3.61   Female     No   Sun  Dinner    4
The above statement passes a Series of True/False objects to the DataFrame, returning all rows with True.
GroupBy
This operation fetches the count of records in each group throughout a dataset. For instance, a query fetching us the number of tips left by sexSELECT sex, count(*)
FROM tips
GROUP BY sex;
The Pandas equivalent would betips.groupby('sex').size()
Lets check the full program
import pandas as pd

url = 'https://raw.github.com/pandasdev/
pandas/master/pandas/tests/data/tips.csv'

tips=pd.read_csv(url)
print tips.groupby('sex').size()
Its output is as followssex
Female   87
Male    157
dtype: int64
Top N rows
SQL returns the top n rows using LIMITSELECT * FROM tips
LIMIT 5 ;
The Pandas equivalent would betips.head(5)
Lets check the full exampleimport pandas as pd
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

tips=pd.read_csv(url)
tips = tips[['smoker', 'day', 'time']].head(5)
print tips
Its output is as followssmoker   day     time
0      No   Sun   Dinner
1      No   Sun   Dinner
2      No   Sun   Dinner
3      No   Sun   Dinner
4      No   Sun   Dinner
These are the few basic operations we compared are, which we learnt, in the previous chapters of the Pandas Library.