3,028 changes: 3,028 additions & 0 deletions .ipynb_checkpoints/DataExploration-checkpoint.ipynb

Large diffs are not rendered by default.

392 changes: 392 additions & 0 deletions .ipynb_checkpoints/model_iteration_1-checkpoint.ipynb
@@ -0,0 +1,392 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Shivali Chandra <br>\n",
"First iteration of model for Titanic Kaggle dataset. <br>\n",
"1/27/16 <br>\n",
"Initial Score: 0.75120 <br>\n",
"Random forests Score: 0.7535 <br>\n",
"Adding more columns Score: 0.7799"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First steps: load libraries used, read from training file and show basic statistics of file"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>714.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>446.000000</td>\n",
" <td>0.383838</td>\n",
" <td>2.308642</td>\n",
" <td>29.699118</td>\n",
" <td>0.523008</td>\n",
" <td>0.381594</td>\n",
" <td>32.204208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>257.353842</td>\n",
" <td>0.486592</td>\n",
" <td>0.836071</td>\n",
" <td>14.526497</td>\n",
" <td>1.102743</td>\n",
" <td>0.806057</td>\n",
" <td>49.693429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.420000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>223.500000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>20.125000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>7.910400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>446.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>28.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>14.454200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>668.500000</td>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>38.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>31.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>891.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>80.000000</td>\n",
" <td>8.000000</td>\n",
" <td>6.000000</td>\n",
" <td>512.329200</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Age SibSp \\\n",
"count 891.000000 891.000000 891.000000 714.000000 891.000000 \n",
"mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n",
"std 257.353842 0.486592 0.836071 14.526497 1.102743 \n",
"min 1.000000 0.000000 1.000000 0.420000 0.000000 \n",
"25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n",
"50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n",
"75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n",
"max 891.000000 1.000000 3.000000 80.000000 8.000000 \n",
"\n",
" Parch Fare \n",
"count 891.000000 891.000000 \n",
"mean 0.381594 32.204208 \n",
"std 0.806057 49.693429 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 7.910400 \n",
"50% 0.000000 14.454200 \n",
"75% 0.000000 31.000000 \n",
"max 6.000000 512.329200 "
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.linear_model import LinearRegression, LogisticRegression\n",
"from sklearn.cross_validation import KFold\n",
"from sklearn import cross_validation\n",
"import numpy as np\n",
"\n",
"titanic = pd.read_csv(\"train.csv\")\n",
"titanic.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cleaning code, filling in NaN values and replacing text values with number codes: "
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())\n",
"\n",
"titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0\n",
"titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1\n",
"\n",
"titanic['Embarked'] = titanic['Embarked'].fillna('S')\n",
"titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0\n",
"titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1\n",
"titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Defining columns used to predict target, generating cross validation folds for the dataset (with random state set to ensure splits are the same every time), initializing predictors and target, training algorithm, and making predictions:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.78787878787878773"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test = pd.read_csv('test.csv')\n",
"test['Age'] = test['Age'].fillna(titanic['Age'].median())\n",
"\n",
"test.loc[test['Sex'] == 'male', 'Sex'] = 0\n",
"test.loc[test['Sex'] == 'female', 'Sex'] = 1\n",
"\n",
"test['Embarked'] = test['Embarked'].fillna('S')\n",
"test.loc[test['Embarked'] == 'S', 'Embarked'] = 0\n",
"test.loc[test['Embarked'] == 'C', 'Embarked'] = 1\n",
"test.loc[test['Embarked'] == 'Q', 'Embarked'] = 2\n",
"\n",
"test['Fare'] = test['Fare'].fillna(titanic['Fare'].median())\n",
"\n",
"alg = LogisticRegression(random_state=1)\n",
"scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)\n",
"scores.mean()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.819304152637\n"
]
}
],
"source": [
"from sklearn import cross_validation\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch']\n",
"titanic['NameLength'] = titanic['Name'].apply(lambda x: len(x))\n",
"test['FamilySize'] = test['SibSp'] + test['Parch']\n",
"test['NameLength'] = test['Name'].apply(lambda x: len(x))\n",
"\n",
"predictors = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]\n",
"\n",
"# Initialize our algorithm with the default paramters\n",
"# n_estimators is the number of trees we want to make\n",
"# min_samples_split is the minimum number of rows we need to make a split\n",
"# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)\n",
"alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=10, min_samples_leaf=5)\n",
"# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)\n",
"scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic[\"Survived\"], cv=3)\n",
"\n",
"# Take the mean of the scores (because we have one for each fold)\n",
"print(scores.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']\n",
"\n",
"alg = LinearRegression()\n",
"kf = KFold(titanic.shape[0], n_folds=3, random_state=1)\n",
"scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)\n",
"scores.mean()\n",
"\n",
"predictions = []\n",
"for train, test in kf:\n",
" train_predictors = (titanic[predictors].iloc[train,:])\n",
" train_target = titanic['Survived'].iloc[train]\n",
" alg.fit(train_predictors, train_target)\n",
" test_predictions = alg.predict(titanic[predictors].iloc[test,:])\n",
" predictions.append(test_predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concatenating three prediction np arrays into one, and mapping the predictions to outcomes. Then, calculating the accuracy: "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"predictions = np.concatenate(predictions, axis=0)\n",
"\n",
"predictions[predictions > 0.5] = 1\n",
"predictions[predictions <= 0.5] = 0\n",
"\n",
"accuracy = sum(predictions[predictions == titanic['Survived']]) / len(predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Computing accuracy score for all cross validation folds, and taking mean of scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cleaning test data. Filling missing NaN values and replacing text values with number codes: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generating submission for competition - training algorithm, making predictions, and creating dataframe with the columns needed: "
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"alg.fit(titanic[predictors], titanic['Survived'])\n",
"\n",
"predictions = alg.predict(test[predictors])\n",
"\n",
"submission = pd.DataFrame({\n",
" 'PassengerId': test['PassengerId'],\n",
" 'Survived': predictions\n",
" })"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission.to_csv('kaggle.csv', index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}