3,028 changes: 3,028 additions & 0 deletions .ipynb_checkpoints/DataExploration-checkpoint.ipynb

Large diffs are not rendered by default.

392 changes: 392 additions & 0 deletions .ipynb_checkpoints/model_iteration_1-checkpoint.ipynb
@@ -0,0 +1,392 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Shivali Chandra <br>\n",
"First iteration of model for Titanic Kaggle dataset. <br>\n",
"1/27/16 <br>\n",
"Initial Score: 0.75120 <br>\n",
"Random forests Score: 0.7535 <br>\n",
"Adding more columns Score: 0.7799"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First steps: load libraries used, read from training file and show basic statistics of file"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>714.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>446.000000</td>\n",
" <td>0.383838</td>\n",
" <td>2.308642</td>\n",
" <td>29.699118</td>\n",
" <td>0.523008</td>\n",
" <td>0.381594</td>\n",
" <td>32.204208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>257.353842</td>\n",
" <td>0.486592</td>\n",
" <td>0.836071</td>\n",
" <td>14.526497</td>\n",
" <td>1.102743</td>\n",
" <td>0.806057</td>\n",
" <td>49.693429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.420000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>223.500000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>20.125000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>7.910400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>446.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>28.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>14.454200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>668.500000</td>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>38.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>31.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>891.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>80.000000</td>\n",
" <td>8.000000</td>\n",
" <td>6.000000</td>\n",
" <td>512.329200</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Age SibSp \\\n",
"count 891.000000 891.000000 891.000000 714.000000 891.000000 \n",
"mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n",
"std 257.353842 0.486592 0.836071 14.526497 1.102743 \n",
"min 1.000000 0.000000 1.000000 0.420000 0.000000 \n",
"25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n",
"50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n",
"75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n",
"max 891.000000 1.000000 3.000000 80.000000 8.000000 \n",
"\n",
" Parch Fare \n",
"count 891.000000 891.000000 \n",
"mean 0.381594 32.204208 \n",
"std 0.806057 49.693429 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 7.910400 \n",
"50% 0.000000 14.454200 \n",
"75% 0.000000 31.000000 \n",
"max 6.000000 512.329200 "
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.linear_model import LinearRegression, LogisticRegression\n",
"from sklearn.cross_validation import KFold\n",
"from sklearn import cross_validation\n",
"import numpy as np\n",
"\n",
"titanic = pd.read_csv(\"train.csv\")\n",
"titanic.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cleaning code, filling in NaN values and replacing text values with number codes: "
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())\n",
"\n",
"titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0\n",
"titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1\n",
"\n",
"titanic['Embarked'] = titanic['Embarked'].fillna('S')\n",
"titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0\n",
"titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1\n",
"titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Defining columns used to predict target, generating cross validation folds for the dataset (with random state set to ensure splits are the same every time), initializing predictors and target, training algorithm, and making predictions:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.78787878787878773"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test = pd.read_csv('test.csv')\n",
"test['Age'] = test['Age'].fillna(titanic['Age'].median())\n",
"\n",
"test.loc[test['Sex'] == 'male', 'Sex'] = 0\n",
"test.loc[test['Sex'] == 'female', 'Sex'] = 1\n",
"\n",
"test['Embarked'] = test['Embarked'].fillna('S')\n",
"test.loc[test['Embarked'] == 'S', 'Embarked'] = 0\n",
"test.loc[test['Embarked'] == 'C', 'Embarked'] = 1\n",
"test.loc[test['Embarked'] == 'Q', 'Embarked'] = 2\n",
"\n",
"test['Fare'] = test['Fare'].fillna(titanic['Fare'].median())\n",
"\n",
"alg = LogisticRegression(random_state=1)\n",
"scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)\n",
"scores.mean()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.819304152637\n"
]
}
],
"source": [
"from sklearn import cross_validation\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch']\n",
"titanic['NameLength'] = titanic['Name'].apply(lambda x: len(x))\n",
"test['FamilySize'] = test['SibSp'] + test['Parch']\n",
"test['NameLength'] = test['Name'].apply(lambda x: len(x))\n",
"\n",
"predictors = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]\n",
"\n",
"# Initialize our algorithm with the default paramters\n",
"# n_estimators is the number of trees we want to make\n",
"# min_samples_split is the minimum number of rows we need to make a split\n",
"# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)\n",
"alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=10, min_samples_leaf=5)\n",
"# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)\n",
"scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic[\"Survived\"], cv=3)\n",
"\n",
"# Take the mean of the scores (because we have one for each fold)\n",
"print(scores.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']\n",
"\n",
"alg = LinearRegression()\n",
"kf = KFold(titanic.shape[0], n_folds=3, random_state=1)\n",
"scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)\n",
"scores.mean()\n",
"\n",
"predictions = []\n",
"for train, test in kf:\n",
" train_predictors = (titanic[predictors].iloc[train,:])\n",
" train_target = titanic['Survived'].iloc[train]\n",
" alg.fit(train_predictors, train_target)\n",
" test_predictions = alg.predict(titanic[predictors].iloc[test,:])\n",
" predictions.append(test_predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concatenating three prediction np arrays into one, and mapping the predictions to outcomes. Then, calculating the accuracy: "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"predictions = np.concatenate(predictions, axis=0)\n",
"\n",
"predictions[predictions > 0.5] = 1\n",
"predictions[predictions <= 0.5] = 0\n",
"\n",
"accuracy = sum(predictions[predictions == titanic['Survived']]) / len(predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Computing accuracy score for all cross validation folds, and taking mean of scores"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cleaning test data. Filling missing NaN values and replacing text values with number codes: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generating submission for competition - training algorithm, making predictions, and creating dataframe with the columns needed: "
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"alg.fit(titanic[predictors], titanic['Survived'])\n",
"\n",
"predictions = alg.predict(test[predictors])\n",
"\n",
"submission = pd.DataFrame({\n",
" 'PassengerId': test['PassengerId'],\n",
" 'Survived': predictions\n",
" })"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"submission.to_csv('kaggle.csv', index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}