Now that we have cleaned the data as outlined previously, we can build our first model. Here we are going to develop a logistic regression (LR) model using the cleaned data.

One of the main advantages of LR models is that they give a clear explanation of how the variables influence the outcome, in our case the likelihood of survival. For example, as age increases, how does the risk change? This comes in the form of a coefficient. The main limitation of LR models, from a predictive modelling point of view, is that they are linear models. This means they cannot model complex relationships between variables and outcomes.

Let's get started.

## Load the packages

For this we are going to need a couple of packages, all of which should be included in the Anaconda Python distribution.

We also set some Pandas options as before, and tell Matplotlib to display plots inline in the notebook.

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

pd.set_option("display.max_rows", 10)
pd.set_option("display.max_columns", 101)

%matplotlib inline
```

## Load the data

This is part of a bigger project, so I am hoping you have worked through the first part. If so, you should have a good understanding of what we have done to the data. If not, you can get the data here (Titanic_Clean.csv).

Next we load the data.

```python
df = pd.read_csv('Data/Titanic_Clean.csv')
df
```

If you get an error, make sure you have the data file in the Data folder.

## Modifying the Data

We need to make a few more changes to the data to make it suitable for the LR model. First we need to drop a couple of features, and second we need to add an intercept.

**Dropping some features**

Previously we converted Sex and Embarked to dummy variables. For logistic regression to work correctly we need to make one Sex category and one Embarked category the **'reference' variable**. For example, if we make 'male' the reference, then when 'Sex__female' is zero the model applies to males. When 'Sex__female' is one, the model shows how the risk changes compared to males, all other variables being equal.

In software like SPSS you would state which category is the reference; in R the first level is used by default. In our example we simply drop the reference variable.

Here we drop the male variable and the variable indicating the passenger boarded in Cherbourg. So if all the other Embarked variables are zero and Sex__female is zero, the prediction is for a male who boarded in Cherbourg.

```python
vars_to_drop = ['Sex__male', 'Embarked__C']
df = df.drop(vars_to_drop, axis=1)
df
```
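As a side note, pandas can drop the reference level for you at dummy-creation time with `get_dummies(..., drop_first=True)`. It always drops the first category alphabetically, which matches our choice for Embarked ('C') but would pick 'female' rather than 'male' for Sex, so the toy frame and column names below are illustrative rather than exactly what this project uses:

```python
import pandas as pd

# Toy frame standing in for the raw Titanic columns (illustrative only)
toy = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                    'Embarked': ['C', 'S', 'Q']})

# drop_first=True drops the first category of each variable,
# making it the implicit reference level
dummies = pd.get_dummies(toy, prefix_sep='__', drop_first=True)
print(dummies.columns.tolist())  # ['Sex__male', 'Embarked__Q', 'Embarked__S']
```

Either way works; the important thing is that exactly one category per variable is left out of the model.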

### Intercept

Next we need to add an intercept column: a new column with all the values set to one. The intercept is the predicted outcome when all the other variables are zero. Here I’ve called it ‘_intercept’.

```python
df['_intercept'] = 1
```

### Split the data into x and y

The last thing we will do with the data is split it into two parts; this just makes things easier later on. Here we take the outcome variable (Survived) as y and the explanatory variables (everything else) as x. We can then predict y (survival) from x (the variables).

```python
# Copy df across and drop Survived
x = df.drop('Survived', axis=1)

# Set y as the Survived column; we wrap it in a
# DataFrame to stop it being a plain Series
y = pd.DataFrame(df.Survived)
```

You can view them by simply running x or y.

## The Logistic Regression Model

Wow, after so much playing with the data we can finally build a model. That’s data science for you, 80% cleaning data, 20% model building.

```python
# Make the model
logit = sm.Logit(y, x)

# Fit the model
result = logit.fit()
```

That was easy 🙂

Let's view the model…

```python
print(result.summary())
```

And you should see this…

```
                           Logit Regression Results
==============================================================================
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      705
Method:                           MLE   Df Model:                            8
Date:                Tue, 30 Aug 2016   Pseudo R-squ.:                  0.3326
Time:                        13:53:29   Log-Likelihood:                -321.87
converged:                       True   LL-Null:                       -482.26
                                        LLR p-value:                 1.542e-64
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Pclass         -0.3952      0.104     -3.799      0.000      -0.599      -0.191
Age            -0.0263      0.007     -3.973      0.000      -0.039      -0.013
SibSp          -0.3100      0.127     -2.447      0.014      -0.558      -0.062
Parch          -0.1083      0.120     -0.905      0.365      -0.343       0.126
Fare            0.0045      0.003      1.662      0.096      -0.001       0.010
HasCabin        1.1529      0.286      4.036      0.000       0.593       1.713
Embarked__Q    -0.6688      0.587     -1.139      0.255      -1.819       0.482
Embarked__S    -0.0893      0.258     -0.347      0.729      -0.594       0.415
Sex__female     2.6952      0.217     12.413      0.000       2.270       3.121
===============================================================================
```
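The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. As a quick illustration (a standard transformation, not a step from the original walkthrough), using two coefficients copied from the summary table above:

```python
import numpy as np

# Coefficients copied from the summary table above
coef_sex_female = 2.6952
coef_age = -0.0263

# exp(coef) converts a log-odds coefficient into an odds ratio
print(np.exp(coef_sex_female))  # ~14.8: females had roughly 15x the odds of survival
print(np.exp(coef_age))         # ~0.974: each extra year of age cuts the odds by ~2.6%

# With the fitted model in hand, np.exp(result.params)
# would give all the odds ratios at once
```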
