# Titanic 2 – Logistic Regression

Now that we have cleaned the data as outlined previously, we can build our first model. Here we are going to develop a logistic regression (LR) model using the cleaned data.

One of the main advantages of LR models is that they give a clear explanation of how each variable influences the outcome, in our case the likelihood of survival. For example, as age increases, how does the risk change? This comes in the form of a coefficient. The main limitation of LR models, from a predictive modelling point of view, is that they are linear models. This means they cannot capture complex, non-linear relationships between the variables and the outcome.
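To make the coefficient idea concrete: LR coefficients are on the log-odds scale, so exponentiating one gives an odds ratio. A quick sketch with a hypothetical Age coefficient (the value -0.04 is made up for illustration, not from our model):

```python
import math

# Hypothetical coefficient for Age (illustrative only)
age_coef = -0.04

# Exponentiating a log-odds coefficient gives an odds ratio:
# each extra year of age multiplies the odds of survival by ~0.96,
# i.e. roughly 4% lower odds per year
odds_ratio = math.exp(age_coef)
print(round(odds_ratio, 3))  # -> 0.961
```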

Let's get started.

For this we are going to need a couple of packages, all of which should be included in the Anaconda Python distribution.

We also set some Pandas options as before, along with telling Matplotlib to display plots inline in the notebook.

This is part of a bigger project, so I am hoping you have worked through the first part. If so, you should have a good understanding of what we have done to the data. If not, you can get the data here (Titanic_Clean.csv).

If you get an error make sure you have the datafile in the Data folder.
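Loading the file could look like this; the try/except just gives a friendlier message if the file is not where expected (the path and filename are assumed from part one):

```python
import pandas as pd

# Path and filename assumed from part one of this series
try:
    df = pd.read_csv("Data/Titanic_Clean.csv")
    print(df.shape)
except FileNotFoundError:
    # Most likely cause: the file is not in the Data folder
    print("Titanic_Clean.csv not found - make sure it is in the Data folder")
    df = None
```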

## Modifying the Data

We need to do a few more changes to the data to make it suitable for the LR model. First we need to drop a couple of features and second we need to add an intercept.

### Dropping some features

Previously we converted Sex and Embarked to dummy variables. For logistic regression to work correctly we need to make one Sex category and one Embarked category the 'reference' category. For example, if we make 'male' the reference, then when 'female' is zero the model describes males, and when 'female' is one the model shows the change in risk compared to males, all other variables being equal.

In software like SPSS you would state which variable is the reference; R uses the first level by default. In our example we simply drop the reference variable's column.

Here we drop the male variable and the variable for boarding in Cherbourg. So if all the remaining Embarked variables are zero and the passenger is not female, the prediction is for a male who boarded in Cherbourg.
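A sketch of the drop, using a tiny stand-in frame; the dummy column names (`Sex_male`, `Embarked_C`, etc.) are assumptions, so use whatever names `get_dummies` produced for you in part one:

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this is the cleaned data
df = pd.DataFrame({
    "Survived":   [0, 1, 1],
    "Age":        [22.0, 38.0, 26.0],
    "Sex_male":   [1, 0, 0],
    "Sex_female": [0, 1, 1],
    "Embarked_C": [0, 1, 0],
    "Embarked_Q": [0, 0, 0],
    "Embarked_S": [1, 0, 1],
})

# Drop the reference categories: male and boarding at Cherbourg
df = df.drop(columns=["Sex_male", "Embarked_C"])
print(list(df.columns))
```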

### Intercept

Next we need to add an intercept column. This is a new column with all the values set to one. The intercept is the model's baseline: the predicted outcome when all the other variables are zero. Here I've called it '_intercept'.
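Adding the column is one line; a small stand-in frame keeps the sketch self-contained:

```python
import pandas as pd

# Small stand-in frame; in the notebook this is the cleaned data
df = pd.DataFrame({"Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]})

# statsmodels' Logit does not add a constant automatically,
# so add a column of ones named '_intercept' by hand
df["_intercept"] = 1
```

statsmodels also provides `sm.add_constant(df)`, which does the same thing but names the column 'const'.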

### Split the data into x and y

The last thing we will do with the data is split it into two parts; it just makes things easier later on. Here we take the response variable (Survived) as y and the explanatory variables (everything else) as x. Therefore we can predict y (Survived) based on x (the variables).
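The split itself is two lines; the stand-in frame below mirrors the columns built up in the previous steps:

```python
import pandas as pd

# Stand-in frame with the columns from the steps above
df = pd.DataFrame({
    "Survived":   [0, 1, 1],
    "Age":        [22.0, 38.0, 26.0],
    "Sex_female": [0, 1, 1],
    "_intercept": [1, 1, 1],
})

# y holds the outcome; x holds everything we predict from
y = df["Survived"]
x = df.drop(columns=["Survived"])
```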

You can view them by simply running x or y.

## The Logistic Regression Model

Wow, after so much playing with the data we can finally build a model. That’s data science for you, 80% cleaning data, 20% model building.

That was easy 🙂

Let's view the model…

And you should see this…