Titanic 1 – Data Exploration

In this section we are going to explore the dataset. This is an important step in the statistical/machine learning process: first we need to learn more about the data we are using, and second we need to make a few alterations to the data itself.

This tutorial is part of the Titanic Project, so you should read that first!

Make sure you have set up your workspace and grabbed the data, as explained here.

Next you will need to create a new notebook file in the titanic folder. I’ve called mine ‘Data Clean and Explore’.

Load the packages we need

For this part of the project we are going to need the following…

  • Pandas – for loading and storing the data

Now let's load them…

I’ve added a couple of extra lines: pd.set_option sets some pandas display options (as I type this I realise how obvious that is). We are limiting the number of rows shown to 10 but increasing the number of columns to 101, as I find it more useful to see a lot of columns.
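The setup cell would look something like this (a minimal sketch; the option values are the ones mentioned above):

```python
import pandas as pd

# Display options: show at most 10 rows, but up to 101 columns
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 101)
```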

Load the data

So now we need to load the data in. I always try to name my dataframe df – it’s easier to write, it’s what most guides on the internet use, and so on. You have to think a little harder when you have multiple dataframes, e.g. dfFood, dfPeople.
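Loading is a one-liner with pd.read_csv. In the notebook this would read the file from the Data folder (the exact path depends on how you set up your workspace); to keep this snippet runnable on its own, it reads an inline two-row sample instead:

```python
import io
import pandas as pd

# In the notebook: df = pd.read_csv('Data/train.csv')
# (path is an assumption from the folder setup). Here we read an inline
# two-row sample so the snippet runs on its own.
csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,\"Braund, Mr. Owen Harris\",male,22,1,0,A/5 21171,7.25,,S\n"
    "2,1,1,\"Cumings, Mrs. John Bradley\",female,38,1,0,PC 17599,71.2833,C85,C\n"
)
df = pd.read_csv(csv)
print(df.shape)  # → (2, 12)
```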

If this doesn’t work you either didn’t set up the folders correctly or you didn’t download the file correctly, or a bit of both.

You should be able to see all of the columns and 10 rows of data. The data means….

  • survival – Survival (0 = No; 1 = Yes)
  • pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • name – Name
  • sex – Sex
  • age – Age
  • sibsp  – Number of Siblings/Spouses Aboard
  • parch – Number of Parents/Children Aboard
  • ticket – Ticket Number
  • fare – Passenger Fare
  • cabin – Cabin
  • embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

What does all this tell us? First we need to know the question or focus of the work: we are building a predictive model. We are also interested in finding out which features influenced survival chances. An example can be seen when asking, ‘are men less likely to survive than women’? Remember this was all before the Equality Act 2010, so ‘women and children first’..!

So in our work we are going to need to make some changes to the data, and delete some columns.

Data cleaning and manipulation

Quickly view the feature names.
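Viewing the feature names is just a matter of inspecting df.columns (shown here on a tiny hypothetical stand-in frame so the snippet runs standalone):

```python
import pandas as pd

# Tiny stand-in for the loaded Titanic frame (hypothetical values)
df = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Pclass': [3, 1]})

# In the notebook, simply: df.columns
print(list(df.columns))  # → ['PassengerId', 'Survived', 'Pclass']
```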

Next we are going to drop ‘PassengerId’, ‘Name’ and ‘Ticket’. The reasons for these are…

  • PassengerId – this is a unique incrementing number, it should not influence the ‘risk’
  • Name – I don’t really want to know if people called Bob are more likely to survive than someone called Dave, plus we don’t have enough data to support this.
  • Ticket – Looks messy, and for similar reasons to PassengerId

So we drop them…
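The drop itself would look roughly like this (demonstrated on a small stand-in frame with just a few of the real columns):

```python
import pandas as pd

# Stand-in frame with a few of the Titanic columns
df = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 1],
    'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley'],
    'Ticket': ['A/5 21171', 'PC 17599'],
    'Fare': [7.25, 71.2833],
})

# Drop the three columns; axis=1 means "columns", not rows
df = df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
print(list(df.columns))  # → ['Survived', 'Fare']
```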

Next we need to look at the Cabin. We are going to be lazy in our example and change it to a Yes or No variable – did they have a cabin or not. A better way would be to split the cabin value and take the first letter, as this usually represents which deck the cabin was on. That could be useful – consider it your homework 😉

Just to test, we can see which passengers did not have a cabin by using this line…
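The filter uses isnull on the Cabin column, something like this (on a three-row stand-in where None plays the missing value):

```python
import pandas as pd

# Stand-in: None plays the role of a missing Cabin value
df = pd.DataFrame({'Survived': [0, 1, 1], 'Cabin': [None, 'C85', None]})

# Rows where Cabin is missing; len(...) gives the count
no_cabin = df[df['Cabin'].isnull()]
print(len(no_cabin))  # → 2
```

On the real data this count comes out at 687.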

We can see that there are 687 people who did not have a cabin.

Next let's create a new variable called ‘HasCabin’. This will initially be set to NA, then set to 1 for having a cabin and 0 for not having one.
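That three-step assignment would look something like this (again on a stand-in frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cabin': [None, 'C85', None]})

# Start with NA, then fill in 1/0 depending on whether Cabin is present
df['HasCabin'] = np.nan
df.loc[df['Cabin'].notnull(), 'HasCabin'] = 1
df.loc[df['Cabin'].isnull(), 'HasCabin'] = 0
print(df['HasCabin'].tolist())  # → [0.0, 1.0, 0.0]
```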

We can see the counts using…
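value_counts does the tallying, roughly like so:

```python
import pandas as pd

df = pd.DataFrame({'HasCabin': [0, 1, 0, 0]})

# In the notebook: df['HasCabin'].value_counts()
counts = df['HasCabin'].value_counts()
print(counts.to_dict())  # → {0: 3, 1: 1}
```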

This shows 687 passengers who do not have a cabin and 204 who do.

Next we drop the original Cabin column..
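Dropping it is the same pattern as before:

```python
import pandas as pd

df = pd.DataFrame({'Cabin': [None, 'C85'], 'HasCabin': [0, 1]})

# Cabin is redundant now that HasCabin carries the information we want
df = df.drop(['Cabin'], axis=1)
print(list(df.columns))  # → ['HasCabin']
```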

We should see our dataset looking nice and clean, but there is still more to do.

Create the dummy variables

Yes, creating dummy variables is a real term, I didn’t just make it up. Dummy variables essentially change categorical variables into numbers. Statistical models and machine learning models don’t like text. (I’ll try to write a short post on ordinal and nominal categorical variables sometime).

We need to create dummy variables for Embarked and Sex. This is easy with Pandas.
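pd.get_dummies does the work; the prefix argument keeps the new column names readable. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q'], 'Sex': ['male', 'female', 'male']})

# One new column per category; prefix keeps the names readable
embarked_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
sex_dummies = pd.get_dummies(df['Sex'], prefix='Sex')
print(list(embarked_dummies.columns))  # → ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(sex_dummies.columns))       # → ['Sex_female', 'Sex_male']
```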

Now we just need to join them to the df dataframe.
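Joining works because the dummy frames share df's index. Something like this (the variable names are my assumption):

```python
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1], 'Embarked': ['S', 'C'], 'Sex': ['male', 'female']})
embarked_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
sex_dummies = pd.get_dummies(df['Sex'], prefix='Sex')

# join aligns on the shared index, adding the dummy columns alongside
df = df.join([embarked_dummies, sex_dummies])
print(df.shape)  # → (2, 7)
```

On the real data this is where you end up with 891 rows and 14 columns.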

Now we should see the dataset with 891 rows and 14 columns.

Next we drop the original Embarked and Sex columns.
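The same drop pattern once more, now that the dummies carry the information:

```python
import pandas as pd

df = pd.DataFrame({
    'Embarked': ['S', 'C'], 'Sex': ['male', 'female'],
    'Embarked_C': [0, 1], 'Embarked_S': [1, 0],
    'Sex_female': [0, 1], 'Sex_male': [1, 0],
})
df = df.drop(['Embarked', 'Sex'], axis=1)
print(list(df.columns))  # → ['Embarked_C', 'Embarked_S', 'Sex_female', 'Sex_male']
```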

Only keep complete cases

Another thing that the statistical models and machine learning models don’t like is missing values, so we will only keep the rows which are complete.
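dropna keeps only the complete rows. On the real data it is the missing Age values that account for most of the dropped rows; a sketch:

```python
import numpy as np
import pandas as pd

# Age carries most of the missing values in the real data
df = pd.DataFrame({'Survived': [0, 1, 1], 'Age': [22.0, np.nan, 38.0]})

# Keep only rows with no missing values in any column
df = df.dropna()
print(len(df))  # → 2
```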

We have now gone from 891 rows to 714 rows.

Save the data

We are going to use this data for the next parts, so we need to save the data out.
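to_csv writes the frame out; index=False stops pandas writing the row index as an extra column. In the notebook the path would be inside the Data folder (my assumption from the earlier setup); here the snippet writes to an in-memory buffer so it leaves no files behind:

```python
import io
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 38.0]})

# In the notebook: df.to_csv('Data/Titanic_Clean.csv', index=False)
# index=False stops pandas writing the row index as an extra column
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # → Survived,Age
```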

We’ve saved the dataframe as Titanic_Clean.csv in the Data folder.

Next

See Part 2 – Visualisation or Part 3 – Logistic Regression.
