PCA in Python with SciKit Learn


Let’s have a quick look at using Principal Component Analysis (PCA) on the Iris dataset.

What is PCA?

The simplest way to describe PCA is that it is one of many dimensionality reduction techniques, but a very popular one. In fact, many of the tutorials or guides you find for machine learning use this technique in their work, even if it’s just for testing.

In short, it takes a lot of dimensions (variables) and reduces them to fewer. The key difference is that once the dataset is transformed the new variables become ‘meaningless’ or ‘nameless’.

The two key places to use PCA (or any dimensionality reduction technique) are to…

  • Reduce the number of features you have – if the dataset is too broad and you perhaps want to train an ML model more quickly.
  • Visualisation – we can only really visualise data in up to 3 dimensions, so PCA can be good for reducing higher-dimensional data down to 2 or 3. Typically most people just display it in 2D.

A more detailed explanation of PCA can be found on Page 65 – [Learning scikit-learn: Machine Learning in Python].


Our plan…

  • Load the IRIS dataset (4 features and 1 target)
  • Visualise the dataset
  • Build the PCA model
  • Transform the IRIS data to 2 dimensions
  • Visualise again

Load the data

The first step is to load the libraries you need. Here I am using the Anaconda distribution of Python 3, so it has everything I need already.

NOTE: %matplotlib inline is there because I am doing the work in Jupyter Notebooks.

I’ll set some options too, to stop pandas displaying too much (or too little) data.
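The setup cell might look something like this (a sketch; the exact option values are my assumption based on the description above):

```python
# %matplotlib inline   <- uncomment in a Jupyter Notebook
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

# Stop pandas displaying too much (or too little) data
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 101)
```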

Next we actually load the data. The ‘from sklearn import datasets’ module contains the dataset, so loading it is easy.

We also split it into X for the input variables and y for the classes.

Let’s have a look at the feature names and the class labels.
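A sketch of the loading step, using the standard sklearn API:

```python
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data      # the 4 input variables
y = iris.target    # the classes (0, 1, 2)

print(iris.feature_names)
print(iris.target_names)
```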

You should see…

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


['setosa' 'versicolor' 'virginica']

Clean the Dataset

It doesn’t need cleaning as such, but I like to work in Pandas DataFrames with small datasets. It lets you see what you are doing a little more clearly.

Here we convert the data (X) to a dataframe and add the feature names (minus spaces and units of measure). Next we add the class labels.
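A sketch of the conversion; the column names (SepalLength, …, Class) are my naming choice, not fixed by the data:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Feature names minus the spaces and units of measure
df = pd.DataFrame(iris.data, columns=['SepalLength', 'SepalWidth',
                                      'PetalLength', 'PetalWidth'])
df['Class'] = iris.target_names[iris.target]   # add the class labels
```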

You should see…

PCA Iris Data



Now we are ready to plot the data.

First let’s get the unique label names.
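One way to do it, assuming the df and Class column built above:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['SepalLength', 'SepalWidth',
                                      'PetalLength', 'PetalWidth'])
df['Class'] = iris.target_names[iris.target]

labels = list(df['Class'].unique())
print(labels)
```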

This will give you…

['setosa', 'versicolor', 'virginica']

We will loop through these and plot each group (it helps colour them up too).
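A sketch of the plotting loop, with the sepal and petal pairings on two subplots:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['SepalLength', 'SepalWidth',
                                      'PetalLength', 'PetalWidth'])
df['Class'] = iris.target_names[iris.target]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for label in df['Class'].unique():
    group = df[df['Class'] == label]
    ax1.scatter(group['SepalLength'], group['SepalWidth'], label=label)
    ax2.scatter(group['PetalLength'], group['PetalWidth'], label=label)
ax1.legend()
plt.show()
```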

Here we created 2 subplots, the first looking at sepal values and the second looking at petal values. Really we could look at each dimension against every other; a scatter matrix is good for that (see: Seaborn).

Our plot looks like this…

I’m not going to comment on much here; you can see how each iris class is represented. Perhaps axis labels would have been a nice addition.


Now let’s do the PCA.

First I put the features back in their own array. I didn’t really need to do this, but I like X to be the inputs.

Next we create the PCA model, which only really needs the number of components; here we are converting the data (X) down to 2 features.

We then fit the PCA model and then use the model to transform the data. I save the transformed data as X_.

Add it all back into a Dataframe, mine is called dfPCA.
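Pulling those steps together, a sketch (with X_ and dfPCA named as in the text):

```python
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['SepalLength', 'SepalWidth',
                                      'PetalLength', 'PetalWidth'])
df['Class'] = iris.target_names[iris.target]

X = df.drop('Class', axis=1).values   # features back in their own array
pca = PCA(n_components=2)             # reduce down to 2 components
X_ = pca.fit_transform(X)             # fit the model, then transform the data

dfPCA = pd.DataFrame(X_, columns=['x1', 'x2'])   # back into a DataFrame
dfPCA['Class'] = df['Class']
```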

We now have this…

PCA Data Transformed

where we can see that we only have 2 features (x1 and x2) and the class label.

Plot the PCA data

Finally let’s plot the PCA features…
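A sketch of the final plot, reusing the same loop idea as before:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X_ = PCA(n_components=2).fit_transform(iris.data)
dfPCA = pd.DataFrame(X_, columns=['x1', 'x2'])
dfPCA['Class'] = iris.target_names[iris.target]

fig, ax = plt.subplots()
for label in dfPCA['Class'].unique():
    group = dfPCA[dfPCA['Class'] == label]
    ax.scatter(group['x1'], group['x2'], label=label)
ax.legend()
plt.show()
```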

which gives us…

PCA Transformed


The PCA transformation has worked well with this data. You should be able to use a range of different classifiers on this new data and they should perform well.

Titanic 3 – Model Evaluation

This part directly follows on from the Titanic Logistic Regression model we built, so you need to work through that part.

NOTE: You should put this code at the bottom of the code you have already created.

Load the package

For this we are going to be using the SciKit Learn metrics package. We load this as shown below. The code says: from the metrics package load everything (*). It is better practice to bring in only the modules we need, but it is simpler to bring them all in.
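The import line is simply:

```python
# From the metrics package load everything (*)
from sklearn.metrics import *
```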

You can put the above code where you left off, I always prefer to put it in the cell with all the other packages. So my code looks like this…

If you do it my way, remember to re-run the cell. Otherwise the package will not load.

Predict all of x

Previously we predicted one row at a time. Now we are going to predict the whole dataset, in the same way as before.

This uses the model called ‘result’ to predict the survival probability for all of the passengers we have data for.

If we run ‘pred’ we can see the predictions as values between 0 and 1.

Confusion Matrix

Now we can start looking at how good our model is by comparing the true survival outcomes against what the model predicted. A nice way of doing this is using a confusion matrix. Below is an example of what a confusion matrix looks like.

Confusion Matrix Example

The confusion matrix can give us a lot of information. At first glance it shows the number of correct predictions and the number of incorrect predictions. Let’s assume the example above shows survival (yes they survived, no they didn’t survive); we can see the following…

  • 100 people that were predicted to survive actually survived (Correct classification)
  • 50 people that were predicted to not survive actually didn’t survive (Correct classification)
  • 10 people were predicted to survive but actually died (False Positive / Type 1 error)
  • 5 people were predicted to die but actually survived (False Negative / Type 2 error)

Some other useful measures can be calculated from the confusion matrix, as discussed on Wikipedia, including:

  • Accuracy
  • Positive Predictive Value (PPV)
  • Negative Predictive Value (NPV)
  • Precision
  • Recall
  • Sensitivity
  • Specificity
  • Etc, etc, etc

So now let’s look at our confusion matrix. In our code we use np.round(pred, 0); this rounds the prediction score to either 0 or 1. This is important because the confusion matrix compares the classifications, i.e. did the person survive or not. By doing this we assume that any prediction over 0.5 means they survived and anything below means they did not. 0.5 is the common cut-off, however the optimal cut-off can be calculated using Youden’s J index (I will create a tutorial at some point).
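The call itself; the y and pred arrays here are fakes built to reproduce the counts discussed below, purely so the snippet runs (in the notebook they come from the data and the model):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Fake data matching the counts from our model
y    = np.array([0] * 424 + [1] * 290)
pred = np.array([0.1] * 363 + [0.9] * 61 + [0.1] * 80 + [0.9] * 210)

cm = confusion_matrix(y, np.round(pred, 0))
print(cm)
# [[363  61]
#  [ 80 210]]
```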

This should give us this…

This is not pretty, and you need to be careful when interpreting it. Our first job is to identify which are the true positives and which are the true negatives, then the false positives and the false negatives.

A nicer way of viewing this is…
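One option is pd.crosstab, which labels the rows and columns (again with fake y/pred standing in for the real ones):

```python
import numpy as np
import pandas as pd

# Fake data matching the counts from our model
y    = np.array([0] * 424 + [1] * 290)
pred = np.array([0.1] * 363 + [0.9] * 61 + [0.1] * 80 + [0.9] * 210)

tab = pd.crosstab(y, np.round(pred, 0),
                  rownames=['Actual'], colnames=['Predicted'])
print(tab)
```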

This gives us…

Titanic Confusion Matrix

Here we can see the performance of our model as…

  • correctly identifying 210 people who survived
  • correctly identifying 363 people who did not survive
  • incorrectly predicting that 61 people would survive when they actually died
  • incorrectly predicting that 80 people would die when they actually survived

Next we want to find out how accurate the model was. We can calculate this manually, or we can quickly run this code…
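The quick way, with sklearn (fake y/pred matching the counts above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y    = np.array([0] * 424 + [1] * 290)
pred = np.array([0.1] * 363 + [0.9] * 61 + [0.1] * 80 + [0.9] * 210)

acc = accuracy_score(y, np.round(pred, 0))
print(acc)   # ~0.8025
```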

My model has an accuracy of 0.80252100…, meaning the model gets it right about 80% of the time.

ROC Plot

The Receiver Operating Characteristic (ROC) plot is a popular method of presenting the performance of a classifier. For this to work your predictions need to be on a scale of 0 to 1, not just 0s and 1s. The plot shows the trade-off between sensitivity and specificity of the model as the threshold changes, plotted as the ‘false positive rate’ (FPR, which is 1 – specificity) against the ‘true positive rate’ (TPR, which is sensitivity).

To produce this we need to calculate the TPR and FPR at different thresholds; SciKit Learn does this for us…
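The roc_curve function returns the FPR and TPR at each threshold (y and pred here are fake stand-ins for the real outcomes and predictions):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Fake outcomes and graded probability predictions
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
pred = np.clip(y * 0.4 + rng.random(200) * 0.6, 0, 1)

fpr, tpr, thresholds = roc_curve(y, pred)
```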

Next we simply need to plot the FPR and TPR.
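A minimal plot, with the same fake stand-in data:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
pred = np.clip(y * 0.4 + rng.random(200) * 0.6, 0, 1)
fpr, tpr, _ = roc_curve(y, pred)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], '--')   # the 50/50 guess line
ax.set_xlabel('False Positive Rate (1 - Specificity)')
ax.set_ylabel('True Positive Rate (Sensitivity)')
plt.show()
```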

This gives us…

Titanic ROC Plot

Note: sensitivity and specificity are more commonly used terms than TPR and FPR; remember FPR is 1 – specificity.

The plot shows a smooth curve, which is good. It shows that we can adjust the threshold to increase sensitivity at the cost of specificity, and vice versa. An example can be seen…

  • If we have a sensitivity of 0.8 we have a specificity of about 0.78 (1 – 0.22)
  • If we change the threshold to increase sensitivity to 0.9 we have a specificity of around 0.4 (1 – 0.6).

From this we also can also calculate the Area Under the Curve (AUC). This is a good measure of performance. An AUC of 0.5 means the model is not very good, it is no better than a 50/50 guess. If we have an AUC of less than 0.5 then something went wrong. When you read clinical papers looking at predicting life/death or illness/no-illness then an AUC greater than 0.7 is good and an AUC greater than 0.8 is very good.

We calculate ours using…
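The auc function takes the FPR and TPR arrays (fake stand-in data again):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
pred = np.clip(y * 0.4 + rng.random(200) * 0.6, 0, 1)
fpr, tpr, _ = roc_curve(y, pred)

score = auc(fpr, tpr)
print(score)
```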

Note: we use the TPR and FPR from the ROC plot.

This gives us an AUC of 0.86328887….. or 86%, which is pretty good.

Titanic 2 – Logistic Regression

So now that we have cleaned the data as outlined previously, we can build our first model. Here we are going to develop a logistic regression (LR) model using the cleaned data.

One of the main advantages of LR models is that they give a clear explanation of how the variables influence the outcome, in our case the likelihood of survival. For example, as age increases how does the risk change? This comes in the form of a coefficient. The main limitation of LR models, from a predictive modelling point of view, is that they are linear models. This means they cannot model complex relationships between variables and outcomes.

Let’s get started.

Load the packages

For this we are going to need a couple of packages, all of these should be included in the Anaconda Python distribution.

We also set some Pandas options as before, along with telling Matplotlib to display the plots inline in the notebook.

Load the data

This is part of a bigger project so I am hoping you have worked through the first part of the project. If so you should have a good understanding of what we have done to the data. If not, you can get the data Here (Titanic_Clean.csv).

Next we load the data.

If you get an error make sure you have the datafile in the Data folder.

Modifying the Data

We need to make a few more changes to the data to make it suitable for the LR model. First we need to drop a couple of features, and second we need to add an intercept.

Dropping some features

Previously we converted Sex and Embarked to dummy variables. For logistic regression to work correctly we need to make one Sex and one Embarked variable the ‘reference’ variable. For example, if we make ‘male’ the reference variable then when ‘female’ is zero the model is for males. If ‘female’ is one, then the model shows the change in risk compared to males, if all other variables are equal.

In software like SPSS you would state which variable is the reference variable; in R it would use the first one. In our example we simply drop the reference variable.

Here we drop the male variable and the variable for boarding in Cherbourg. So if all the other embarked variables are zero and the female variable is zero, then the prediction is for males boarding in Cherbourg.


Next we need to add an intercept column. This is a new column with all the values set to one. The intercept is the predicted outcome when all the other variables are zero. Here I’ve called it ‘_intercept’.
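Both steps might look like this. The dummy column names (Sex_male, Embarked_C, …) follow pd.get_dummies naming and are my assumption, and the two stand-in rows are only there so the snippet runs:

```python
import pandas as pd

# Tiny stand-in for the cleaned Titanic frame
df = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 38.0],
                   'Sex_female': [0, 1], 'Sex_male': [1, 0],
                   'Embarked_C': [0, 1], 'Embarked_Q': [0, 0],
                   'Embarked_S': [1, 0]})

df = df.drop(['Sex_male', 'Embarked_C'], axis=1)   # drop the reference variables
df['_intercept'] = 1                               # a column of ones
```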

Split the data into x and y

The last thing we will do with the data is split it into two parts; it just makes things easier later on. Here we split out the response variable (Survived) as y and the explanatory variables (everything else) as x. Therefore we can predict y (survived) based on x (the variables).
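The split itself is one line each (stand-in rows again, and I am assuming the outcome column is called Survived):

```python
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 38.0],
                   'Sex_female': [0, 1], '_intercept': [1, 1]})   # stand-in

y = df['Survived']                 # what we want to predict
x = df.drop('Survived', axis=1)    # everything else
```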

You can view them by simply running x or y.

The Logistic Regression Model

Wow, after so much playing with the data we can finally build a model. That’s data science for you: 80% cleaning data, 20% model building.

That was easy 🙂

Let’s view the model…

And you should see this…

Titanic 1 – Data Exploration

In this section we are going to explore the dataset. This is an important step in the statistical/machine learning process as firstly we need to know more about the data we are using, and secondly we need to make a few alterations to the data itself.

This tutorial is part of the Titanic Project, you need to read that first..!

Make sure you have set up your workspace and grabbed the data, as explained here.

Next you will need to create a new notebook file in the titanic folder. I’ve called mine ‘Data Clean and Explore’.

Load the packages we need

For this part of the project we are going to need the following…

  • Pandas – for loading and storing the data

Now let’s load them…
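The cell might be as simple as:

```python
import pandas as pd

# Limit the rows shown to 10, but allow up to 101 columns
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 101)
```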

I’ve added a couple of extra lines; pd.set_option sets some pandas options (as I type this I realise how obvious that is). We are limiting the number of rows displayed to 10 but increasing the number of columns to 101. I find it more useful to see a lot of columns.

Load the data

So now we need to load the data in. I always try to name my dataframes df. It’s easier to write, it is the name most commonly used in guides on the internet, etc. You have to think a little harder when you have multiple dataframes, e.g. dfFood, dfPeople.

If this doesn’t work you either didn’t set up the folders correctly or you didn’t download the file correctly, or a bit of both.

You should be able to see all of the columns and 10 rows of data. The data means….

  • survival – Survival (0 = No; 1 = Yes)
  • pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • name – Name
  • sex – Sex
  • age – Age
  • sibsp – Number of Siblings/Spouses Aboard
  • parch – Number of Parents/Children Aboard
  • ticket – Ticket Number
  • fare – Passenger Fare
  • cabin – Cabin
  • embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

What does all this tell us? First we need to know the question or focus of the work: we are building a predictive model. We are also interested in finding out what features influenced survival chances. An example can be seen when asking, ‘are men less likely to survive than women?’ Remember this was all before the Equality Act 2010, so ‘women and children first’..!

So in our work we are going to need to make some changes to the data, and delete some columns.

Data cleaning and manipulation

Quickly view the feature names.

Next we are going to drop ‘PassengerId’, ‘Name’ and ‘Ticket’. The reasons for these are…

  • PassengerId – this is a unique incrementing number, it should not influence the ‘risk’
  • Name – I don’t really want to know if people called Bob are more likely to survive than someone called Dave, plus we don’t have enough data to support this.
  • Ticket – Looks messy, and for similar reasons to PassengerId

So we drop them…
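The drop call, shown on a two-row stand-in frame (in the notebook df holds the full Titanic data):

```python
import pandas as pd

df = pd.DataFrame({'PassengerId': [1, 2], 'Name': ['A', 'B'],
                   'Ticket': ['T1', 'T2'], 'Survived': [0, 1]})   # stand-in

df = df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
```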

Next we need to look at the Cabin. We are going to be lazy in our example and change it to a yes or no variable, i.e. did they have a cabin or not. A better way of doing this would be to split the cabin value to get the first letter, as usually this represents which deck the cabin was on. That could be useful, so it can be your homework 😉

Just to test, we can see which passengers did not have a cabin by using this line…
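Something like this, shown on stand-in rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan]})   # stand-in rows

print(df['Cabin'].isnull().sum())   # 687 on the real data
```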

We can see that there are 687 people who did not have a cabin.

Next let’s create a new variable called ‘HasCabin’; this will initially be set as NA, then set to 1 for having a cabin and 0 for not having one.
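A sketch of the HasCabin logic (stand-in rows again):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan]})   # stand-in rows

df['HasCabin'] = np.nan                        # initially NA
df.loc[df['Cabin'].isnull(), 'HasCabin'] = 0   # no cabin
df.loc[df['Cabin'].notnull(), 'HasCabin'] = 1  # has a cabin

df['HasCabin'].value_counts()   # 687 zeros and 204 ones on the real data
```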

We can see the counts using…

Showing that 687 do not have a cabin and 204 do.

Next we drop the original Cabin column..
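Dropping it is one line (stand-in rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cabin': [np.nan, 'C85'], 'HasCabin': [0, 1]})   # stand-in

df = df.drop('Cabin', axis=1)
```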

We should see our dataset looking nice and clean, but there is still more to do.

Create the dummy variables

Yes, creating dummy variables is a real term, I didn’t just make it up. Dummy variables essentially change categorical variables into numbers, as statistical models and machine learning models don’t like text. (I’ll try to write a short post on ordinal and nominal categorical variables sometime.)

We need to create dummy variables for Embarked and Sex. This is easy with Pandas.
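With pd.get_dummies; the prefix argument (and so the resulting column names) is my assumption:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female'],
                   'Embarked': ['S', 'C']})   # stand-in rows

dummies_sex = pd.get_dummies(df['Sex'], prefix='Sex')
dummies_embarked = pd.get_dummies(df['Embarked'], prefix='Embarked')
```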

Now we just need to join them to the df dataframe.

Now we should see the dataset with 891 rows and 14 columns.

Next we need to drop the original columns.
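The join and the drop together might look like this (stand-in rows):

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female'], 'Embarked': ['S', 'C'],
                   'Survived': [0, 1]})   # stand-in rows

# Join the dummy columns on, then drop the original text columns
df = pd.concat([df,
                pd.get_dummies(df['Sex'], prefix='Sex'),
                pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)
df = df.drop(['Sex', 'Embarked'], axis=1)
```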

Only keep complete cases

Another thing that statistical models and machine learning models don’t like is missing values, so we will keep only the rows which are complete.
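dropna does the job (stand-in rows with one missing Age):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1, 1],
                   'Age': [22.0, np.nan, 54.0]})   # stand-in rows

df = df.dropna()   # keep only the complete rows
```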

We have now gone from 891 rows to 714 rows.

Save the data

We are going to use this data for the next parts, so we need to save the data out.
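The save step; the makedirs line is a convenience I have added so the Data folder is guaranteed to exist, and the df here is a stand-in:

```python
import os
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 54.0]})   # stand-in

os.makedirs('Data', exist_ok=True)
df.to_csv('Data/Titanic_Clean.csv', index=False)
```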

We’ve saved the dataframe as Titanic_Clean.csv in the Data folder.


See Part 2 – Visualisation or Part 3 – Logistic Regression.

Titanic Project

Here we look at different predictive models using Python for data science and machine learning. The aim is to run through some different methods to develop a basic understanding of the workflow for producing different explanatory and predictive models.

Setup the workspace and get the data

Like all of my projects, first you need to set up a root folder. So create a folder called ‘Titanic’; this folder should be used for all your .ipynb files (if you are using Jupyter Notebook) or .py files (if you are just using a Python IDE). Next, in the Titanic folder create a folder called ‘Data’; this is where we will store the data file.

The data we are using here comes from Kaggle, you can read all about it here. For this work you should download this version, it is exactly the same but just renamed to Titanic.csv. Download the file to the Data folder.

Next you can have a look at the following parts…

Basic Imputation in R

Impin' ain't easy

Hello and welcome to another R Stats adventure (using the Stampy Longnose voice). Today we’re looking at imputation, or the guessing/estimation of missing values in data. Be warned: once you impute data you bias your findings. Motivation: in my main data set I started with 6,000 records, reduced to 800 (selection criteria), and of these 400 had missing values. So my 6,000-record data set became only worth 400.


R vs Matlab vs Python (My Answer)

My Head Hurts

So some time back I started an ongoing post trying to compare R, Matlab and Python. Well, my answer is simple: there is no answer. And if you think differently you should ask yourself, “am I just being a fan-boy”?

If you want to read my previous post it is here; if not, here is a quick summary of all 3…

Matlab
  • Proprietary meaning you will have to pay
  • Powerful
  • GUIs for stuff. Want to build an ANN? There’s a GUI. Fancy a blast with Fuzzy Logic? Guess what, there’s a GUI.
  • Lots of toolboxes (but you pay for)
  • Has Simulink, cool for rapid experimenting. Plug in a webcam and you can be object tracking in minutes (maybe 100 minutes, but still minutes)
  • I hate how the windows go behind other windows (sorry, had to be said)
  • Plenty of webinars

R / RStudio

  • Not Proprietary, everything is free
  • Rapidly became HUGE, like it’s everywhere. Want a job in machine learning or to be taken seriously in statistics? Then you need R on your CV..! (Facebook, Microsoft, Google all want R)
  • No GUIs, now that is painful. It just makes everything that little bit harder to see what is going on.
  • Lots of toolboxes (but called packages) and they are all FREE
  • Too many toolboxes, yep, also a curse. You always find a couple of toolboxes doing the same thing, so which is best?
  • RStudio makes it so much more user friendly than the standard R environment. Don’t even try R without RStudio, seriously just don’t..!
  • Why am I still using <- when I know = works?

Python (more SciKit Learn really)

  • Rapid development
  • Open source
  • Multiple environments (Spyder and Notebook are my favourite)
  • Grown in strength
  • I have to question, will it replace R? I don’t know, some people love it, others like R. We’ll have to see
  • Syntax like you’ve never seen before, seriously my tab key has worn down..!
  • Maybe getting over complex. You’ll need to get to grips with Pandas and NumPy. I found handling data formats a bit of a pain.
  • Matplotlib outputs look a little naff, maybe I needed to play with it more
  • Some good Deep Learning stuff out there (thinking of Theano)
  • Finally, Anaconda, you need this distribution of Python.

So that’s it, all 3 are good. It depends on what you want to use them for. My fan-boy opinion is currently R: it looks good on the CV, has loads of packages and the graphs look nice. Also, sooo much support for everything you want to do.

Data Linkage and Anonymisation Workshop

Just got back from a workshop held at the Turing Gateway to Mathematics in Cambridge. The event had a range of fantastic speakers discussing issues around data linkage and data privacy.

Data Linkage

Chris Dibben (University of Edinburgh) introduced the idea of linking different sets of data about subjects from multiple sources. He gave the example of how data is collected from pre-birth (pregnancy records) all the way through a person’s life until death (or just after). All of this data is stored in different locations with no unique identifier.
