PCA in Python with SciKit Learn

PCA Transformed

Let's have a quick look at using Principal Component Analysis (PCA) on the Iris dataset.

What is PCA?

The simplest way to describe PCA is that it is one of many dimensionality reduction techniques, but a very popular one. In fact, many of the tutorials or guides you find for machine learning use this technique in their work, even if it's just for testing.

In short, it takes a lot of dimensions (variables) and reduces them to fewer. The key difference is that once the dataset is transformed, the new variables become 'meaningless' or 'nameless'.

The two key places to use PCA (or any dimensionality reduction technique) are to…

  • Reduce the number of features you have – if the dataset is too broad and you perhaps want to train a ML model quicker.
  • Visualisation – we can only really visualise data in 3 dimensions, so PCA can be good for reducing higher dimensions to 2 or 3. Typically most people just display it in 2D.

A more detailed explanation of PCA can be found on Page 65 – [Learning scikit-learn: Machine Learning in Python].


Our plan…

  • Load the IRIS dataset (4 features and 1 target)
  • Visualise the dataset
  • Build the PCA model
  • Transform the IRIS data to 2 dimensions
  • Visualise again

Load the data

The first step is to load the libraries you need. Here I am using the Anaconda distribution of Python 3, so it has everything I need already.

NOTE: %matplotlib inline – this is because I am doing the work in Jupyter Notebooks.

I’ll set some options too, to stop pandas displaying too much (or too little) data.
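The original code isn't reproduced in this extract, but a minimal sketch of the set-up (the option values are my choice) might look like this:

```python
# Load the libraries (all included with Anaconda)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

# %matplotlib inline   # uncomment when running in a Jupyter Notebook

# Stop pandas displaying too much (or too little) data
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 101)
```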

Next we actually load the data. The 'from sklearn import datasets' line gives us access to the dataset, so loading it is easy.

We also split it into X for the input variables and y for the classes.

Let's have a look at the feature names and the class labels.
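A minimal sketch of the loading and inspection steps:

```python
from sklearn import datasets

# Load the bundled Iris dataset
iris = datasets.load_iris()
X = iris.data    # the 4 input features
y = iris.target  # the class labels (0, 1, 2)

print(iris.feature_names)
print(iris.target_names)
```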

You should see…

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


['setosa' 'versicolor' 'virginica']

Clean the Dataset

It doesn’t need cleaning as such, but I like to work in Pandas Dataframes with small datasets. It lets you see what you are doing a little clearer.

Here we convert the data (X) to a dataframe and add the feature names (minus spaces and units of measure). Next we add the class labels.
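A sketch of this conversion step (the cleaned column names are my choice; the post only says they drop spaces and units):

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Feature names minus spaces and units of measure
df = pd.DataFrame(iris.data,
                  columns=['SepalLength', 'SepalWidth',
                           'PetalLength', 'PetalWidth'])

# Add the class labels as text
df['Class'] = [iris.target_names[i] for i in iris.target]
```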

You should see…

PCA Iris Data



Now we are ready to plot the data.

First lets get the unique label names.

This will give you…

[‘setosa’, ‘versicolor’, ‘virginica’]

We will loop through these and plot each group (helps colour them up too).

Here we created 2 subplots, the first looking at sepal values and the second looking at petal values. Really we could look at each dimension against every other; a scatter matrix is good for that (see: Seaborn).
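The plotting loop could be sketched like this (column names and styling are my assumptions; the Agg backend line is only needed when running outside a notebook):

```python
import matplotlib
matplotlib.use('Agg')  # headless runs only; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['SepalLength', 'SepalWidth',
                                      'PetalLength', 'PetalWidth'])
df['Class'] = [iris.target_names[i] for i in iris.target]

# Get the unique label names
labels = df['Class'].unique()

# Loop through the labels and plot each group (helps colour them up too)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for label in labels:
    group = df[df['Class'] == label]
    ax1.scatter(group['SepalLength'], group['SepalWidth'], label=label)
    ax2.scatter(group['PetalLength'], group['PetalWidth'], label=label)
ax1.set_title('Sepal')
ax2.set_title('Petal')
ax1.legend()
plt.savefig('iris_scatter.png')
```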

Our plot looks like this…

I'm not going to comment much here; you can see how each iris class is represented. Perhaps axis labels would have been a nice addition.


Now let's do the PCA.

First I put the features back in their own array. I didn't really need to do this, but I like X to be the inputs.

Next we create the PCA model, which only really needs the number of components; here we are reducing the data (X) down to 2 features.

We then fit the PCA model and then use it to transform the data. I save the transformed data as X_.

Add it all back into a Dataframe, mine is called dfPCA.
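The steps above could be sketched like this (column names in the final DataFrame match the x1/x2 labels mentioned later):

```python
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data

# Create the PCA model: reduce the 4 features down to 2 components
pca = PCA(n_components=2)

# Fit the model, then use it to transform the data
X_ = pca.fit_transform(X)

# Add it all back into a DataFrame
dfPCA = pd.DataFrame(X_, columns=['x1', 'x2'])
dfPCA['Class'] = [iris.target_names[i] for i in iris.target]
```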

We now have this…

PCA Data Transformed

where we can see that we only have 2 features (x1 and x2) and the class label.

Plot the PCA data

Finally let's plot the PCA features…

which gives us…PCA Transformed


The PCA transformation has worked well with this data. You should be able to use a range of different classifiers on this new data and they should perform well.

Titanic 3 – Model Evaluation

This part directly follows on from the Titanic Logistic Regression model we built, so you need to work through that part.

NOTE: You should put this code at the bottom of the code you have already created from this.

Load the package

For this we are going to be using the SciKit Learn metrics package. We load this as shown below. This code says: from the metrics package, load everything (*). It is better practice to import only the modules we need, but it is simpler to bring them all in.
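The import line described above is just:

```python
# From the metrics package, load everything (*)
from sklearn.metrics import *
```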

You can put the above code where you left off, I always prefer to put it in the cell with all the other packages. So my code looks like this…

If you do it my way, remember to re-run the cell. Otherwise the package will not load.

Predict all of x

Previously we predicted one row at a time. Now we are going to predict the whole dataset, in the same way as before.

This uses the model called 'result' to predict the survival probability for all of the passengers we have data for.

If we run ‘pred’ we can see the predictions as values between 0 and 1.

Confusion Matrix

Now we can start looking at how good our model is by comparing the true survival outcomes against what the model predicted. A nice way of doing this is with a confusion matrix. Below is an example of what a confusion matrix looks like.

Confusion Matrix Example

The confusion matrix can give us a lot of information. At first glance it shows the number of correct predictions and the number of incorrect predictions. Let's assume the example above shows survival (yes they survived, no they didn't survive); we can see the following…

  • 100 people that were predicted to survive actually survived (correct classification)
  • 50 people that were predicted to not survive actually didn't survive (correct classification)
  • 10 people were predicted to survive but actually died (False Positive / Type 1 error)
  • 5 people were predicted to die but actually survived (False Negative / Type 2 error)

Some other useful measures can be calculated from the confusion matrix, discussed on Wikipedia, including:

  • Accuracy
  • Positive Predictive Value (PPV)
  • Negative Predictive Value (NPV)
  • Precision
  • Recall
  • Sensitivity
  • Specificity
  • Etc, etc, etc

So now let's look at our confusion matrix. In our code we use np.round(pred, 0), which rounds the prediction score to either 0 or 1. This is important because the confusion matrix compares the classifications: did the person survive or not. By doing this we assume that any prediction over 0.5 means they survived and anything below means they did not. 0.5 is the common cut-off; however, the optimal cut-off can be calculated using the Youden J index (I will create a tutorial at some point).
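A sketch of this step; since the Titanic predictions aren't reproduced here, I've used toy stand-in probabilities:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy stand-ins for the model output 'pred' and the true outcomes y
pred = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 0, 1])

# Round to 0/1: anything over 0.5 counts as 'survived'
cm = confusion_matrix(y, np.round(pred, 0))
print(cm)
```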

This should give us this…

This is not pretty, and you need to be careful when interpreting it. Our first job is to identify which are the true positives and which are the true negatives, then the false positives and the false negatives.

A nicer way of viewing this is…
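One way to get a labelled version (I'm assuming pandas' crosstab here; the original code isn't shown) is, with the same toy stand-in values:

```python
import numpy as np
import pandas as pd

pred = np.array([0.9, 0.2, 0.7, 0.4])  # toy stand-in probabilities
y = np.array([1, 0, 0, 1])

# Cross-tabulate actual vs predicted with readable labels
table = pd.crosstab(y, np.round(pred, 0),
                    rownames=['Actual'], colnames=['Predicted'])
print(table)
```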

This gives us…

Titanic Confusion Matrix

Here we can see the performance of our model as…

  • correctly identifying 210 people who survived
  • correctly identifying 363 people who did not survive
  • incorrectly predicting that 61 people would survive when they actually died
  • incorrectly predicting that 80 people would die when they actually survived

Next we want to find out how accurate the model was. We can calculate this manually, or we can quickly run this code…
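A sketch of the accuracy calculation, again with toy stand-in values rather than the real Titanic predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score

pred = np.array([0.9, 0.2, 0.7, 0.4])  # toy stand-in probabilities
y = np.array([1, 0, 1, 1])

# Accuracy of the rounded (0/1) predictions
print(accuracy_score(y, np.round(pred, 0)))
```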

My model has an accuracy of 0.80252100…. meaning that the model has an accuracy of 80%.

ROC Plot

The Receiver Operator Characteristic (ROC) plot is a popular method of presenting the performance of a classifier. For this to work your predictions need to be on a scale of 0 to 1, and not just 0's or 1's. The plot shows the trade-off between the sensitivity and specificity of the model as the threshold changes. These are also referred to as the 'false positive rate' (FPR) and the 'true positive rate' (TPR).

To produce this we need to calculate the TPR and FPR at different thresholds, SciKit Learn does this for us…
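The roc_curve function does the threshold sweep for us; a sketch with toy stand-in values:

```python
import numpy as np
from sklearn.metrics import roc_curve

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])    # toy true outcomes
pred = np.array([0.9, 0.3, 0.8, 0.4,      # toy predicted probabilities
                 0.2, 0.6, 0.7, 0.5])

# TPR and FPR at each threshold
fpr, tpr, thresholds = roc_curve(y, pred)
```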

Next we simply need to plot the FPR and TPR.

This gives us…

Titanic ROC Plot

Note: FPR is 1 – specificity. Sensitivity and specificity are more commonly known than TPR and FPR.

The plot shows a smooth curve, which is good. It shows that we can adjust the threshold to increase sensitivity at the cost of specificity, and vice versa. An example can be seen…

  • If we have a sensitivity of 0.8 we have a specificity of about 0.78 (1 – 0.22)
  • If we change the threshold to increase sensitivity to 0.9 we have a specificity of around 0.4 (1 – 0.6).

From this we can also calculate the Area Under the Curve (AUC). This is a good measure of performance. An AUC of 0.5 means the model is not very good; it is no better than a 50/50 guess. If we have an AUC of less than 0.5 then something went wrong. When you read clinical papers looking at predicting life/death or illness/no-illness, an AUC greater than 0.7 is good and an AUC greater than 0.8 is very good.

We calculate ours using…
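The calculation feeds the FPR and TPR from the ROC step into auc; a sketch with the same toy values:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
pred = np.array([0.9, 0.3, 0.8, 0.4, 0.2, 0.6, 0.7, 0.5])

fpr, tpr, _ = roc_curve(y, pred)

# Area under the ROC curve
print(auc(fpr, tpr))
```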

Note: we use the TPR and FPR from the ROC plot.

This gives us an AUC of 0.86328887….. or 86%, which is pretty good.

Titanic 2 – Logistic Regression

So now that we have cleaned the data as outlined previously, we can build our first model. Here we are going to develop a logistic regression (LR) model using the cleaned data.

One of the main advantages of LR models is that they give a clear explanation of how the variables influence the outcome, in our case the likelihood of survival. For example, as age increases, how does the risk change? This comes in the form of a coefficient. The main limitation of LR models, from a predictive modelling point of view, is that they are linear models. This means they cannot model complex relationships between variables and outcomes.

Let's get started.

Load the packages

For this we are going to need a couple of packages, all of these should be included in the Anaconda Python distribution.

We also set some Pandas options as before along with telling Matplotlib to use inline to display the plots in the notebook.
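A minimal sketch of this set-up (the option values match the earlier posts):

```python
# Packages for this part (all included with Anaconda)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# %matplotlib inline   # when working in a Jupyter Notebook

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 101)
```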

Load the data

This is part of a bigger project so I am hoping you have worked through the first part of the project. If so you should have a good understanding of what we have done to the data. If not, you can get the data Here (Titanic_Clean.csv).

Next we load the data.

If you get an error make sure you have the datafile in the Data folder.

Modifying the Data

We need to make a few more changes to the data to make it suitable for the LR model. First we need to drop a couple of features, and second we need to add an intercept.

Dropping some features

Previously we converted Sex and Embarked to dummy variables. For logistic regression to work correctly we need to make one Sex and one Embarked variable the 'reference' variable. For example, if we make 'male' the reference variable, then if 'Female' is zero the model is for males. If 'Female' is one, the model shows the change in risk compared to males, if all other variables are equal.

In software like SPSS you would state which variable is the reference variable, in R it would use the first one. In our example we simply drop the reference variable.

Here we drop the male variable and the variable for boarding in Cherbourg. So if all the other Embarked variables are zero and the female variable is zero, then the prediction is for males boarding in Cherbourg.


Next we need to add an intercept column. This is a new column with all the values set to one. The intercept is the predicted outcome if all the other variables are zero. Here I’ve called it ‘_intercept’.
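The two steps could be sketched like this; the dummy column names are my assumptions, and the tiny DataFrame is a stand-in for the cleaned Titanic data:

```python
import pandas as pd

# Toy stand-in for the cleaned Titanic dataframe
df = pd.DataFrame({'Survived': [0, 1, 1],
                   'Sex_male': [1, 0, 0], 'Sex_female': [0, 1, 1],
                   'Embarked_C': [1, 0, 0], 'Embarked_Q': [0, 1, 0],
                   'Embarked_S': [0, 0, 1]})

# Drop the reference variables: male, and boarding at Cherbourg
df = df.drop(['Sex_male', 'Embarked_C'], axis=1)

# Add the intercept: a new column with every value set to one
df['_intercept'] = 1
```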

Split the data into x and y

The last thing we will do with the data is split it into two parts; it just makes things easier later on. Here we split off the dependent variable (Survived) as y and the independent variables (everything else) as x. We can then predict y (survived) based on x (the variables).
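A sketch of the split, using a toy stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1, 1],        # toy stand-in data
                   'Sex_female': [0, 1, 1],
                   '_intercept': [1, 1, 1]})

y = df['Survived']               # the outcome we want to predict
x = df.drop('Survived', axis=1)  # everything else
```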

You can view them by simply running x or y.

The Logistic Regression Model

Wow, after so much playing with the data we can finally build a model. That’s data science for you, 80% cleaning data, 20% model building.

That was easy 🙂

Let's view the model…

And you should see this…

Titanic 1 – Data Exploration

In this section we are going to explore the dataset. This is an important step in the statistical/machine learning process as firstly we need to know more information about the data we are using and secondly we need to make a few alterations to the data itself.

This tutorial is part of the Titanic Project, so you need to read that first!

Make sure you have set up your workspace and grabbed the data, as explained here.

Next you will need to create a new notebook file in the titanic folder. I’ve called mine ‘Data Clean and Explore’.

Load the packages we need

For this part of the project we are going to need the following…

  • Pandas – for loading and storing the data

Now lets load them…
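A sketch of the loading lines described below:

```python
import pandas as pd

# Limit the rows displayed to 10, increase the columns to 101
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 101)
```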

I've added a couple of extra lines; pd.set_option sets some pandas options (as I type this I realise how obvious that is). We are limiting the number of rows displayed to 10 but increasing the number of columns to 101. I find it more useful to see a lot of columns.

Load the data

So now we need to load the data. I always try to name my dataframes df. It's easier to write, most commonly used in guides on the internet, etc. You have to think a little harder when you have multiple dataframes, e.g. dfFood, dfPeople.

If this doesn’t work you either didn’t set up the folders correctly or you didn’t download the file correctly, or a bit of both.

You should be able to see all of the columns and 10 rows of data. The data means….

  • survival – Survival (0 = No; 1 = Yes)
  • pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • name – Name
  • sex – Sex
  • age – Age
  • sibsp – Number of Siblings/Spouses Aboard
  • parch – Number of Parents/Children Aboard
  • ticket – Ticket Number
  • fare – Passenger Fare
  • cabin – Cabin
  • embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

What does all this tell us? First we need to know the question or focus of the work: we are building a predictive model. We are also interested in finding out which features influenced survival chances. An example can be seen when asking, 'are men less likely to survive than women?' Remember this was all before the Equality Act 2010, so 'women and children first'!

So in our work we are going to need to make some changes to the data, and delete some columns.

Data cleaning and manipulation

Quickly view the feature names.

Next we are going to drop ‘PassengerId’, ‘Name’ and ‘Ticket’. The reasons for these are…

  • PassengerId – this is a unique incrementing number, it should not influence the ‘risk’
  • Name – I don’t really want to know if people called Bob are more likely to survive than someone called Dave, plus we don’t have enough data to support this.
  • Ticket – Looks messy, and for similar reasons to PassengerId

So we drop them…
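A sketch of the drop, using a toy stand-in with just the relevant columns:

```python
import pandas as pd

# Toy stand-in for the Titanic dataframe
df = pd.DataFrame({'PassengerId': [1, 2], 'Name': ['Bob', 'Dave'],
                   'Ticket': ['A/5 21171', 'PC 17599'],
                   'Survived': [0, 1]})

# Drop the columns we don't want
df = df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
```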

Next we need to look at the Cabin. We are going to be lazy in our example and change it to a yes/no variable: did they have a cabin or not. A better way of doing this would be to split the cabin value and take the first letter, as this usually represents which deck the cabin was on. That could be useful; this can be your homework 😉

Just to test, we can see which passengers did not have a cabin by using this line…

We can see that there are 687 people who did not have a cabin.

Next let's create a new variable called 'HasCabin'. This will initially be set to NA, then set to 1 for having a cabin and 0 for not having one.
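A sketch of this step, with a toy stand-in Cabin column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cabin': ['C85', np.nan, np.nan, 'E46']})  # toy stand-in

# Start with NA, then 1 for having a cabin, 0 for not
df['HasCabin'] = np.nan
df.loc[df['Cabin'].notnull(), 'HasCabin'] = 1
df.loc[df['Cabin'].isnull(), 'HasCabin'] = 0
```

The counts are then just `df['HasCabin'].value_counts()`.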

We can see the counts using…

This shows 687 who do not have a cabin and 204 who do.

Next we drop the original Cabin column..

We should see our dataset looking nice and clean, but there is still more to do.

Create the dummy variables

Yes, creating dummy variables is a real term, I didn’t just make it up. Dummy variables essentially change categorical variables into numbers. Statistical models and machine learning models don’t like text. (I’ll try to write a short post on ordinal and nominal categorical variables sometime).

We need to create dummy variables for Embarked and Sex. This is easy with Pandas.

Now we just need to join them to the df dataframe.
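The create-and-join steps could be sketched like this (prefix names are my assumptions, on a toy stand-in DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female'],
                   'Embarked': ['S', 'C']})  # toy stand-in

# Create the dummy variables
dummySex = pd.get_dummies(df['Sex'], prefix='Sex')
dummyEmbarked = pd.get_dummies(df['Embarked'], prefix='Embarked')

# Join them back onto the df dataframe
df = df.join([dummySex, dummyEmbarked])
```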

Now we should see the dataset with 891 rows and 14 columns.

Next we need to drop the original columns.

Only keep complete cases

Another thing that statistical and machine learning models don't like is missing values, so we will only keep the rows which are complete.
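The complete-cases step is a one-liner; a sketch with a toy stand-in:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                   'Survived': [0, 1, 1]})  # toy stand-in

# Keep only the complete cases (rows with no missing values)
df = df.dropna()
```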

We have now gone from 891 rows to 714 rows.

Save the data

We are going to use this data for the next parts, so we need to save the data out.
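A sketch of the save (with a toy stand-in DataFrame, and creating the Data folder in case you haven't yet):

```python
import os
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1]})  # stand-in for the cleaned data

os.makedirs('Data', exist_ok=True)  # the Data folder from the project set-up
df.to_csv('Data/Titanic_Clean.csv', index=False)
```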

We’ve saved the dataframe as Titanic_Clean.csv in the Data folder.


See Part 2 – Visualisation or Part 3 – Logistic Regression.

Titanic Project

Here we look at different predictive models using Python for data science and machine learning. The aim is to run through some different methods to develop a basic understanding of the workflow for producing different explanatory and predictive models.

Setup the workspace and get the data

Like all of my projects, first you need to set up a root folder. So create a folder called 'Titanic'; this folder should be used for all your .ipynb files (if you are using Jupyter Notebook) or .py files (if you are just using a Python IDE). Next, in the Titanic folder create a folder called 'Data'; this is where we will store the data file.

The data we are using here comes from Kaggle; you can read all about it here. For this work you should download this version, which is exactly the same but renamed to Titanic.csv. Download the file to the Data folder.

Next you can have a look at the following parts…

Essential Python Packages

Here is a list of my essential packages you will need for data science with Python. I will expand the list as I go.

Anaconda Python

This is my favourite flavour of Python. It contains everything you need for data analysis and machine learning. The main things I use which are included are…

  • Jupyter Notebook
  • SciKit-Learn
  • Matplotlib
  • Pandas
  • Numpy

You can get it here.

Seaborn: statistical data visualization

This is a fantastic package for visualising your data. It has a good mix of some statistical features too. Have a look here.

To install it you need to open command prompt/terminal and run this (required PIP)…
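The install command (assuming pip is available) is just:

```shell
pip install seaborn
```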


Basic SIFT in Python

I’ve been having a quick play with Scale-Invariant Feature Transform (SIFT) in Python. I had a few problems installing it (see here). In short SIFT finds the features of an image, a more detailed explanation can be seen here.

In this example we take a picture of the City Hall in Hull and run it through a simple implementation of SIFT to extract the features of the image.

We’ll use this image, mine is called ‘Proj4_img00000021.jpg’. Click on the image to get the full size version


First let's bring in the packages we need. We don't really need Numpy or Pandas, but it's just habit.

Next we need to load the image and display it. Remember, the file should be in the same folder as your Python or Notebook script.

This should show this…


Now let's convert it to grayscale.

Giving us this (the colours are from plt.imshow(); trust me, it's grayscale).


Now the SIFT bits….
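The whole pipeline could be sketched like this. Note the hedge: on my install SIFT lived in cv2.xfeatures2d, but depending on your OpenCV version it may be cv2.SIFT_create() instead; the filename comes from the post.

```python
import cv2

# Load the image and convert it to grayscale
img = cv2.imread('Proj4_img00000021.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Create the SIFT detector (newer OpenCV builds use cv2.SIFT_create())
sift = cv2.xfeatures2d.SIFT_create()

# Detect the keypoints and draw them on the image
kp = sift.detect(gray, None)
out = cv2.drawKeypoints(gray, kp, None)  # the third parameter must be given

cv2.imwrite('sift_keypoints.jpg', out)
```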


Some examples online use cv2.SIFT() instead of cv2.xfeatures2d.SIFT_create(). Apparently SIFT_create() was moved to xfeatures2d because of a patent issue; I don't know if this is true or not, I just know I had to use cv2.xfeatures2d.SIFT_create(). Also cv2.drawKeypoints(gray, kp, None, ….) insisted on a third parameter, so I just passed it None, which in Python is the equivalent of null.

When you open up the image you should see this…


As you can see, all of the key points have been detected.

Installing OpenCV on Mac (Maybe Windows too)

I just thought I’d have a play about with OpenCV (Python Version) but had a few problems installing it on my Mac. I would assume the same problem occurs on Windows too as the problem seems to be installing it simply with Anaconda Python.

So to start, I am using Anaconda Python and I don’t know what the side effects of this will be (i.e. it might update some other things which can cause you a problem).

First install OpenCV by running this in the terminal window….

conda install -c https://conda.anaconda.org/menpo opencv3

Once I ran this I tried 'import cv2' in a Python script but it failed, giving this error…

libopencv_hdf.3.1.dylib requires version 12.0.0 or later, etc…

So next I updated h5py by reinstalling it using this…

conda install h5py

It worked…!

Installing Theano with Anaconda Python (Win)

Before we start: yes, I know this is not the official image for “Theano: A CPU and GPU Math Expression Compiler”, but they didn't have a cool logo; in fact they don't have a logo at all.

The Problem

I am sure I read somewhere that Anaconda Python (Download) comes with Theano, but I might be mistaken. All I know is it wasn't a quick install when I wanted to use it, and I was continuously getting this error which suggested it wasn't installed…

ImportError Traceback (most recent call last)
<ipython-input-3-de45f36b45a8> in <module>()
----> 1 from theano import *

ImportError: No module named theano

Well here is how I fixed it.

NOTE: I am using Windows 8 64bit, Python 2.7.10 | Anaconda 2.4.0 (64-bit)

The Fix

  1. Install the Anaconda Python distribution; make sure you don't have several versions installed.
  2. Install the prerequisites for Theano and Anaconda: open up command prompt and run 'conda install mingw libpython'.
  3. Download the latest version of Theano (here) and extract it somewhere.
  4. In command prompt, navigate to the folder where you extracted Theano and run 'python setup.py develop'.
  5. Restart your Python terminal/instance and try to re-import Theano with 'from theano import *'.
  6. Enjoy using Theano.

You might need to do some extra things, here is the official link.

EDIT: I had to install Theano on a different Windows PC and didn't have to do steps 3 and 4.

K Means Clustering in Python

K Means clustering is an unsupervised machine learning algorithm. An example of a supervised learning algorithm can be seen in neural networks, where the learning process involves both the inputs (x) and the outputs (y). During the learning process the error between the predicted outcome (predY) and the actual outcome (y) is used to train the system. In an unsupervised method such as K Means clustering, the outcome (y) variable is not used in the training process.

In this example we look at using the IRIS dataset and cover:

  • Importing the sample IRIS dataset
  • Converting the dataset to a Pandas Dataframe
  • Visualising the classifications using scatter plots
  • Simple performance metrics


Requirements: I am using Anaconda Python Distribution which has everything you need including Pandas, NumPy, Matplotlib and importantly SciKit-Learn. I am also using iPython Notebook but you can use whatever IDE you want.

Setup the environment and load the data

Bring in the libraries you need.

Next load the data. Scikit Learn has some sample datasets; having imported datasets we can use the following.
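A sketch of the loading step:

```python
from sklearn import datasets

# Load the sample IRIS dataset
iris = datasets.load_iris()

# iris.data holds the inputs, iris.target holds the labels
```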

You can view the data running each line individually.

I like to work with Pandas Dataframes, so we will convert the data into that format. Note that we have separated out the inputs (x) and the outputs/labels (y).
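The conversion could be sketched like this (the column names are my choice):

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Separate the inputs (x) from the outputs/labels (y)
x = pd.DataFrame(iris.data, columns=['Sepal_Length', 'Sepal_Width',
                                     'Petal_Length', 'Petal_Width'])
y = pd.DataFrame(iris.target, columns=['Targets'])
```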

Visualise the data

It is always important to have a look at the data. We will do this by plotting two scatter plots. One looking at the Sepal values and another looking at Petal. We will also set it to use some colours so it is clearer.


Build the K Means Model

This is the easy part, providing you have the data in the correct format (which we do). Here we only need two lines. First we create the model and specify the number of clusters the model should find (n_clusters=3) next we fit the model to the data.
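A sketch of those two lines:

```python
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
x = iris.data

# Create the model, telling it to find 3 clusters, then fit it to the data
model = KMeans(n_clusters=3)
model.fit(x)
```

The assigned classes can then be viewed with `model.labels_`.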

Next we can view the results. These are the classes that the model decided on; remember this is unsupervised, so it classified them purely based on the data.

You should see something like this…

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0,
2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

Visualise the classifier results

Let's plot the actual classes against the predicted classes from the K Means model.

Here we are plotting the Petal Length and Width; however, each plot changes the colours of the points, using c=colormap[y.Targets] for the original classes and c=colormap[model.labels_] for the predicted classes.

The result is….


Ignore the colours (at the moment). Because the model is unsupervised, it did not know which label (class 0, 1 or 2) to assign to each cluster.

The Fix

Here we are going to change the class labels. We are not changing any of the classification groups; we are simply giving each group the correct number. We need to do this for measuring the performance.

Using the code below, we use np.choose() to assign new values; basically we change the 1's in the predicted values to 0's and the 0's to 1's. Class 2 matched, so we can leave it. By running the two print functions you can see that all we have done is swap the values.
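A sketch of the relabelling, using a toy stand-in for model.labels_:

```python
import numpy as np

labels = np.array([0, 1, 2, 1, 0])  # toy stand-in for model.labels_

# np.choose picks [1, 0, 2][v] for each label v: swap 0s and 1s, keep 2s
predY = np.choose(labels, [1, 0, 2]).astype(np.int64)

print(labels)
print(predY)
```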

NOTE: your results might be different to mine, if so you will have to figure out which class matches which and adjust the order of the values in the np.choose() function.


Now we can re-plot the data as before, but using predY instead of model.labels_.


Now we can see that the K Means classifier has identified one class correctly (red), but some blacks have been classed as greens and vice versa.

Performance Measures

There are a number of ways in which we can measure a classifier's performance. Here we will calculate the accuracy and also the confusion matrix.

We need two values: y, which holds the true (original) classes, and predY, which holds the model's classes.
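A sketch of the accuracy calculation; I've used toy stand-in arrays here rather than the real iris results:

```python
import numpy as np
import sklearn.metrics as sm

y = np.array([0, 0, 1, 1, 2, 2])      # toy true classes
predY = np.array([0, 0, 1, 2, 2, 2])  # toy predicted classes

print(sm.accuracy_score(y, predY))
```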


My result was 0.89333333333333331, so we can say that the model has an accuracy of 89.3%. Not bad considering the model was unsupervised.

Confusion Matrix
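A sketch of the confusion matrix call, with the same toy stand-in arrays:

```python
import numpy as np
import sklearn.metrics as sm

y = np.array([0, 0, 1, 1, 2, 2])      # toy true classes
predY = np.array([0, 0, 1, 2, 2, 2])  # toy predicted classes

print(sm.confusion_matrix(y, predY))
```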

My results are…

array([[50, 0, 0],
[ 0, 48, 2],
[ 0, 14, 36]])

Hopefully the table below will render correctly, but we can summarise the confusion matrix as shown below:

  • correctly identified all 50 class 0's as 0's
  • correctly classified 48 class 1's but misclassified 2 class 1's as class 2
  • correctly classified 36 class 2's but misclassified 14 class 2's as class 1

                 Predicted Class
                 0    1    2
Real Class  0    50   0    0
            1    0    48   2
            2    0    14   36

The confusion matrix also allows for a wider range of performance metrics to be calculated, see here for more details: https://en.wikipedia.org/wiki/Confusion_matrix.


A Guide to using R – FREE eBook

I wrote a guide on using R and promised to release it once it had been approved. Anyways, here it is.

It focuses on the basics of R to get you started and includes…

  • The Basics
  • Entering Data
  • Selecting Data
  • Installing packages
  • Plots
  • Statistical Methods (basic)
  • Calling other scripts
  • Loops
  • If Statement
  • Creating a Function

Either click the huge download button or click Guide To Using R.

Please reference as…

Stamford, J. (2015). Guide to using R. 1st ed. [ebook] Hull. Available at: http://stamfordresearch.com/a-guide-to-using-r-free-ebook/ [Accessed 8 Sep. 2015].






MSc Project – Atari Game State Representation using CNNs


I was recently asked to guest lecture at De Montfort University to the MSc students studying Intelligent Systems and Robotics, and Business Intelligence Systems and Data Mining. The lecture was on my MSc project, which I completed at the university last year.

The project was around the use of Convolutional Neural Networks, with my work focusing on published work from DeepMind, specifically “Playing Atari with Deep Reinforcement Learning” (2013). You can view a Google blog post focusing on the 2015 publication here.

Below is a copy of the slides I presented. They are slightly modified from my dissertation viva as I wanted to express to the students the steps taken throughout the project.

If you can’t see the slide below click here.


Python Perceptron (Re-visited)

I feel like I've been staring at RStudio for way too long, so I've decided to give Python (SciKit-Learn) another go. I really recommend Anaconda Python for this; it contains everything you need for scientific Python coding, including…

  • Python 2.7 (3.x is available)
  • SciKit-Learn – everything machine learning related
  • Pandas – DataFrames (all kinds of dataframe stuff)
  • Matplotlib – the plotting library
  • iPython Notebook – a web-based IDE, I like using this
  • Spyder – a nice IDE, I'm still getting to grips with it

Seriously, just get Anaconda Python, it is FREE.

I have done a previous post on the exact same problem; however, this version uses DataFrames and is hopefully a little neater. There are some tweaks to plotting the hyperplane / decision boundary.

Let's get started

 Load in the data

Wow look at the DataFrame in action. We have 3 columns, A, B and Targets. A and B are just the input values. The target is a dichotomous value of 0 or 1, this could represent No or Yes, Product A or Product B, Dead or Alive, etc.

Plot the data (optional)

I always like to plot the data; I think it's good practice to see what you are doing.

You should see this. If not, you might have forgotten the inline thing (above) or your install of Python is missing something.

Perceptron 1

Plotted data

Build the model and train
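The build-and-train step could be sketched like this. The data values are hypothetical stand-ins for the post's DataFrame (which has columns A, B and Targets):

```python
import pandas as pd
from sklearn.linear_model import Perceptron

# Hypothetical stand-in data: two well-separated groups
df = pd.DataFrame({'A': [0.1, 0.4, 0.2, 3.1, 3.5, 4.0],
                   'B': [0.3, 0.2, 0.5, 3.4, 3.0, 3.8],
                   'Targets': [0, 0, 0, 1, 1, 1]})

# Build the perceptron and train it on the inputs A and B
net = Perceptron(random_state=0)
net.fit(df[['A', 'B']], df['Targets'])
```

The coefficients can then be viewed with `net.coef_` and `net.intercept_`.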

View the coefficients (optional)

I like to see what is going on

Plot the hyperplane / decision boundary

You should see this…

Perceptron 2

A scatter graph showing the data and the hyperplane. Everything to the right is a 1, everything to the left is a 0.

Using the system to make a prediction (and a confusion matrix)

Really we should be passing different data to it here, but we can see the code to use the perceptron to predict the outcome based only on the inputs (in our case A and B).
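A sketch of the predict-and-evaluate step, using the same hypothetical stand-in data:

```python
import pandas as pd
from sklearn.linear_model import Perceptron
from sklearn.metrics import confusion_matrix

df = pd.DataFrame({'A': [0.1, 0.4, 0.2, 3.1, 3.5, 4.0],   # hypothetical data
                   'B': [0.3, 0.2, 0.5, 3.4, 3.0, 3.8],
                   'Targets': [0, 0, 0, 1, 1, 1]})

net = Perceptron(random_state=0)
net.fit(df[['A', 'B']], df['Targets'])

# Predict from the inputs only (really this should be unseen data)
pred = net.predict(df[['A', 'B']])

print(confusion_matrix(df['Targets'], pred))
```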

The code also outputs a confusion matrix, it looks horrible.


Try this data to…

  • Use in the prediction of this model, how well does the system perform?
  • Rebuild the full model using this data
    • See how the hyperplane has moved?


Basic Imputation in R

Impin' ain't easy

Hello and welcome to another R Stats adventure (using the Stampy Longnose voice). Today we're looking at imputation, or the guessing/estimation of missing values in data. Be warned: once you impute data you bias your findings. Motivation: in my main data set I started with 6,000 records, reduced to 800 (selection criteria), and of these 400 had missing values. So my 6,000 record data set became worth only 400.

Read more
