Outlier removal in Python using IQR rule

My previous post ‘Outlier removal in R using IQR rule‘ has been one of the most visited posts on here. So now let's have a look at it in Python. This time we'll be using Pandas and NumPy, along with the Titanic dataset. We will also do a little extra thing – log transform the data.
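For a flavour of the approach, here is a minimal sketch (the 'titanic.csv' file name and the 'Fare' column are just for illustration; the full walkthrough is behind the Read more link):

import numpy as np
import pandas as pd

# Load the Titanic data (assuming a local 'titanic.csv' with a 'Fare' column)
df = pd.read_csv('titanic.csv')

# The little extra: log transform the skewed fares (log1p copes with zero fares)
df['LogFare'] = np.log1p(df['Fare'])

# IQR rule: anything more than 1.5 * IQR outside the quartiles is an outlier
q1, q3 = df['LogFare'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows inside the fences
df_clean = df[df['LogFare'].between(lower, upper)]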

Read more

Basic SIFT in Python

I've been having a quick play with Scale-Invariant Feature Transform (SIFT) in Python. I had a few problems installing it (see here). In short, SIFT finds the features of an image; a more detailed explanation can be seen here.

In this example we take a picture of the City Hall in Hull and run it through a simple implementation of SIFT to extract the features of the image.

We'll use this image; mine is called 'Proj4_img00000021.jpg'. Click on the image to get the full-size version.

[Image: Proj4_img00000021]

First, let's bring in the packages we need. We don't really need NumPy or Pandas here, but it's just habit.
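Roughly, the imports look like this (a sketch of what I used):

import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline   <- add this line if you are working in a Notebook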

Next we need to load the image and display it. Remember, the file should be in the same folder as your Python script or Notebook.
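Something along these lines should do it (a sketch; the filename matches the image above):

# Load the image (keep the file in the same folder as this script/notebook)
img = cv2.imread('Proj4_img00000021.jpg')

# Display it (OpenCV reads images as BGR, so the colours can look a little off in matplotlib)
plt.imshow(img)
plt.show()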

This should show this…

[Image: OpenCV_Original]

Now let's convert it to grayscale.
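For example (a sketch):

# Convert the loaded BGR image to a single-channel grayscale image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

plt.imshow(gray)
plt.show()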

Giving us this (the colours are from plt.imshow(); trust me, it's grayscale).

[Image: OpenCV_gray]

Now the SIFT bits….
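A minimal sketch of what I ran (the output filename 'sift_keypoints.jpg' is just an example):

# Create the SIFT detector (see the notes below about cv2.xfeatures2d)
sift = cv2.xfeatures2d.SIFT_create()

# Detect the keypoints on the grayscale image
kp = sift.detect(gray, None)

# Draw the keypoints and write the result out to a file
img_kp = cv2.drawKeypoints(gray, kp, None)
cv2.imwrite('sift_keypoints.jpg', img_kp)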

Notes

Some examples online use cv2.SIFT() instead of cv2.xfeatures2d.SIFT_create(). Apparently SIFT_create() was moved to xfeatures2d because of a patent issue; I don't know if this is true or not, I just know I had to use cv2.xfeatures2d.SIFT_create(). Also, cv2.drawKeypoints(gray, kp, None, ….) insisted on a third parameter, so I just passed it None, which in Python is the equivalent of null.

When you open up the image you should see this…

[Image: OpenCV_SIFT_KP]

As you can see, all of the key points have been detected.

Installing Theano with Anaconda Python (Win)

Before we start, yes, I know this is not the official image for "Theano: A CPU and GPU Math Expression Compiler", but they didn't have a cool logo; in fact, they don't have a logo at all.

The Problem

I am sure I read somewhere that Anaconda Python (Download) comes with Theano. But I might be mistaken; all I know is it wasn't a quick install when I wanted to use it, and I kept getting this error, which suggested it wasn't installed…

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-3-de45f36b45a8> in <module>()
----> 1 from theano import *

ImportError: No module named theano

Well here is how I fixed it.

NOTE: I am using Windows 8 64bit, Python 2.7.10 | Anaconda 2.4.0 (64-bit)

The Fix

  1. Install the Anaconda Python distribution and make sure you don't have several versions installed.
  2. Install the prerequisites for Theano and Anaconda: open up a command prompt and run 'conda install mingw libpython'.
  3. Download the latest version of Theano (here) and extract it somewhere.
  4. In the command prompt, navigate to the folder where you extracted Theano and run 'python setup.py develop'.
  5. Restart your Python terminal/instance and try to re-import Theano with 'from theano import *'.
  6. Enjoy using Theano.

You might need to do some extra things, here is the official link.

EDIT: I had to install Theano on a different Windows PC and didn’t have to do steps 3 and 4.

K Means Clustering in Python

K Means clustering is an unsupervised machine learning algorithm. An example of a supervised learning algorithm can be seen with Neural Networks, where the learning process involves both the inputs (x) and the outputs (y): during training, the error between the predicted outcome (predY) and the actual outcome (y) is used to train the system. In an unsupervised method such as K Means clustering, the outcome (y) variable is not used in the training process.

In this example we look at using the IRIS dataset and cover:

  • Importing the sample IRIS dataset
  • Converting the dataset to a Pandas Dataframe
  • Visualising the classifications using scatter plots
  • Simple performance metrics

 

Requirements: I am using Anaconda Python Distribution which has everything you need including Pandas, NumPy, Matplotlib and importantly SciKit-Learn. I am also using iPython Notebook but you can use whatever IDE you want.

Setup the environment and load the data

Bring in the libraries you need.
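Roughly, the imports look like this (a sketch of what I used; sklearn.metrics is brought in as sm for the performance measures later):

from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt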

Next, load the data. Scikit-Learn comes with some sample datasets; since we imported datasets above, we can use the following.
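# Load the built-in iris sample dataset
iris = datasets.load_iris()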

You can view the data by running each line individually.
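For example:

iris.data            # the four measurements for each flower (the inputs)
iris.feature_names   # the names of those measurements
iris.target          # the class of each flower as 0, 1 or 2 (the outputs)
iris.target_names    # the names of the three classes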

I like to work with Pandas Dataframes, so we will convert the data into that format. Note that we have separated out the inputs (x) and the outputs/labels (y).
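A sketch of the conversion (the column names are my own choice, picked so the plotting code below reads nicely):

# Inputs (x) and outputs/labels (y) as separate DataFrames
x = pd.DataFrame(iris.data, columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
y = pd.DataFrame(iris.target, columns=['Targets'])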

Visualise the data

It is always important to have a look at the data. We will do this by plotting two scatter plots: one looking at the Sepal values and another looking at the Petal values. We will also set it to use some colours so it is clearer.
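Something along these lines (a sketch; the colormap array and figure size are just my choices):

# A simple colour map: one colour per class
colormap = np.array(['red', 'lime', 'black'])

plt.figure(figsize=(14, 7))

# Sepal length vs width, coloured by the real class
plt.subplot(1, 2, 1)
plt.scatter(x.Sepal_Length, x.Sepal_Width, c=colormap[y.Targets], s=40)
plt.title('Sepal')

# Petal length vs width, coloured by the real class
plt.subplot(1, 2, 2)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Petal')

plt.show()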

[Image: Iris_Original]

Build the K Means Model

This is the easy part, providing you have the data in the correct format (which we do). Here we only need two lines: first we create the model and specify the number of clusters the model should find (n_clusters=3), then we fit the model to the data.
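A sketch:

# Create the model, telling it to look for 3 clusters, then fit it to the inputs only (no y)
model = KMeans(n_clusters=3)
model.fit(x)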

Next we can view the results. These are the classes the model decided on; remember, this is unsupervised, so it classified these purely based on the data.
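For example:

# The cluster each of the 150 observations was assigned to
model.labels_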

You should see something like this…

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0,
2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

Visualise the classifier results

Let's plot the actual classes against the predicted classes from the K Means model.

Here we are plotting the Petal Length and Width; however, each plot changes the colours of the points using either c=colormap[y.Targets] for the original classes or c=colormap[model.labels_] for the predicted classes.
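Something like this (a sketch, reusing the colormap from earlier):

plt.figure(figsize=(14, 7))

# Left: the real classes
plt.subplot(1, 2, 1)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Real Classification')

# Right: the clusters the K Means model found
plt.subplot(1, 2, 2)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[model.labels_], s=40)
plt.title('K Means Classification')

plt.show()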

The result is….

[Image: Iris_Model1]

Ignore the colours (at the moment). Because the model is unsupervised, it did not know which label (class 0, 1 or 2) to assign to each cluster.

The Fix

Here we are going to change the class labels. We are not changing any of the classification groups; we are simply giving each group the correct number. We need to do this before measuring the performance.

Using the code below, we use np.choose() to assign new values: basically we are changing the 1's in the predicted values to 0's and the 0's to 1's. Class 2 matched, so we can leave it. By running the two print functions you can see that all we have done is swap the values.
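A sketch of that fix (the [1, 0, 2] ordering matched my run; see the note below):

# Swap the 0 and 1 labels so the cluster numbers line up with the real classes
predY = np.choose(model.labels_, [1, 0, 2]).astype(np.int64)

print(model.labels_)
print(predY)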

NOTE: your results might be different to mine; if so, you will have to figure out which class matches which and adjust the order of the values in the np.choose() function.

Re-plot

Now we can re-plot the data as before, but using predY instead of model.labels_.
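For example (only the right-hand plot changes):

plt.figure(figsize=(14, 7))

plt.subplot(1, 2, 1)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Real Classification')

# Same plot as before but coloured with the corrected labels
plt.subplot(1, 2, 2)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[predY], s=40)
plt.title('K Means Classification')

plt.show()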

[Image: Iris_Model_Corrected]

Now we can see that the K Means classifier has identified one class correctly (red), but some blacks have been classed as greens and vice versa.

Performance Measures

There are a number of ways in which we can measure a classifier's performance. Here we will calculate the accuracy and also the confusion matrix.

We need two values: y, which holds the true (original) values, and predY, which holds the model's values.

Accuracy
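For example (sm here is sklearn.metrics, brought in with the imports at the top):

# Fraction of observations where the corrected cluster label matches the real class
sm.accuracy_score(y, predY)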

My result was 0.89333333333333331, so we can say that the model has an accuracy of 89.3%. Not bad considering the model was unsupervised.

Confusion Matrix
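And the confusion matrix, again via sklearn.metrics:

# Rows are the real classes, columns are the predicted classes
sm.confusion_matrix(y, predY)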

My results are…

array([[50,  0,  0],
       [ 0, 48,  2],
       [ 0, 14, 36]])

Hopefully the table below will render correctly, but we can summarise the confusion matrix as shown below:

  • correctly identified all 50 class 0's as class 0
  • correctly classified 48 class 1's but misclassified 2 class 1's as class 2
  • correctly classified 36 class 2's but misclassified 14 class 2's as class 1

                   Predicted Class
                    0     1     2
  Real Class  0    50     0     0
              1     0    48     2
              2     0    14    36

The confusion matrix also allows for a wider range of performance metrics to be calculated, see here for more details: https://en.wikipedia.org/wiki/Confusion_matrix.

 

MSc Project – Atari Game State Representation using CNNs

[Image: Freeway]

I was recently asked to guest lecture at De Montfort University to the MSc students studying Intelligent Systems and Robotics, and Business Intelligence Systems and Data Mining. The lecture was on my MSc project, which I completed at the university last year.

The project was around the use of Convolutional Neural Networks, with my work focusing on published work from DeepMind, specifically "Playing Atari with Deep Reinforcement Learning" (2013). You can view a Google blog post focusing on the 2015 publication here.

Below is a copy of the slides I presented. They are slightly modified from my dissertation viva as I wanted to express to the students the steps taken throughout the project.

If you can’t see the slide below click here.

 

R vs Matlab vs Python (My Answer)

[Image: My Head Hurts]

So some time back I started an ongoing post trying to compare R, Matlab and Python. Well, my answer is simple: there is no answer. And if you think differently you should ask yourself, "am I just being a fan-boy?"

If you want to read my previous post it is here; if not, here is a quick summary of all 3…

Matlab

  • Proprietary, meaning you will have to pay
  • Powerful
  • GUIs for stuff. You want to build an ANN? There's a GUI. Fancy a blast with Fuzzy Logic? Guess what, there's a GUI.
  • Lots of toolboxes (but you pay for them)
  • Has Simulink, cool for rapid experimenting. Plug in a webcam and you can be object tracking in minutes (maybe 100 minutes, but still minutes)
  • I hate how the windows go behind other windows (sorry, had to be said)
  • Plenty of webinars

R / RStudio

  • Not Proprietary, everything is free
  • Rapidly became HUGE, like it's everywhere. Want a job in machine learning or to be taken seriously in statistics? Then you need R on your CV..! (Facebook, Microsoft, Google all want R)
  • No GUIs, now that is painful. It just makes everything that little bit harder to see what is going on.
  • Lots of toolboxes (but called packages) and they are all FREE
  • Too many toolboxes, yep also a curse. You always find a couple of toolboxes doing the same thing, which is best?
  • RStudio, makes it so much more user friendly than the standard R environment. Don’t even try R without RStudio, seriously just don’t..!
  • Why am I still using <- when I know = works?

Python (more SciKit Learn really)

  • Rapid development
  • Open source
  • Multiple environments (Spyder and Notebook are my favourite)
  • Grown in strength
  • I have to question, will it replace R? I don’t know, some people love it, others like R. We’ll have to see
  • Syntax like you’ve never seen before, seriously my tab key has worn down..!
  • Maybe getting over-complex. You'll need to get to grips with Pandas and NumPy. I found handling data formats a bit of a pain.
  • Matplotlib outputs look a little naff, maybe I needed to play with it more
  • Some good Deep Learning stuff out there (thinking of Theano)
  • Finally, Anaconda, you need this distribution of Python.

So that’s it, all 3 are good. It depends on what you want to use it for. My fan-boy opinion is currently R, looks good on the CV, has loads of packages and the graphs look nice. Also, sooo much support for everything you want to do.

Data Science ‘up North’

[Image: Data-science]

So about 2 years ago I reached out to a fellow MSc student through LinkedIn who was studying via Distance Learning. It turns out he was a chap based in San Francisco working at a big credit card company. His job title was 'Data Scientist', which I thought was novel. It turns out his job was developing methods and algorithms for detecting credit card theft. We exchanged a couple of emails, general stuff about our plans after the course. He explained how the 'data scientist' role was starting to become big over there (USA). So I wondered, what is a data scientist? And why don't we have any jobs like that around here?

What I found was that these types of jobs are popular in London with the big financial companies and the emerging high-tech companies, with salaries ranging upwards of £60k per annum. Sounds a lot to a northerner, but with higher living costs and, as stated by Partridge (2002), "I guarantee you'll either be mugged or not appreciated", it raises the question: is it worth the move?

Slightly further afield, Cambridge also seems to have a high demand for Data Scientists as the city attracts more and more research arms of companies such as Philips and Microsoft. But as we move further up north the demand starts to drop. Why?

My thought process?

Is it because we don’t have any businesses which will benefit from a data scientist?

I started thinking about this and it is true, we don't have the big financial institutions or the large 'high tech' companies. But we do have some big companies. Just thinking of Hull alone we have BP, BAE SYSTEMS, Smith and Nephew, Reckitt Benckiser, ARCO and many more (I got bored thinking of more, sorry). But then we add another question: how many data scientists does one company need?

But then I also thought back to a couple of companies I have worked for: a door manufacturer (£4 million turnover) and a ship repair company (now with about an £8 million turnover). But neither of these companies would ever think about hiring a data scientist. Why? The answer is easy: they don't know what a data scientist is or how one can help their business.

Let's look at me. I have a degree in IT and 15 years' experience working in IT, including a couple of stints as an IT manager. I've spent time managing databases and developing software with the aim of improving the company from a technical perspective. I've even taught computing from school to degree level. But I have never hired, worked as or worked with a Data Scientist.

This leads on to my next question…

Is it because we don’t have the skill set (as a population)?

So my experience of education has made me question: what the hell are we teaching? Let's look at Level 3 provision in Hull. We have three colleges and now every academy has a 6th form, but no-one is teaching the skills needed to become a data scientist. As a city, everything computing seems to be focused on making games, often quoting how much the games industry is worth.

What's your point?

Advertising that you teach Game Making is great, but it's a little like advertising a course in Premiership Football Skills. You get a lot of hype among young people and a lot of applications for your course, but very few actually end up playing for a Premiership team. So the next argument is, "but we're teaching people how to make their own games, you know, all entrepreneurial and that!". Good, and I fully agree with that, but who is it good for? The chap who sets up the games company and a few of his/her mates? Look at Mojang: sold Minecraft for $2 billion, employs 50 people, Notch takes his $2 billion and buggers off to LA. So are games good for the individual or the local economy?

Let's compare that with the potential of a Data Scientist in Hull. Firstly, as shown in the illustration somewhere on this page, a data scientist is multi-disciplinary. You need the data analyst skills, the computer science skills and business/context awareness. Your job: go into a business and make a real change. Look at the data, investigate why things are happening, questioning all the time, looking for statistical relevance. Going a step further with predictive modelling ('if we change this, what will happen?'), and even further with machine learning. Moving away from just storing data to actually using it. So what's good for the economy? Well, for a start, companies will be able to understand their businesses in more depth, make themselves more efficient and possibly reduce costs. Companies in the local area will start catching up with the higher-tech companies who are already employing data scientists.

My Simple Example (hopefully you’ll get it)

It felt like I ended up ranting, so I'll finish with a little example I learnt from a school. In this school the teachers fill in a spreadsheet of pupils' performance and pass the spreadsheet to the deputy headteacher, who transfers the data to a computer program. The computer system outputs each pupil's target for that year, in other words what level they should be at. A couple of points….

  • None of the staff know how it works, it just works
  • The targets are fairly accurate with the pupils
  • The school can now track student progress more accurately
  • Intervention can be planned
  • Pupil performance is raised
  • To develop it, a data scientist:
    • researched the domain
    • managed/manipulated the data
    • developed a statistically significant model
    • coded the software
    • tested it
    • supported the solution

Such a simple example but no-one is teaching the skills to be able to do this, whilst others are not seeing the need.

As promised, a picture.

[Image: DataScientist]

64bit R and 32bit Access Problem [Fixed]

So I've had a little problem. I needed to open an Access database in RStudio. Sounds easy, right? Well, the problem was that I have MS Office 2007 along with the default ODBC drivers, which are 32-bit. I'm also using R and RStudio 64-bit. They don't work well together.

I looked for the 64 bit driver and found it here: Microsoft Access Database Engine 2010 Redistributable.

The next problem is, it doesn’t look like you can install this whilst you have Office 2007 installed. I read you can uninstall Office 2007, install the new driver and then re-install Office 2007.

I didn’t try that, I simply ran the new installer from command prompt with the /passive argument.

How to install without messing about…

  1. Download the driver above.
  2. Open up a command prompt: press Win+R, type cmd and hit Enter.
  3. Use cd to navigate to where you downloaded the driver.
  4. Type “AccessDatabaseEngine_x64.exe /passive” (without quotes) and hit Enter.
  5. Enjoy

 

No post December

Just realised I haven’t updated the blog this month. Which seems odd.

Well, I've been pretty busy. I started the month with an interim project review/update meeting. It felt like a pretty big deal, but it was good to have my first paper drafted, which showed really good progress. The hardest part is balancing the ideas of multiple experts in the field who all think the project can go in different directions. In this review I have two professors, two senior academics and my very senior (and experienced) industrial supervisor.

Read more

Literature Review

The PhD thesis is apparently 80,000 words and approximately 10 chapters. Sounds a little daunting at first; I think I did 18,000 for my MSc project. So far I've been plodding on with the literature review, mainly looking at Home Telehealth Monitoring trials. At first it started getting messy, with notes getting lost and swapping backwards and forwards between papers.

Read more

PhD Life so far

Just thought I'd do a quick post to reflect on the PhD process so far. Everything is going well, I think. I meet my third supervisor next week. I might have to do a presentation; I'll probably find out about 10 minutes before I do, so I think I'll prepare one just in case. Nervous isn't the best word to describe it, but I was a little <whatever it's called> before; then I googled the chap and now I'm <whatever it's called> a lot. Not only a Professor of Cardiology but also one of the senior cardiology people in the country.

Read more

Data Linkage and Anonymisation Workshop

Just got back from a workshop held at the Turing Gateway to Mathematics in Cambridge. The event had a range of fantastic speakers discussing issues around data linkage and data privacy.

Data Linkage

Chris Dibben (University of Edinburgh) introduced the idea of linking different sets of data about subjects from multiple sources. He gave the example of how data is collected from pre-birth (pregnancy records) all the way through a person's life until death (or just after). All of this data is stored in different locations with no unique identifier.

Read more
