PCA in Python with SciKit Learn

PCA Transformed

Let's have a quick look at using Principal Component Analysis (PCA) on the Iris dataset.

What is PCA?

The simplest way to describe PCA is that it is one of many dimensionality reduction techniques, and a very popular one. In fact, many of the tutorials or guides you find for machine learning use this technique in their work, even if it's just for testing.

In short, it takes a lot of dimensions (variables) and reduces them to fewer. The key difference is that once the dataset is transformed, the new variables become 'meaningless' or 'nameless'.

The two key reasons to use PCA (or any dimensionality reduction technique) are to…

  • Reduce the number of features you have – if the dataset is too broad and you perhaps want to train an ML model quicker.
  • Visualisation – we can only really visualise data in 3 dimensions, so PCA can be good for reducing higher dimensions to 2 or 3. Typically most people just display it in 2D.

A more detailed explanation of PCA can be found on Page 65 – [Learning scikit-learn: Machine Learning in Python].

Plan

Our plan…

  • Load the IRIS dataset (4 features and 1 target)
  • Visualise the dataset
  • Build the PCA model
  • Transform the IRIS data to 2 dimensions
  • Visualise again

Load the data

The first step is to load the libraries you need. Here I am using the Anaconda distribution of Python 3, so it has everything I need already.

NOTE: %matplotlib inline is there because I am doing the work in Jupyter Notebooks.

I’ll set some options too, to stop pandas displaying too much (or too little) data.
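Something like the following setup cell works (the exact option values are a matter of taste):

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

# limit how many rows/columns pandas prints (values are illustrative)
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 101)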

Next we actually load the data. The 'from sklearn import datasets' import contains the dataset, so loading it is easy.

We also split it into X for the input variables and y for the classes.

Let's have a look at the feature names and the class labels.
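A minimal version of this step could be:

iris = datasets.load_iris()
X = iris.data      # the four measurements
y = iris.target    # the class labels (0, 1, 2)

print(iris.feature_names)
print(iris.target_names)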

You should see…

[‘sepal length (cm)’, ‘sepal width (cm)’, ‘petal length (cm)’, ‘petal width (cm)’]

and

[‘setosa’ ‘versicolor’ ‘virginica’]

Clean the Dataset

It doesn't need cleaning as such, but I like to work in Pandas Dataframes with small datasets. It lets you see what you are doing a little more clearly.

Here we convert the data (X) to a dataframe and add the feature names (minus spaces and units of measure). Next we add the class labels.
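A sketch of that conversion (the exact column names are up to you):

df = pd.DataFrame(X, columns=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'])
df['Species'] = [iris.target_names[i] for i in y]
df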

You should see…

PCA Iris Data

 

Visualisation

Now we are ready to plot the data.

First let's get the unique label names.
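For example:

labels = df['Species'].unique()
labels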

This will give you…

[‘setosa’, ‘versicolor’, ‘virginica’]

We will loop through these and plot each group (helps colour them up too).
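A sketch of the plotting loop (the colour choices are arbitrary):

colours = ['red', 'green', 'blue']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
for label, colour in zip(labels, colours):
    group = df[df['Species'] == label]
    ax1.scatter(group['SepalLength'], group['SepalWidth'], c=colour, label=label)
    ax2.scatter(group['PetalLength'], group['PetalWidth'], c=colour, label=label)
ax1.set_title('Sepal')
ax2.set_title('Petal')
ax1.legend()
plt.show()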

Here we created two subplots, the first looking at the sepal values and the second looking at the petal values. Really we could look at each dimension against every other; a scatter matrix is good for that (see: Seaborn).

Our plot looks like this…

I’m not going to comment on much here, you can see how each iris is represented. Perhaps the axis labels would have been a nice addition.

PCA

Now let's do the PCA.

First I put the features back in their own array. I didn't really need to do this, but I like X to be the inputs.

Next we create the PCA model, which only really needs the number of components; here we are converting the data (X) down to 2 features.

We then fit the PCA model and then use the model to transform the data. I save the transformed data as X_.

Add it all back into a Dataframe, mine is called dfPCA.
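Putting those steps together might look like this:

X = df[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']].values

pca = PCA(n_components=2)      # reduce the 4 features down to 2
pca.fit(X)
X_ = pca.transform(X)

dfPCA = pd.DataFrame(X_, columns=['x1', 'x2'])
dfPCA['Species'] = df['Species'].values
dfPCA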

We now have this…

PCA Data Transformed

where we can see that we only have 2 features (x1 and x2) and the class label.

Plot the PCA data

Finally let's plot the PCA features…
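Using the same loop as before, but on the two new components:

fig, ax = plt.subplots(figsize=(7, 5))
for label, colour in zip(labels, colours):
    group = dfPCA[dfPCA['Species'] == label]
    ax.scatter(group['x1'], group['x2'], c=colour, label=label)
ax.legend()
plt.show()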

which gives us…

PCA Transformed

Conclusion

The PCA transformation has worked well with this data. You should be able to use a range of different classifiers on this new data and they should perform well.

Titanic 3 – Model Evaluation

This part directly follows on from the Titanic Logistic Regression model we built, so you need to work through that part.

NOTE: You should put this code at the bottom of the code you have already created in that part.

Load the package

For this we are going to be using the SciKit Learn metrics package. We load this as shown below. This code says, from the metrics package load everything (*). It is better practice to bring in only the modules we need but it is simpler to bring them all in.
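The import line is simply:

from sklearn.metrics import *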

You can put the above code where you left off, I always prefer to put it in the cell with all the other packages. So my code looks like this…

If you do it my way, remember to re-run the cell. Otherwise the package will not load.

Predict all of x

Previously we predicted one row at a time. Now we are going to predict the whole dataset. This is done in the same way as before.
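Assuming the fitted model from the last part is still called result, and x still holds the inputs, the prediction is one line:

pred = result.predict(x)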

This used the model called ‘result’ to predict the survival probability for all of the passengers we have data for.

If we run ‘pred’ we can see the predictions as values between 0 and 1.

Confusion Matrix

Now we can start looking at how good our model is by comparing the true survival outcomes against what the model predicted. A nice way of doing this is using a confusion matrix. Below is an example of what a confusion matrix looks like.

Confusion Matrix Example

The confusion matrix can give us a lot of information. At first glance it shows the number of correct predictions and the number of incorrect predictions. Let's assume the example above shows survival (yes they survived, no they didn't survive); we can see the following…

  • 100 people that were predicted to survive actually survived (Correct classification)
  • 50 people that were predicted to not survive actually didn't survive (Correct classification)
  • 10 people were predicted to survive but actually died (False Positive / Type 1 error)
  • 5 people were predicted to die but actually survived (False Negative / Type 2 error)

Some other useful measures can be calculated from the confusion matrix, discussed on Wikipedia, including:

  • Accuracy
  • Positive Predictive Value (PPV)
  • Negative Predictive Value (NPV)
  • Precision
  • Recall
  • Sensitivity
  • Specificity
  • Etc, etc, etc

So now let's look at our confusion matrix. In our code we use np.round(pred, 0), which rounds the prediction score to either 0 or 1. This is important because the confusion matrix compares classifications, so did the person survive or not. By doing this we assume that any prediction over 0.5 means they survived and anything below means they did not. 0.5 is the common cut-off; however, the optimal cut-off can be calculated using the Youden J index (I will create a tutorial at some point).
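A sketch of the call (numpy is assumed to be imported as np from the earlier part):

confusion_matrix(y, np.round(pred, 0))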

This should give us this…

This is not pretty, and you need to be careful when interpreting it. Our first job is to identify which are the true positives and which are the true negatives, then the false positives and the false negatives.

A nicer way of viewing this is…
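One option for a labelled version is pandas' crosstab (the row/column names here are just labels I have chosen):

pd.crosstab(y, np.round(pred, 0), rownames=['Actual'], colnames=['Predicted'])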

This gives us…

Titanic Confusion Matrix

Here we can see the performance of our model as…

  • correctly identifying 210 people who survived
  • correctly identifying 363 people who did not survive
  • incorrectly predicting that 61 people would survive when they actually died
  • incorrectly predicting that 80 people would die when they actually survived

Next we want to find out how accurate the model was; we can calculate this manually or we can quickly run this code…
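For example:

accuracy_score(y, np.round(pred, 0))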

My model has an accuracy of 0.80252100…. meaning that the model has an accuracy of 80%.

ROC Plot

The Receiver Operating Characteristic (ROC) plot is a popular method of presenting the performance of a classifier. For this to work your predictions need to be on a scale of 0 to 1, not just 0s or 1s. The plot shows the trade-off between sensitivity and specificity of the model as the threshold changes. The axes are usually labelled the 'false positive rate' (FPR, which is 1 – specificity) and the 'true positive rate' (TPR, which is sensitivity).

To produce this we need to calculate the TPR and FPR at different thresholds, SciKit Learn does this for us…
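Something like:

fpr, tpr, thresholds = roc_curve(y, pred)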

Next we simply need to plot the FPR and TPR.
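A basic version of the plot:

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')   # reference line: no better than guessing
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.show()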

This gives us…

Titanic ROC Plot

Note: FPR is 1 – specificity; sensitivity and specificity are more commonly known terms than TPR and FPR.

The plot shows a smooth curve, which is good. It shows that we can adjust the threshold to increase sensitivity at the cost of specificity, and vice versa. For example…

  • If we have a sensitivity of 0.8 we have a specificity of about 0.78 (1 – 0.22)
  • If we change the threshold to increase sensitivity to 0.9 we have a specificity of around 0.4 (1 – 0.6).

From this we also can also calculate the Area Under the Curve (AUC). This is a good measure of performance. An AUC of 0.5 means the model is not very good, it is no better than a 50/50 guess. If we have an AUC of less than 0.5 then something went wrong. When you read clinical papers looking at predicting life/death or illness/no-illness then an AUC greater than 0.7 is good and an AUC greater than 0.8 is very good.

We calculate ours using…
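For example:

auc(fpr, tpr)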

Note: we use the TPR and FPR from the ROC plot.

This gives us an AUC of 0.86328887….. or 86%, which is pretty good.

Titanic 2 – Logistic Regression

Now that we have cleaned the data as outlined previously, we can build our first model. Here we are going to develop a logistic regression (LR) model using the cleaned data.

One of the main advantages of LR models is that they give a clear explanation of how the variables influence the outcome, in our case the likelihood of survival. For example, as age increases, how does the risk change? This comes in the form of a coefficient. The main limitation of LR models, from a predictive modelling point of view, is that they are linear models. This means they cannot model complex relationships between variables and outcomes.

Let's get started.

Load the packages

For this we are going to need a couple of packages, all of these should be included in the Anaconda Python distribution.

We also set some Pandas options as before along with telling Matplotlib to use inline to display the plots in the notebook.
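A sketch of the setup cell; the post doesn't name the modelling package, but statsmodels fits the manual-intercept step below, so that's what is assumed here:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm    # assumption: statsmodels for the logistic regression

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 101)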

Load the data

This is part of a bigger project so I am hoping you have worked through the first part of the project. If so you should have a good understanding of what we have done to the data. If not, you can get the data Here (Titanic_Clean.csv).

Next we load the data.
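For example:

df = pd.read_csv('Data/Titanic_Clean.csv')
df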

If you get an error make sure you have the datafile in the Data folder.

Modifying the Data

We need to make a few more changes to the data to make it suitable for the LR model. First we need to drop a couple of features, and second we need to add an intercept.

Dropping some features

Previously we converted Sex and Embarked to dummy variables. For logistic regression to work correctly we need to make one Sex and one Embarked variable the 'reference' variable. For example, if we make 'male' the reference variable, then when 'female' is zero the model is for males. If 'female' is one, then the model shows how the risk changes compared to males, if all other variables are equal.

In software like SPSS you would state which variable is the reference variable; in R the first level is used by default. In our example we simply drop the reference variable.

Here we drop the male variable and the Cherbourg embarked variable. So if all the other embarked variables are zero and sex is not female, then the prediction is for males boarding in Cherbourg.
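The column names below assume the dummy variables from part 1 were created without a prefix; check df.columns if yours differ:

df = df.drop(['male', 'C'], axis=1)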

Intercept

Next we need to add an intercept column. This is a new column with all the values set to one. The intercept is the predicted outcome if all the other variables are zero. Here I’ve called it ‘_intercept’.
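One line does it:

df['_intercept'] = 1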

Split the data into x and y

The last thing we will do with the data is split it into two parts; it just makes things easier later on. Here we split the outcome variable (survived) out as y and the explanatory variables (everything else) as x. Therefore we can predict y (survived) based on x (the variables).
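A sketch of the split (the outcome column is assumed to be called 'Survived'):

y = df['Survived']
x = df.drop(['Survived'], axis=1)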

You can view them by simply running x or y.

The Logistic Regression Model

Wow, after so much playing with the data we can finally build a model. That’s data science for you, 80% cleaning data, 20% model building.
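A minimal sketch, assuming statsmodels' Logit:

logit = sm.Logit(y, x)
result = logit.fit()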

That was easy 🙂

Let's view the model…
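For example:

result.summary()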

And you should see this…

Titanic 1 – Data Exploration

In this section we are going to explore the dataset. This is an important step in the statistical/machine learning process as firstly we need to know more information about the data we are using and secondly we need to make a few alterations to the data itself.

This tutorial is part of the Titanic Project, you need to read that first..!

Make sure you have set up your workspace and grabbed the data, as explained here.

Next you will need to create a new notebook file in the titanic folder. I’ve called mine ‘Data Clean and Explore’.

Load the packages we need

For this part of the project we are going to need the following…

  • Pandas – for loading and storing the data

Now let's load them…
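Something like:

import pandas as pd

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 101)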

I’ve added a couple of extra lines, pd.set_option sets some pandas options (as I type this I realise how obvious this is). We are limiting the number of rows to 10 but increasing the number of columns to 101. I find it more useful to see a lot of columns.

Load the data

So now we need to load the data we have in. I always try to name my dataframes as df. It’s easier to write, most commonly used when looking at guides on the internet, etc, etc. You have to think a little harder when you have multiple dataframes, i.e. dfFood, dfPeople.
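For example:

df = pd.read_csv('Data/Titanic.csv')
df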

If this doesn’t work you either didn’t set up the folders correctly or you didn’t download the file correctly, or a bit of both.

You should be able to see all of the columns and 10 rows of data. The data means….

  • survival – Survival (0 = No; 1 = Yes)
  • pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • name – Name
  • sex – Sex
  • age – Age
  • sibsp – Number of Siblings/Spouses Aboard
  • parch – Number of Parents/Children Aboard
  • ticket – Ticket Number
  • fare – Passenger Fare
  • cabin – Cabin
  • embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

What does all this tell us? First we need to know the question or focus of the work: we are building a predictive model. We are also interested in finding out what features influenced survival chances. An example can be seen when asking, 'are men less likely to survive than women?' Remember this was all before the Equality Act 2010, so 'women and children first'..!

So in our work we are going to need to make some changes to the data, and delete some columns.

Data cleaning and manipulation

Quickly view the feature names.
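For example:

df.columns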

Next we are going to drop ‘PassengerId’, ‘Name’ and ‘Ticket’. The reasons for these are…

  • PassengerId – this is a unique incrementing number, it should not influence the ‘risk’
  • Name – I don’t really want to know if people called Bob are more likely to survive than someone called Dave, plus we don’t have enough data to support this.
  • Ticket – Looks messy, and for similar reasons to PassengerId

So we drop them…
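One way:

df = df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)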

Next we need to look at the Cabin. We are going to be lazy in our example and change it to a Yes or No variable, so did they have a cabin or not. A better way of doing this would be to split the cabin value to get the first letter, as usually this represents which deck the cabin was on. This could be useful; this can be your homework 😉

Just to test, we can see which passengers did not have a cabin by using this line…
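For example:

df[df['Cabin'].isnull()]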

We can see that there are 687 people who did not have a cabin.

Next let's create a new variable called 'HasCabin'; this will initially be set to NA, then set to 1 for having a cabin and 0 for not having one.
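A sketch of that step (numpy is only brought in here for its NaN value):

import numpy as np

df['HasCabin'] = np.nan
df.loc[df['Cabin'].notnull(), 'HasCabin'] = 1
df.loc[df['Cabin'].isnull(), 'HasCabin'] = 0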

We can see the counts using…
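For example:

df['HasCabin'].value_counts()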

Showing 687 do not have a cabin and 204 who do.

Next we drop the original Cabin column.
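One line again:

df = df.drop(['Cabin'], axis=1)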

We should see our dataset looking nice and clean, but there is still more to do.

Create the dummy variables

Yes, creating dummy variables is a real term, I didn’t just make it up. Dummy variables essentially change categorical variables into numbers. Statistical models and machine learning models don’t like text. (I’ll try to write a short post on ordinal and nominal categorical variables sometime).

We need to create dummy variables for Embarked and Sex. This is easy with Pandas.
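A sketch (the variable names are just mine):

embarked = pd.get_dummies(df['Embarked'])
sex = pd.get_dummies(df['Sex'])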

Now we just need to join them to the df dataframe.
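For example:

df = pd.concat([df, embarked, sex], axis=1)
df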

Now we should see the dataset with 891 rows and 14 columns.

Next we need to drop the original columns.
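Something like:

df = df.drop(['Embarked', 'Sex'], axis=1)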

Only keep complete cases

Another thing that the statistical models and machine learning models don’t like is missing values, so we will only keep the rows which are complete.
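The drop is a one-liner:

df = df.dropna()
df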

We have now gone from 891 rows to 714 rows.

Save the data

We are going to use this data for the next parts, so we need to save the data out.
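For example:

df.to_csv('Data/Titanic_Clean.csv', index=False)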

We’ve saved the dataframe as Titanic_Clean.csv in the Data folder.

Next

See Part 2 – Visualisation or Part 3 – Logistic Regression.

Titanic Project

Here we look at different predictive models using Python for data science and machine learning. The aim is to run through some different methods to develop a basic understanding of the workflow for producing different explanatory and predictive models.

Setup the workspace and get the data

Like all of my projects, first you need to set up a root folder. So create a folder called 'Titanic'; this folder should be used for all your .ipynb files (if you are using Jupyter Notebook) or .py files (if you are just using a Python IDE). Next, in the Titanic folder create a folder called 'Data'; this is where we will store the data file.

The data we are using here comes from Kaggle, you can read all about it here. For this work you should download this version, it is exactly the same but just renamed to Titanic.csv. Download the file to the Data folder.

Next you can have a look at the following parts…

Essential Python Packages

Here is a list of my essential packages you will need for data science with Python. I will expand the list as I go.

Anaconda Python

This is my favourite flavour of Python. It contains everything you need for data analysis and machine learning. The main things I use which are included are…

  • Jupyter Notebook
  • SciKit-Learn
  • Matplotlib
  • Pandas
  • Numpy

You can get it here.

Seaborn: statistical data visualization

This is a fantastic package for visualising your data. It has a good mix of some statistical features too. Have a look here.

To install it you need to open a command prompt/terminal and run this (requires pip)…
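Assuming pip is on your path, the usual install command is:

pip install seaborn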

 

K Means Clustering in Python

K Means clustering is an unsupervised machine learning algorithm. An example of a supervised learning algorithm can be seen when looking at neural networks, where the learning process involves both the inputs (x) and the outputs (y). During the learning process the error between the predicted outcome (predY) and the actual outcome (y) is used to train the system. In an unsupervised method such as K Means clustering the outcome (y) variable is not used in the training process.

In this example we look at using the IRIS dataset and cover:

  • Importing the sample IRIS dataset
  • Converting the dataset to a Pandas Dataframe
  • Visualising the classifications using scatter plots
  • Simple performance metrics

 

Requirements: I am using Anaconda Python Distribution which has everything you need including Pandas, NumPy, Matplotlib and importantly SciKit-Learn. I am also using iPython Notebook but you can use whatever IDE you want.

Setup the environment and load the data

Bring in the libraries you need.
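A sketch of the imports used in the rest of this post:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn import metrics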

Next load the data. Scikit Learn has some sample datasets; having imported datasets above, we can use the following.
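Something like:

iris = datasets.load_iris()
iris.data      # the measurements
iris.target    # the class labels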

You can view the data running each line individually.

I like to work with Pandas Dataframes, so we will convert the data into that format. Note that we have separated out the inputs (x) and the outputs/labels (y).
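A sketch of the conversion (the column names are my own choice; the Targets name is used in the plotting code below):

x = pd.DataFrame(iris.data,
                 columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
y = pd.DataFrame(iris.target, columns=['Targets'])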

Visualise the data

It is always important to have a look at the data. We will do this by plotting two scatter plots. One looking at the Sepal values and another looking at Petal. We will also set it to use some colours so it is clearer.
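A sketch of the two scatter plots (the colour map is an arbitrary choice):

colormap = np.array(['red', 'lime', 'black'])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(x.Sepal_Length, x.Sepal_Width, c=colormap[y.Targets])
ax1.set_title('Sepal')
ax2.scatter(x.Petal_Length, x.Petal_Width, c=colormap[y.Targets])
ax2.set_title('Petal')
plt.show()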

Iris_Original

Build the K Means Model

This is the easy part, providing you have the data in the correct format (which we do). Here we only need two lines. First we create the model and specify the number of clusters the model should find (n_clusters=3), then we fit the model to the data.
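The two lines:

model = KMeans(n_clusters=3)
model.fit(x)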

Next we can view the results. These are the classes that the model decided on; remember this is unsupervised, so the model classified these purely based on the data.
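For example:

model.labels_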

You should see something like this…

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0,
2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

Visualise the classifier results

Let's plot the actual classes against the predicted classes from the K Means model.
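A sketch of the side-by-side plots:

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(x.Petal_Length, x.Petal_Width, c=colormap[y.Targets])
ax1.set_title('Real Classification')
ax2.scatter(x.Petal_Length, x.Petal_Width, c=colormap[model.labels_])
ax2.set_title('K Means Classification')
plt.show()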

Here we are plotting the Petal Length and Width; however, each plot changes the colours of the points using either c=colormap[y.Targets] for the original classes or c=colormap[model.labels_] for the predicted classes.

The result is….

Iris_Model1

Ignore the colours (at the moment). Because the model is unsupervised it did not know which label (class 0, 1 or 2) to assign to each cluster.

The Fix

Here we are going to change the class labels; we are not changing any of the classification groups, we are simply giving each group the correct number. We need to do this for measuring the performance.

In the code below we use np.choose() to assign new values; basically we are changing the 1s in the predicted values to 0s and the 0s to 1s. Class 2 matched so we can leave it. By running the two print functions you can see that all we have done is swap the values.
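With my cluster numbering, the swap looks like this:

predY = np.choose(model.labels_, [1, 0, 2])   # label 0 becomes 1, label 1 becomes 0, label 2 stays 2
print(model.labels_)
print(predY)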

NOTE: your results might be different to mine, if so you will have to figure out which class matches which and adjust the order of the values in the np.choose() function.

Re-plot

Now we can re-plot the data as before, but using predY instead of model.labels_.
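For example:

plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[predY])
plt.title('K Means Classification (corrected)')
plt.show()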

Iris_Model_Corrected

Now we can see that the K Means classifier has identified one class correctly (red) but some blacks have been classed as greens and vice versa.

Performance Measures

There are a number of ways in which we can measure a classifier's performance. Here we will calculate the accuracy and also the confusion matrix.

We need two values: y, which holds the true (original) values, and predY, which holds the model's values.

Accuracy
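Using scikit-learn's metrics module:

metrics.accuracy_score(y.Targets, predY)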

My result was 0.89333333333333331, so we can say that the model has an accuracy of 89.3%. Not bad considering the model was unsupervised.

Confusion Matrix
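And the confusion matrix:

metrics.confusion_matrix(y.Targets, predY)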

My results are…

array([[50, 0, 0],
[ 0, 48, 2],
[ 0, 14, 36]])

Hopefully the table below will render correctly, but we can summarise the confusion matrix as shown below:

  • correctly identified all 50 class 0's as 0's
  • correctly classified 48 class 1's but misclassified 2 class 1's as class 2
  • correctly classified 36 class 2's but misclassified 14 class 2's as class 1

                   Predicted Class
                   0      1      2
Real Class   0    50      0      0
             1     0     48      2
             2     0     14     36

The confusion matrix also allows for a wider range of performance metrics to be calculated, see here for more details: https://en.wikipedia.org/wiki/Confusion_matrix.

 

Python Perceptron (Re-visited)

I feel like I've been staring at RStudio for way too long, so I've decided to give Python (SciKit-Learn) another go. I really do recommend Anaconda Python for this; it contains everything you need for scientific Python coding, including…

  • Python 2.7 (3.x is available)
  • SciKit-Learn – Everything machine learning related
  • Pandas – Dataframes (all kinds of dataframe stuff)
  • Matplotlib – The plotting library
  • iPython Notebook – A web based IDE, I like using this
  • Spyder – A nice IDE, I'm still getting to grips with it

Seriously, just get Anaconda Python, it is FREE.

I have done a previous post on the exact same problem, however this one uses DataFrames and is hopefully a little neater. There are some tweaks to plotting the hyperplane / decision boundary.

Let's get started

Load in the data
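A sketch of the setup; the original post used its own small dataset, so the values below are only stand-ins with the same structure (columns A, B and a 0/1 Targets column):

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Perceptron

# stand-in data -- replace with your own values
df = pd.DataFrame({
    'A':       [0.1, 0.3, 0.2, 0.8, 0.9, 0.7],
    'B':       [0.2, 0.1, 0.4, 0.8, 0.6, 0.9],
    'Targets': [0,   0,   0,   1,   1,   1]
})
df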

Wow, look at the DataFrame in action. We have 3 columns: A, B and Targets. A and B are just the input values. The target is a dichotomous value of 0 or 1; this could represent No or Yes, Product A or Product B, Dead or Alive, etc.

Plot the data (optional)

I always like to plot the data; I think it's good practice to see what you are doing.
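For example:

plt.scatter(df.A, df.B, c=df.Targets)
plt.show()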

You should see this. If not, you might have forgotten the inline thing (above) or your install of Python is missing something.

Perceptron 1

Plotted data

Build the model and train
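A minimal sketch using scikit-learn's Perceptron (the variable name net is mine):

net = Perceptron()
net.fit(df[['A', 'B']], df['Targets'])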

View the coefficients (optional)

I like to see what is going on
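For example:

print(net.coef_)
print(net.intercept_)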

Plot the hyperplane / decision boundary
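One way to draw the decision boundary is to solve w0*A + w1*B + b = 0 for B:

w = net.coef_[0]
b = net.intercept_[0]

xx = np.linspace(df.A.min(), df.A.max())
yy = -(w[0] * xx + b) / w[1]    # assumes w[1] is non-zero

plt.scatter(df.A, df.B, c=df.Targets)
plt.plot(xx, yy, 'k-')
plt.show()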

You should see this…

Perceptron 2

A scatter graph showing the data and the hyperplane. Everything to the right is a 1, everything to the left is a 0.

Using the system to make a prediction (and a confusion matrix)

Really we should be passing different data to it here, but we can see the code to use the perceptron to predict the outcome based only on the inputs (in our case A and B).
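A sketch of the prediction and confusion matrix:

from sklearn.metrics import confusion_matrix

pred = net.predict(df[['A', 'B']])
print(confusion_matrix(df['Targets'], pred))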

The code also outputs a confusion matrix, it looks horrible.

Homework

Try this data to…

  • Use it in the prediction of this model – how well does the system perform?
  • Rebuild the full model using this data
    • See how the hyperplane has moved?

 

Basic Imputation in R

Impin' ain't easy

Hello and welcome to another R Stats adventure (using the Stampy Longnose voice). Today we're looking at imputation, or the guessing/estimation of missing values in data. Be warned, once you impute data you bias your findings. Motivation: in my main data set I started with 6,000 records, reduced to 800 (selection criteria), and of these 400 had missing values. So my 6,000 record data set became only worth 400.


R vs Matlab vs Python (My Answer)

My Head Hurts

So some time back I started an ongoing post trying to compare R, Matlab and Python. Well, my answer is simple: there is no answer. And if you think differently you should ask yourself, "am I just being a fan-boy?"

If you want to read my previous post it is here; if not, here is a quick summary of all 3…

Matlab

  • Proprietary, meaning you will have to pay
  • Powerful
  • GUIs for stuff. You want to build an ANN, there's a GUI; fancy a blast with Fuzzy Logic, guess what, there's a GUI.
  • Lots of toolboxes (but you pay for them)
  • Has Simulink, cool for rapid experimenting. Plug in a webcam and you can be object tracking in minutes (maybe 100 minutes, but still minutes)
  • I hate how the windows go behind other windows (sorry, had to be said)
  • Plenty of webinars

R / RStudio

  • Not Proprietary, everything is free
  • Rapidly became HUGE, like it's everywhere. Want a job in machine learning or to be taken seriously in statistics? Then you need R on your CV..! (Facebook, Microsoft, Google all want R)
  • No GUIs, now that is painful. It just makes everything that little bit harder to see what is going on.
  • Lots of toolboxes (but called packages) and they are all FREE
  • Too many toolboxes, yep, also a curse. You always find a couple of packages doing the same thing; which is best?
  • RStudio makes it so much more user friendly than the standard R environment. Don't even try R without RStudio, seriously just don't..!
  • Why am I still using <- when I know = works?

Python (more SciKit Learn really)

  • Rapid development
  • Open source
  • Multiple environments (Spyder and Notebook are my favourite)
  • Grown in strength
  • I have to question, will it replace R? I don’t know, some people love it, others like R. We’ll have to see
  • Syntax like you’ve never seen before, seriously my tab key has worn down..!
  • Maybe getting over-complex. You'll need to get to grips with Pandas and NumPy. I found handling data formats a bit of a pain.
  • Matplotlib outputs look a little naff; maybe I needed to play with it more
  • Some good Deep Learning stuff out there (thinking of Theano)
  • Finally, Anaconda, you need this distribution of Python.

So that's it, all 3 are good. It depends on what you want to use them for. My fan-boy opinion is currently R: it looks good on the CV, has loads of packages and the graphs look nice. Also, sooo much support for everything you want to do.

Data Science ‘up North’

Data-science

So about 2 years ago I reached out to a fellow MSc student through LinkedIn who was studying via distance learning. It turns out he was a chap based in San Francisco working at a big credit card company. His job title was 'Data Scientist', which I thought was novel. It turns out his job was developing methods and algorithms for detecting credit card theft. We exchanged a couple of emails, general stuff about our plans after the course. He explained how the 'data scientist' role was starting to become big over there (USA). So I wondered, what is a data scientist? Why don't we have any jobs like that around here?

What I found was that these types of jobs are popular in London with the big financial companies and the emerging high-tech companies, with salaries ranging upwards of £60k per annum. Sounds a lot to a northerner, but with higher living costs and, as stated by Partridge (2002), "I guarantee you'll either be mugged or not appreciated", it raises the question, is it worth the move?

Slightly further afield, Cambridge also seems to have a high demand for data scientists as the city attracts more and more research arms of companies such as Philips and Microsoft. But as we move further up north the demand starts to drop. Why?

My thought process?

Is it because we don’t have any businesses which will benefit from a data scientist?

I started thinking about this and it is true, we don't have the big financial institutions or the large 'high tech' companies. But we do have some big companies. Just thinking of Hull alone, we have BP, BAE Systems, Smith and Nephew, Reckitt Benckiser, ARCO and many more (I got bored thinking of more, sorry). But then we add another question: how many data scientists does one company need?

But then I also thought back to a couple of companies I have worked for: a door manufacturer (£4 million turnover) and a ship repair company (now with about an £8 million turnover). Neither of these companies would ever think about hiring a data scientist. But why? The answer is easy: they don't know what a data scientist is or how one could help their business.

Let's look at me. I have a degree in IT and 15 years' experience working in IT, including a couple of stints as an IT manager. I've spent time managing databases and developing software with the aim of improving the company from a technical perspective. I've even taught computing from school to degree level. But I have never hired, worked as or worked with a data scientist.

This leads on to my next question…

Is it because we don’t have the skill set (as a population)?

So my experience of education has made me question, what the hell are we teaching? Let's look at Level 3 provision in Hull. We have three colleges and now every academy has a sixth form, but no-one is teaching the skills to become a data scientist. As a city, everything computing seems to be focused on making games, often quoting how much the games industry is worth.

What's your point?

Advertising that you teach game making is great, but it's a little like advertising a course in Premiership Football Skills. You get a lot of hype among young people and a lot of applications for your course, but very few actually end up playing for a Premiership team. So the next argument is, "but we teach people how to make their own games, you know, all entrepreneurial and that!". Good, and I fully agree with that, but who is it good for? The chap who sets up the games company and a few of his/her mates? Look at Mojang: sold Minecraft for $2 billion, employs 50 people, Notch takes his $2 billion and buggers off to LA. So are games good for the individual or the local economy?

Let's compare that with the potential of a data scientist in Hull. Firstly, as shown in the illustration somewhere on this page, a data scientist is multi-disciplinary. You need the data analyst skills, the computer science skills and business/context awareness. Your job: go into a business and make a real change. Look at the data, investigate why things are happening, questioning all the time, looking for statistical relevance. Go a step further with predictive modelling, 'if we change this, what will happen?', and even further with machine learning. Move away from just storing data to actually using it. So what's good for the economy? Well, for a start, companies will be able to understand their businesses in more depth, make themselves more efficient and possibly reduce costs. Companies in the local area will start catching up with the higher-tech companies who are already employing data scientists.

My Simple Example (hopefully you’ll get it)

It felt like I ended up ranting, so I'll finish with a little example I learnt from a school. In this school the teachers fill in a spreadsheet of pupil performance and pass the spreadsheet to the deputy headteacher, who transfers the data to a computer program. The computer system outputs each pupil's target for that year, in other words what level they should be at. A couple of points…

  • None of the staff know how it works, it just works
  • The targets are fairly accurate with the pupils
  • The school can now track student progress more accurately
  • Intervention can be planned
  • Pupil performance is raised
  • To develop it a data scientist:
    • researched the domain
    • managed/manipulated the data
    • developed a statistically significant model
    • coded the software
    • tested it
    • supported the solution

Such a simple example but no-one is teaching the skills to be able to do this, whilst others are not seeing the need.

As promised, a picture

DataScientist