R vs Matlab vs Python (My Answer)

My Head Hurts

So some time back I started an ongoing post trying to compare R, Matlab and Python. Well, my answer is simple: there is no answer. And if you think differently, you should ask yourself, “am I just being a fan-boy?”

If you want to read my previous post it is here; if not, here is a quick summary of all three…

Matlab

  • Proprietary, meaning you will have to pay
  • Powerful
  • GUIs for stuff. Want to build an ANN? There’s a GUI. Fancy a blast with Fuzzy Logic? Guess what, there’s a GUI.
  • Lots of toolboxes (but you pay for them)
  • Has Simulink, cool for rapid experimenting. Plug in a webcam and you can be object tracking in minutes (maybe 100 minutes, but still minutes)
  • I hate how the windows go behind other windows (sorry, had to be said)
  • Plenty of webinars

R / RStudio

  • Not Proprietary, everything is free
  • Rapidly became HUGE, like it’s everywhere. Want a job in machine learning or to be taken seriously in statistics? Then you need R on your CV..! (Facebook, Microsoft and Google all want R)
  • No GUIs, and that is painful. It just makes it that little bit harder to see what is going on.
  • Lots of toolboxes (but called packages) and they are all FREE
  • Too many toolboxes, yep, also a curse. You always find a couple of toolboxes doing the same thing; which is best?
  • RStudio makes it so much more user-friendly than the standard R environment. Don’t even try R without RStudio, seriously just don’t..!
  • Why am I still using <- when I know = works?
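For the record, at the top level the two really do the same thing:

  x <- 5
  y = 5
  identical(x, y)  # TRUE -- both assign at the top level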

Python (more SciKit Learn really)

  • Rapid development
  • Open source
  • Multiple environments (Spyder and Notebook are my favourites)
  • Grown in strength year on year (the community and the libraries keep expanding)
  • I have to ask: will it replace R? I don’t know; some people love it, others prefer R. We’ll have to see
  • Indentation-based syntax like you’ve never seen before, seriously my Tab key has worn down..!
  • Maybe getting over-complex. You’ll need to get to grips with Pandas and NumPy. I found handling data formats a bit of a pain.
  • Matplotlib outputs look a little naff, maybe I needed to play with it more
  • Some good Deep Learning stuff out there (thinking of Theano)
  • Finally, Anaconda, you need this distribution of Python.

So that’s it: all three are good, and it depends on what you want to use them for. My fan-boy opinion is currently R: it looks good on the CV, has loads of packages and the graphs look nice. Also, there is sooo much support for everything you want to do.

Data Science ‘up North’


So about two years ago I reached out to a fellow MSc student through LinkedIn who was studying via distance learning. It turned out he was a chap based in San Francisco working at a big credit card company. His job title was ‘Data Scientist’, which I thought was novel. It turned out his job was developing methods and algorithms for detecting credit card fraud. We exchanged a couple of emails, general stuff about our plans after the course. He explained how the ‘data scientist’ role was starting to become big over there (in the USA). So I wondered: what is a data scientist? And why don’t we have any jobs like that around here?

What I found was that these types of jobs are popular in London with the big financial companies and the emerging high-tech companies, with salaries upwards of £60k per annum. Sounds a lot to a northerner, but with higher living costs, and, as stated by Partridge (2002), “I guarantee you’ll either be mugged or not appreciated”, it raises the question: is it worth the move?

Slightly further afield, Cambridge also seems to have a high demand for Data Scientists as the city attracts more and more research arms of companies such as Philips and Microsoft. But as we move further up north the demand starts to drop. Why?

My thought process?

Is it because we don’t have any businesses which will benefit from a data scientist?

I started thinking about this and it is true: we don’t have the big financial institutions or the large ‘high-tech’ companies. But we do have some big companies. In Hull alone we have BP, BAE Systems, Smith and Nephew, Reckitt Benckiser, ARCO and many more (I got bored thinking of more, sorry). But then we hit another question: how many data scientists does one company need?

But then I also thought back to a couple of companies I have worked for: a door manufacturer (£4 million turnover) and a ship repair company (now with about an £8 million turnover). Neither of these companies would ever think about hiring a data scientist. But why? The answer is easy: they don’t know what a data scientist is or how one could help their business.

Let’s look at me. I have a degree in IT and 15 years’ experience working in IT, including a couple of stints as an IT manager. I’ve spent time managing databases and developing software with the aim of improving the company from a technical perspective. I’ve even taught computing from school to degree level. But I have never hired, worked as, or worked with a Data Scientist.

This leads on to my next question…

Is it because we don’t have the skill set (as a population)?

So my experience of education has made me question: what the hell are we teaching? Let’s look at Level 3 provision in Hull. We have three colleges, and now every academy has a sixth form, but no one is teaching the skills needed to become a data scientist. As a city, everything computing-related seems to be focused on making games, often quoting how much the games industry is worth.

What’s your point?

Advertising that you teach Game Making is great, but it’s a little like advertising a course in Premiership Football Skills. You get a lot of hype among young people and a lot of applications for your course, but very few actually end up playing for a Premiership team. So the next argument is, “but we’re teaching people how to make their own games, you know, all entrepreneurial and that!”. Good, and I fully agree with that, but who is it good for? The chap who sets up the games company and a few of his/her mates? Look at Mojang: it sold Minecraft for $2 billion, employs 50 people, and Notch takes his $2 billion and buggers off to LA. So are games good for the individual or for the local economy?

Let’s compare that with the potential of a Data Scientist in Hull. Firstly, as shown in the illustration somewhere on this page, a data scientist is multi-disciplinary: you need the data analyst skills, the computer science skills and business/context awareness. Your job: go into a business and make a real change. Look at the data, investigate why things are happening, questioning all the time, looking for statistical relevance. Go a step further with predictive modelling (‘if we change this, what will happen?’), and further still with machine learning. Move away from just storing data to actually using it. So what’s good for the economy? Well, for a start, companies will be able to understand their businesses in more depth, make themselves more efficient and possibly reduce costs. Companies in the local area will start catching up with the higher-tech companies who are already employing data scientists.

My Simple Example (hopefully you’ll get it)

It feels like I ended up ranting, so I’ll finish with a little example I learnt from a school. In this school the teachers fill in a spreadsheet of pupil performance and pass it to the deputy headteacher, who transfers the data to a computer program. The program outputs each pupil’s target for that year, in other words what level they should be at. A few points…

  • None of the staff know how it works, it just works
  • The targets turn out to be fairly accurate for the pupils
  • The school can now track student progress more accurately
  • Intervention can be planned
  • Pupil performance is raised
  • To develop it, a data scientist:
    • researched the domain
    • managed/manipulated the data
    • developed a statistically significant model
    • coded the software
    • tested it
    • supported the solution

Such a simple example, but no one is teaching the skills needed to do this, whilst others are not seeing the need.

As promised, a picture…

[Image: the multi-disciplinary skill set of a data scientist]

Google playing Atari (how?)

Pretty excited today: I saw on the news that Google has made a cutting-edge breakthrough in creating a ‘general agent’ which plays Atari games.

The bad news is, it is not breaking news. In fact, it is over two years old.

The cool news is, I spent the summer working on some of its limitations for my MSc in Intelligent Systems and Robotics, and I am eager to talk about the project.

Background

The concept of playing Atari games with a general agent was originally started by DeepMind. DeepMind was an unusual business, perhaps a model for future entrepreneurs, in that it did not produce anything. They simply got the best machine learning people together and researched really interesting topics, like Deep Reinforcement Learning. I’m partly guessing here, but their funding is likely to have come from industrial partners wanting to get in on cutting-edge projects.

Anyway, Google bought DeepMind for a lot of money (see here) and now Google owns their work and has continued to develop it further.

Machine Learning

So the system works through the use of Q-Learning, which is a form of Reinforcement Learning (RL). RL is a sort of semi-supervised machine learning technique; the semi-supervised aspect is that the system needs something to aim for, like a goal, or in this case a score. An increase in score tells the system it has done a good job, and losing a life tells it it has done a bad job.

In addition to this, the system also takes pixel data from the screen. It processes this through a technique called a Convolutional Neural Network (CNN). CNNs are inspired by biological vision systems, specifically studies of the visual system in cats (and also monkeys), and build on the work of LeCun. The process extracts the features of an image with the aim of reducing the data size whilst maintaining those features. Below you can see an example of this working with the MNIST character database (taken from LeCun’s website).

[Image: LeNet-5 feature maps on the MNIST character database, from LeCun’s website]

As shown, this model has three layers; as the layers get smaller, less data is stored, but the features (which lose their recognisable pattern to us) are still present.
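For a feel of what such a network looks like in code, here is a minimal sketch using the keras R package. That package is my choice for illustration, not something the original project used (they used Theano-era tooling), and the layer sizes only roughly follow the DeepMind paper:

  library(keras)

  # Two convolutional layers shrink the image while keeping features,
  # then dense layers map those features to one value per control action
  model <- keras_model_sequential() %>%
    layer_conv_2d(filters = 16, kernel_size = c(8, 8), strides = c(4, 4),
                  activation = "relu", input_shape = c(84, 84, 4)) %>%
    layer_conv_2d(filters = 32, kernel_size = c(4, 4), strides = c(2, 2),
                  activation = "relu") %>%
    layer_flatten() %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 4)  # one output per action (4 is a toy number)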

So how does it all work?

So now we know the two underlying processes, how does it all work? The first thing to understand is that Q-Learning needs two things, the state and the action, represented as Q(s,a). The output from the CNN is the state, and the actions are the viable control actions (fire, move left, move right, etc.). Each state/action pair has a value, starting at 0.0000. Imagine a spreadsheet: the left column holds the possible states and the headings across the top are the actions.

Each time the system ticks (tries to do something) it checks whether the state exists; if it doesn’t, it adds the state to the table and sets all the actions to 0.000. It then tries a random action and updates the Q value for that state/action pair as discussed later. If the state does exist, it performs the action with the highest Q value. I would also guess it occasionally tries a random action instead of the one with the highest Q value, which avoids getting stuck in a rut (the exploration/exploitation trade-off, not discussed here).

The update rule for the Q values is based on a couple of things: firstly, was there a reward (a score increase) and how big was it; secondly, which series of states/actions helped achieve the reward. This last part is achieved through delayed rewards, as discussed in general Q-Learning theory. It can be seen in this equation from Wikipedia (sorry, fellow academics), and you can read how it works here.

Q(s,a) ← Q(s,a) + α [ r + γ · max_a′ Q(s′,a′) − Q(s,a) ]

where α is the learning rate and γ is the discount factor that handles the delayed rewards.
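To make the spreadsheet idea concrete, here is a minimal tabular Q-learning sketch in R. It is illustrative only: the action set, state names and parameter values are toy placeholders (in the real system the state is the processed CNN output):

  # The 'spreadsheet': one row per seen state, one column per action
  actions <- c("left", "right", "fire")
  Q <- matrix(0, nrow = 0, ncol = length(actions),
              dimnames = list(NULL, actions))

  # Pick the best-known action, with an occasional random try (exploration)
  choose_action <- function(Q, state, epsilon = 0.1) {
    if (!(state %in% rownames(Q)) || runif(1) < epsilon) {
      sample(colnames(Q), 1)
    } else {
      colnames(Q)[which.max(Q[state, ])]
    }
  }

  update_q <- function(Q, state, action, reward, next_state,
                       alpha = 0.1, gamma = 0.9) {
    # Unseen states get a new row with every action value set to 0.000
    for (s in unique(c(state, next_state))) {
      if (!(s %in% rownames(Q))) {
        Q <- rbind(Q, matrix(0, 1, ncol(Q),
                             dimnames = list(s, colnames(Q))))
      }
    }
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[state, action] <- Q[state, action] +
      alpha * (reward + gamma * max(Q[next_state, ]) - Q[state, action])
    Q
  }

  Q <- update_q(Q, state = "s1", action = "fire",
                reward = 1, next_state = "s2")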

Summary

So the system works, we know it works, and in short it is awesome. They have developed an agent which only needs to be given a few things (reward, controls, states), and from this it can learn to play games. Trust me, from a machine learning point of view it really is impressive, and it really does work.

The system has further applications; take the simple example of advertising promotions. The state is your browsing history/pattern, the action is which type of ad to show you, and the reward is obvious: did you click it? But the system can go beyond this; how about stock market trading?

Now the bad part they aren’t telling you. First, processing images through a CNN is computationally intensive; doing it with a CPU is too slow, but running it through a GPU is a lot better. Secondly, the system builds up its ‘knowledge’ through random actions, so it makes a lot of mistakes during training. It starts with 0.000 for every state and action and has to build that knowledge up; imagine training a system on live data, all those mistakes. Thirdly, training takes ages, and I mean ages. Think how many possible combinations of events can be happening on the screen; even when the CNN reduces the state space, it is still massive.

Want to have a go?

If you want to have a go I recommend a few things….

Read this paper by the people at DeepMind, titled ‘Playing Atari with Deep Reinforcement Learning’.

Have a look at the Arcade Learning Environment (ALE). It has papers and examples, and it is what DeepMind built their system with (Python and Java support).

You should also look at Theano for the convolutional neural network aspects. I recommend starting with the logistic regression tutorial first, though. And if you get as far as CNNs, enable CUDA if you have it.

Stella is a great Atari 2600 (VCS) emulator.

This chap (Kristjan Korjus) had a go at recreating the DeepMind project for his MSc and put the code on GitHub. Warning: I don’t know how far he got, BUT you can see how he interacts with ALE. A special big thanks to him for distributing the work via GitHub.

How do I know all of this?

Easy, I did my MSc project trying to reduce the costly nature of the training process.

Outlier removal in R using IQR rule

Outliers

In short, outliers can be a bit of a pain and have an impact on the results. Grubbs (1969) states an outlier “is an observation point that is distant from other observations”. They can usually be seen when we plot the data; below we can see one, maybe two, outliers in the density plot. 2.5 is a clear outlier and 2.0 may or may not be.

[Image: density plot with a clear outlier at 2.5]
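A plot like that takes a couple of lines in R; here it is with toy data and one outlier stuck on the end:

  x <- c(rnorm(100, mean = 1, sd = 0.25), 2.5)  # toy data plus an obvious outlier
  plot(density(x), main = "Density Plot")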


So, one of the ways we can identify outliers is through the use of the Interquartile Range Rule (IQR Rule). This sets a min and max value for the range based on the 1st and 3rd quartiles.

Step 1, get the Interquartile Range:

IQR = Q3 − Q1

Step 2, calculate the upper and lower values

Min = Q1 − 1.5 × IQR

Max = Q3 + 1.5 × IQR

Step 3, remove anything greater than max, or less than min.

Step 4, enjoy…!

Doing it in R

I get bored repeating processes over and over again, so I sort of automated it in R. Let’s have a look at the code…
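Something along these lines does the job (a sketch; the data frame mydata and column value are made-up names, swap in your own):

  # Remove outliers using the IQR rule
  remove_outliers_iqr <- function(data, column) {
    q1  <- quantile(data[[column]], 0.25)  # 1st quartile
    q3  <- quantile(data[[column]], 0.75)  # 3rd quartile
    iqr <- q3 - q1                         # Step 1: the interquartile range
    lower <- q1 - 1.5 * iqr                # Step 2: the min...
    upper <- q3 + 1.5 * iqr                # ...and max values
    # Step 3: keep only the rows inside the min/max range
    data[data[[column]] >= lower & data[[column]] <= upper, , drop = FALSE]
  }

  mydata  <- data.frame(value = c(rnorm(100, 1, 0.25), 2.5))  # toy data
  cleaned <- remove_outliers_iqr(mydata, "value")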

That is it, this script should remove all the outliers for you. Just be aware of the names of the datasets and make sure you spell the column names correctly.

64bit R and 32bit Access Problem [Fixed]

So I’ve had a little problem. I needed to open an Access database in RStudio. Sounds easy, right? Well, the problem was that I have MS Office 2007 along with the default ODBC drivers, which are 32-bit, while I’m using 64-bit R and RStudio. They don’t work well together.

I looked for the 64-bit driver and found it here: Microsoft Access Database Engine 2010 Redistributable.

The next problem is, it doesn’t look like you can install this whilst you have Office 2007 installed. I read you can uninstall Office 2007, install the new driver and then re-install Office 2007.

I didn’t try that; I simply ran the new installer from the command prompt with the /passive argument.

How to install without messing about…

  1. Download the driver above.
  2. Open up a command prompt: press Win+R, type cmd and hit Enter.
  3. Use cd to navigate to where you downloaded the driver.
  4. Type “AccessDatabaseEngine_x64.exe /passive” (without quotes) and hit Enter.
  5. Enjoy
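Once the driver is in, connecting from RStudio looks something like this (a sketch assuming the RODBC package; the database path and table name are made up):

  library(RODBC)
  con <- odbcConnectAccess2007("C:/data/mydatabase.accdb")  # needs the 64-bit ACE driver
  sqlTables(con)                     # list the tables in the database
  pupils <- sqlFetch(con, "Pupils")  # read one table into a data frame
  odbcClose(con)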


Linear Regression

Well, I’m back from a fantastic course at the University of Hull Scarborough campus titled Statistical Programming in R, and thought it was about time I shared a tutorial. So let’s have a look at Linear Regression; next we can look in more depth at Logistic Regression (and maybe Logistic Regression Classifiers).
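As a quick taster, fitting a straight line in R is a single function call (toy data):

  d   <- data.frame(x = 1:20, y = 2 * (1:20) + rnorm(20))  # toy data near y = 2x
  fit <- lm(y ~ x, data = d)  # fit the linear model y = a + b*x
  summary(fit)                # coefficients, R-squared, p-values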

Read more

No post December

Just realised I haven’t updated the blog this month, which seems odd.

Well, I’ve been pretty busy. I started the month with an interim project review/update meeting. It felt like a pretty big deal, but it was good to have my first paper drafted, which showed really good progress. The hardest part is balancing the ideas of multiple experts in the field who all think the project can go in different directions. In this review I had two professors, two senior academics and my very senior (and experienced) industrial supervisor.

Read more

Literature Review

The PhD thesis is apparently 80,000 words and approximately 10 chapters. Sounds a little daunting at first; I think I did 18,000 for my MSc project. Well, so far I’ve been plodding on with the literature review, mainly looking at Home Telehealth Monitoring trials. At first it started getting messy, with notes getting lost and me swapping backwards and forwards between papers.

Read more

PhD Life so far

Just thought I’d do a quick post to reflect on the PhD process so far. Everything is going well, I think. I meet my third supervisor next week. I might have to do a presentation; I’ll probably find out about 10 minutes before I do, so I think I’ll prepare one just in case. Nervous isn’t the best word to describe it, but I was a little <whatever it’s called> before; then I Googled the chap and now I’m <whatever it’s called> a lot. Not only a Professor of Cardiology but also one of the most senior cardiology people in the country.

Read more

R vs Matlab vs Python

Background

This post is planned to be an ongoing thought process. I used Matlab when doing my MSc in Intelligent Systems and Robotics at the De Montfort University Centre for Computational Intelligence. When I started using it I thought ‘wow, this is cutting edge’ and enjoyed using it (apart from the constant alt-tabbing between windows). So far I’ve used Matlab for:

  • Developing (and teaching) Fuzzy Logic, both GUI and code
  • Developing (and teaching) Artificial Neural Networks (Perceptron, Pattern Net, etc) using the KDD 1999 Network Data
  • Robotics Simulation (iRobot Create)

Read more

Data Linkage and Anonymisation Workshop

Just got back from a workshop held at the Turing Gateway to Mathematics in Cambridge. The event had a range of fantastic speakers discussing issues around data linkage and data privacy.

Data Linkage

Chris Dibben (University of Edinburgh) introduced the idea of linking different sets of data about subjects from multiple sources. He gave the example of how data is collected from pre-birth (pregnancy records) all the way through a person’s life until death (or just after). All of this data is stored in different locations with no unique identifier.

Read more
