A Guide to using R – FREE eBook

I wrote a guide on using R and promised to release it once it had been approved. Anyways, here it is.

It focuses on the basics of R to get you started and includes…

  • The Basics
  • Entering Data
  • Selecting Data
  • Installing packages
  • Plots
  • Statistical Methods (basic)
  • Calling other scripts
  • Loops
  • If Statement
  • Creating a Function

Either click the huge download button or click Guide To Using R.

Please reference as…

Stamford, J. (2015). Guide to using R. 1st ed. [ebook] Hull. Available at: http://stamfordresearch.com/a-guide-to-using-r-free-ebook/ [Accessed 8 Sep. 2015].

 

Download Button

Basic Imputation in R

Impin' ain't easy

Hello and welcome to another R Stats adventure (using the Stampy Longnose Voice). Today we’re looking at Imputation, or the guessing/estimation of missing values in data. Be warned, once you impute data you bias your findings. Motivation: in my main data set I started with 6,000 records, reduced to 800 (selection criteria) and of these 400 had missing values. So my 6,000 record data set became worth only 400.
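
As a taste of the simplest approach, here is a minimal sketch of mean imputation in base R. The data frame df and its age column are made up for illustration, and mean imputation is only one option, not necessarily the method used in the full post. It also shows the bias warning in action: every filled-in value is identical, so the spread of the data shrinks.

```r
# Hypothetical toy data: 'age' has two missing values (NA)
df <- data.frame(id = 1:6, age = c(23, NA, 31, 27, NA, 39))

# Mean of the observed (non-missing) values only
age_mean <- mean(df$age, na.rm = TRUE)

# Mean imputation: keep observed values, fill each NA with the mean
df$age_imputed <- ifelse(is.na(df$age), age_mean, df$age)

print(df)
```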

Read more

R vs Matlab vs Python (My Answer)

My Head Hurts

So some time back I started an ongoing post trying to compare R, Matlab and Python. Well my answer is simple, there is no answer. And if you think differently you should ask yourself, “am I just being a fan-boy”?

If you want to read my previous post it is here, if not here is a quick summary of all 3…

Matlab

  • Proprietary meaning you will have to pay
  • Powerful
  • GUIs for stuff. Want to build an ANN? There’s a GUI. Fancy a blast with Fuzzy Logic? Guess what, there’s a GUI.
  • Lots of toolboxes (but you pay for them)
  • Has Simulink, cool for rapid experimenting. Plug in a webcam and you can be object tracking in minutes (maybe 100 minutes, but still minutes)
  • I hate how the windows go behind other windows (sorry, had to be said)
  • Plenty of webinars

R / RStudio

  • Not Proprietary, everything is free
  • Rapidly became HUGE, like it’s everywhere. Want a job in machine learning or to be taken seriously in statistics? Then you need R on your CV..! (Facebook, Microsoft, Google all want R)
  • No GUIs, now that is painful. It just makes everything that little bit harder to see what is going on.
  • Lots of toolboxes (but called packages) and they are all FREE
  • Too many toolboxes, yep also a curse. You always find a couple of toolboxes doing the same thing, which is best?
  • RStudio, makes it so much more user friendly than the standard R environment. Don’t even try R without RStudio, seriously just don’t..!
  • Why am I still using <- when I know = works?

Python (more SciKit Learn really)

  • Rapid development
  • Open source
  • Multiple environments (Spyder and Notebook are my favourite)
  • Grown in strength
  • I have to question, will it replace R? I don’t know, some people love it, others like R. We’ll have to see
  • Syntax like you’ve never seen before, seriously my tab key has worn down..!
  • Maybe getting over complex. You’ll need to get to grips with Pandas and NumPy. I found handling data formats a bit of a pain.
  • Matplotlib outputs look a little naff, maybe I needed to play with it more
  • Some good Deep Learning stuff out there (thinking of Theano)
  • Finally, Anaconda, you need this distribution of Python.

So that’s it, all 3 are good. It depends on what you want to use it for. My fan-boy opinion is currently R, looks good on the CV, has loads of packages and the graphs look nice. Also, sooo much support for everything you want to do.

Data Science ‘up North’

Data-science

So about 2 years ago I reached out to a fellow MSc student through Linkedin who was studying via Distance Learning. It turns out he was a chap based in San Francisco working at a big credit card company. His job title was ‘Data Scientist’ which I thought was novel. It turns out his job was developing methods and algorithms for detecting credit card thefts. We exchanged a couple of emails, general stuff about our plans after the course. He explained how the ‘data scientist’ role was starting to become big over there (USA). So I wondered, what is a data scientist? Why don’t we have any jobs like that around here?

What I found was these types of jobs are popular in London with the big financial companies and the emerging high tech companies, with salaries ranging upwards of £60k per annum. Sounds a lot to a northerner but with higher living costs and, as stated by Partridge (2002), “I guarantee you’ll either be mugged or not appreciated”, it raises the question, is it worth the move?

Slightly further afield, Cambridge also seems to have a high demand for Data Scientists as the city attracts more and more research arms of companies such as Philips and Microsoft. But as we move further up north the demand starts to drop, why?

My thought process?

Is it because we don’t have any businesses which will benefit from a data scientist?

I started thinking about this and it is true, we don’t have the big financial institutions or the large ‘high tech’ companies. But we do have some big companies. Just thinking in Hull alone we have BP, BAE SYSTEMS, Smith and Nephew, Reckitt Benckiser, ARCO and many more (I got bored thinking of more, sorry).  But then we add another question, how many data scientists does one company need?

But then I also thought back to a couple of companies I have worked for, a door manufacturer (£4 million turnover), a ship repair company (now with about an £8 million turnover). But neither of these companies would ever think about hiring a data scientist. But why? The answer is easy, they don’t know what a data scientist is or how they can help their business.

Let’s look at me, I have a degree in IT and 15 years’ experience working in IT, including a couple of times as an IT manager. I’ve spent time managing databases, developing software with the aim of improving the company from a technical perspective. I’ve even taught computing from school to degree level. But never hired, worked as or worked with a Data Scientist.

This leads on to my next question…

Is it because we don’t have the skill set (as a population)?

So my experience of education has made me question, what the hell are we teaching? Let’s look at Level 3 provision in Hull. We have three colleges and now every academy has a 6th form but no-one is teaching skills to become a data scientist. As a city, everything computing seems to be focused on making games, often quoting how much the game industry is worth.

What’s your point?

Advertising you teach Game Making is great, but it’s a little like advertising a course in Premiership Football Skills. You get a lot of hype among young people, you get a lot of applications for your course but very few actually end up playing for a premiership team. So the next argument is, “but we’re teaching people how to make their own games, you know all entrepreneurial and that!”. Good and I fully agree with that, but who is that good for? The chap who sets up the games company and a few of his/her mates? Look at Mojang, sold Minecraft for $2 billion, employs 50 people, Notch takes his $2 billion and buggers off to LA. So are games good for the individual or the local economy?

Let’s compare that with the potential of a Data Scientist in Hull. Firstly as shown in the illustration somewhere on this page, a data scientist is multi-disciplinary. You need the data analyst skills, the computer science skills and business/context awareness. Your job, go into a business and make a real change. Look at the data, investigate why things are happening, questioning all the time, looking for statistical relevance. Going a step further with predictive modelling, ‘if we change this, what will happen?’, even further with machine learning. Moving away from just storing data to actually using it. So what’s good for the economy? Well for a start companies will be able to understand their businesses in more depth, make them more efficient and possibly reduce costs. Companies in the local area will start catching up with the higher tech companies who are already employing data scientists.

My Simple Example (hopefully you’ll get it)

It felt like I ended up ranting, so I’ll finish with a little example I learnt from a school. In this school the teachers fill in a spreadsheet of pupils’ performance, they pass the spreadsheet to the deputy headteacher who transfers the data to a computer program. The computer system outputs each pupil’s target for that year, in other words what level they should be at. A couple of points…

  • None of the staff know how it works, it just works
  • The targets are fairly accurate with the pupils
  • The school can now track student progress more accurately
  • Intervention can be planned
  • Pupil performance is raised
  • To develop it a data scientist:
    • researched the domain
    • managed/manipulated the data
    • developed a statistically significant model
    • coded the software
    • tested it
    • supported the solution

Such a simple example but no-one is teaching the skills to be able to do this, whilst others are not seeing the need.

As promised a picture

DataScientist

Outlier removal in R using IQR rule

Outliers

In short, outliers can be a bit of a pain and have an impact on the results. Grubbs (1969) states an outlier “is an observation point that is distant from other observations”. They can usually be seen when we plot the data; below we can see 1, maybe 2, outliers in the density plot. 2.5 is a clear outlier and 2.0 may or may not be.

Density Plot (outlier)

 

So, one of the ways that we can identify outliers is through the use of the Interquartile Range Rule  (IQR Rule). This sets a min and max value for the range based on the 1st and 3rd quartile.

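Step 1, get the Interquartile Range (Q1 and Q3 are the 1st and 3rd quartiles)

IQR = Q3 - Q1

Step 2, calculate the upper and lower values (1.5 is the usual multiplier for the IQR Rule)

Min = Q1 - (1.5 × IQR)

Max = Q3 + (1.5 × IQR)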

Step 3, remove anything greater than max, or less than min.

Step 4, enjoy…!

Doing it in R

I get bored repeating processes over and over again, so I sort of automated it in R. Let’s have a look at the code…
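
The listing itself isn’t shown on this page, so what follows is a minimal sketch of the four steps rather than the original script. It assumes the usual 1.5 × IQR fences; the function name remove_outliers_iqr, the data frame df and the column name value are all made up for illustration.

```r
# Hypothetical helper: drop rows whose value in `column` falls outside
# the IQR fences. k = 1.5 is the conventional multiplier (an assumption).
remove_outliers_iqr <- function(data, column, k = 1.5) {
  x <- data[[column]]
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)  # 1st and 3rd quartiles
  iqr <- q[[2]] - q[[1]]                                 # Step 1: IQR = Q3 - Q1
  lower <- q[[1]] - k * iqr                              # Step 2: lower fence
  upper <- q[[2]] + k * iqr                              # Step 2: upper fence
  keep <- !is.na(x) & x >= lower & x <= upper            # Step 3: inside the fences
  data[keep, , drop = FALSE]                             # rows with NA are dropped too
}

# Toy data with one obvious outlier (2.5)
df <- data.frame(value = c(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.5))
clean <- remove_outliers_iqr(df, "value")
print(clean$value)
```

Passing the column name as a string keeps the helper reusable across data sets, which is where spelling the column names correctly matters.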

That is it, this script should remove all the outliers for you. Just be aware of the names of the datasets and make sure you spell the column names correctly.

64bit R and 32bit Access Problem [Fixed]

So I’ve had a little problem. I needed to open an Access Database in RStudio. Sounds easy, right? Well the problem was that I have MS Office 2007 along with the default ODBC drivers, these are 32bit. I’m also using R and RStudio 64bit. They don’t work well together.

I looked for the 64 bit driver and found it here: Microsoft Access Database Engine 2010 Redistributable.

The next problem is, it doesn’t look like you can install this whilst you have Office 2007 installed. I read you can uninstall Office 2007, install the new driver and then re-install Office 2007.

I didn’t try that, I simply ran the new installer from command prompt with the /passive argument.

How to install without messing about…

  1. Download the driver above.
  2. Open up command prompt: press Windows key + R, then type cmd.
  3. Use cd to navigate to where you downloaded the driver.
  4. Type “AccessDatabaseEngine_x64.exe /passive” (without quotes) and hit Enter.
  5. Enjoy
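
Once the 64bit driver is in place, connecting from 64bit R might look something like the sketch below. It assumes the RODBC package is installed and uses a made-up database path (C:/data/example.accdb); the helper access_conn_string is hypothetical too. The connection is only attempted if both the package and the file are actually there.

```r
# Hypothetical helper: build an ODBC connection string for the Access driver
access_conn_string <- function(db_path) {
  paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", db_path)
}

db_path <- "C:/data/example.accdb"  # made-up path, change to your own database
conn_string <- access_conn_string(db_path)

if (requireNamespace("RODBC", quietly = TRUE) && file.exists(db_path)) {
  channel <- RODBC::odbcDriverConnect(conn_string)  # open the connection
  print(RODBC::sqlTables(channel))                  # list the tables in the database
  RODBC::odbcClose(channel)                         # always close when done
} else {
  message("RODBC or the database file is not available; nothing to connect to.")
}
```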

 

Linear Regression

Well I’m back from a fantastic course at the University of Hull Scarborough campus titled Statistical Programming in R and thought it was about time I shared a tutorial. So let’s have a look at Linear Regression, then next we can look in more depth at Logistic Regression (and maybe Logistic Regression Classifiers).
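
For a quick flavour before the full tutorial, here is a minimal sketch of a simple linear regression using lm() and the cars data set that ships with R. The choice of data set and the new speeds used for prediction are my own illustration, not taken from the tutorial.

```r
# Fit stopping distance as a linear function of speed (built-in 'cars' data)
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line
print(coef(fit))

# Predict stopping distance for two new speeds
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(fit, newdata = new_speeds)
print(predicted)
```

summary(fit) gives the standard errors, p-values and R-squared if you want to see how good the fit is.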

Read more

No post December

Just realised I haven’t updated the blog this month. Which seems odd.

Well I’ve been pretty busy, started the month with an interim project review/update meeting. It felt like a pretty big deal but it was good to have my first paper drafted which showed really good progress. The hardest part is balancing the ideas of multiple experts in the field who all think the project can go in different ways. In this review I have two professors, two senior academics and my very senior (and experienced) industrial supervisor.

Read more

R vs Matlab vs Python

Background

This post is planned to be an ongoing thought process. I’ve used Matlab when doing my MSc in Intelligent Systems and Robotics at De Montfort University Centre for Computational Intelligence. When I started using it I thought ‘wow this is cutting edge’ and enjoyed using it (apart from constant alt-tab between windows). So far I’ve used Matlab for:

  • Developing (and teaching) Fuzzy Logic, both GUI and code
  • Developing (and teaching) Artificial Neural Networks (Perceptron, Pattern Net, etc) using the KDD 1999 Network Data
  • Robotics Simulation (iRobot Create)

Read more