Well, I’m back from a fantastic course at the University of Hull Scarborough campus titled Statistical Programming in R, and thought it was about time I shared a tutorial. So let’s have a look at Linear Regression, then next we can look in more depth at Logistic Regression (and maybe Logistic Regression Classifiers).

For this we’ll be using a dataset from the UCI Machine Learning Repository (also see: all data sets). Since I’m looking at Heart Failures we might as well use the Heart Disease set.

### Load the data

As shown, there are a couple of ways to load the data. We’ll also add the column names (info).

    # Clear the workspace
    rm(list = ls())

    # -----------------
    # Load the data
    # -----------------
    # The file has no header row, so header = FALSE keeps the first record as data
    ds <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
                   header = FALSE)
    # OR pick a local copy of the file instead
    # ds <- read.csv(file.choose(), header = FALSE)

    # -----------------
    # Add the column names
    col_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                   "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
    colnames(ds) <- col_names
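A quick aside on the read options. The UCI file has no header row and, as far as I can tell, marks missing values with `?`. Below is a minimal sketch on a two-row inline sample (made-up values in the same 14-column layout, not the real file) showing what `header = FALSE` and `na.strings = "?"` do. The name `demo` is just for illustration.

```r
# Two made-up rows in the same 14-column layout as the data file
# (illustrative values only, not taken from the real data)
sample_txt <- "60,1,2,140,250,0,2,155,0,1.5,2,0,3,0
52,0,3,130,210,0,0,160,1,0.0,1,?,3,1"

col_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
               "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# header = FALSE keeps the first row as data;
# na.strings = "?" turns the "?" placeholder into NA
demo <- read.csv(text = sample_txt, header = FALSE,
                 col.names = col_names, na.strings = "?")

nrow(demo)           # both rows are kept
is.na(demo$ca)       # the "?" became NA
is.numeric(demo$ca)  # so the column stays numeric
```

Whether to convert `?` to `NA` at load time is a judgment call: it makes every column numeric, but functions like `density()` then need the `NA`s handled first.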

### Inspect the data

    # -----------------
    # lets inspect the data
    # to see what we have
    # -----------------

    # See the structure
    str(ds)

    # plot everything
    plot(ds)

    # we have 14 variables but only 12 are numeric
    # so lets make enough space for subplots
    par(mfrow = c(3, 4))

    # loop through the data and
    # plot the density if numeric
    for (i in names(ds)) {
      if (is.numeric(ds[, i])) {
        plot(density(ds[, i]), main = i)
      }
    }
    # -----------------

Here is what the density plot should look like.
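Rather than counting the numeric columns by hand, the grid size can be worked out from the data. This is a rough sketch on a toy data frame (`toy` and its values are hypothetical stand-ins for `ds`), using `n2mfrow()` from grDevices to suggest a layout:

```r
# Toy data frame standing in for ds (hypothetical values)
toy <- data.frame(
  age  = c(63, 52, 41, 58),
  chol = c(233, 210, 198, 250),
  thal = c("6.0", "3.0", "?", "7.0"),  # non-numeric because of the "?"
  stringsAsFactors = FALSE
)

# Which columns are numeric, and how many are there?
is_num <- sapply(toy, is.numeric)
n_num  <- sum(is_num)

# n2mfrow() suggests a rows x columns layout for n_num plots
layout_dims <- grDevices::n2mfrow(n_num)

# par() returns the previous settings, so they can be restored afterwards
old_par <- par(mfrow = layout_dims)
for (col in names(toy)[is_num]) {
  plot(density(toy[[col]]), main = col)
}
par(old_par)
```

Restoring `par()` at the end means later plots are not squeezed into the same grid.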

    # Same again, but this time lets plot Histogram
    for (i in names(ds)) {
      if (is.numeric(ds[, i])) {
        hist(ds[, i], main = i)
      }
    }

Giving us this…

I should really be looking at what the inputs are, but this is just a quick demo. My first thoughts: **oldpeak** looks right-skewed (most values near zero with a long tail to the right), which we can maybe fix with a log transform. exang, slope, num, restecg, fbs, sex and cp look more like factors, so I’ll just leave them.
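To check the direction of skew with a number rather than by eye, here is a small sketch of the moment-based skewness coefficient. The helper `skewness_moment()` is my own hypothetical function (base R does not ship one), and the vector is made up:

```r
# Moment-based skewness: third central moment divided by
# the 1.5th power of the second central moment
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up values: mostly small, with a few large ones
x <- c(0, 0, 0, 0.5, 1, 1.5, 3, 6)

sk <- skewness_moment(x)
sk > 0  # positive: the longer tail is on the right
```

A positive value means a longer right tail and a negative value a longer left tail; a log-type transform is the usual candidate for the positive case.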

### Log Transforms

    # -----------------
    # Log Transforms
    # -----------------

    # Area, 2 plots (before and after)
    par(mfrow = c(1, 2))
    hist(ds$oldpeak)
    # oldpeak contains zeros and log(0) is -Inf,
    # so use log1p(), i.e. log(1 + x)
    hist(log1p(ds$oldpeak))

    # Looks good so let's overwrite the originals
    ds$oldpeak <- log1p(ds$oldpeak)
    # -----------------
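A log transform needs some care when a column contains zeros, because `log(0)` is `-Inf`. Here is a small sketch on made-up values (not the real oldpeak column) comparing `log()` with `log1p()`, which computes `log(1 + x)`:

```r
# Made-up values including zeros
x <- c(0, 0, 0.5, 1.2, 2.3, 6.2)

plain   <- log(x)    # zeros become -Inf
shifted <- log1p(x)  # log(1 + x): zeros map to 0

sum(is.infinite(plain))   # how many values were lost to -Inf
all(is.finite(shifted))   # every value is still usable
```

As far as I know, `hist()` quietly drops non-finite values, so a histogram of `log(x)` can look fine while hiding the rows that turned into `-Inf`.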

So the data is ready (ish).

# Work in progress 🙂

Sorry all, PhD supervisor meeting. I’ll finish either today or Thurs/Fri.