In short, outliers can be a bit of a pain and can distort your results. Grubbs (1969) states an outlier *“is an observation point that is distant from other observations”*. They can usually be seen when we plot the data: below we can see 1, maybe 2, outliers in the density plot. 2.5 is a clear outlier and 2.0 may or may not be.

So, one of the ways that we can identify outliers is the Interquartile Range Rule (IQR Rule). This sets a min and max value for the acceptable range based on the 1st and 3rd quartiles.

Step 1, get the Interquartile Range (IQR = Q3 - Q1)

Step 2, calculate the upper and lower values: *max* = Q3 + 1.5 × IQR and *min* = Q1 - 1.5 × IQR

Step 3, remove anything greater than *max,* or less than *min*.

Step 4, enjoy…!
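
As a quick illustration, here are the steps done by hand on a small vector. The numbers are made up for this sketch (they are not the data behind the density plot), and the variable names `x`, `lower` and `upper` are just placeholders.

```r
# Made-up values for illustration only
x <- c(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.0, 2.5)

# Step 1: get the Interquartile Range (Q3 - Q1)
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- IQR(x)

# Step 2: calculate the upper and lower values
upper <- q3 + 1.5 * iqr
lower <- q1 - 1.5 * iqr

# Step 3: remove anything greater than upper or less than lower
idx <- which(x < lower | x > upper)
x_clean <- if (length(idx) > 0) x[-idx] else x

print(x_clean)
```

With these particular numbers the rule flags 2.5 and keeps 2.0, which shows how a borderline point can end up on either side depending on the rest of the data.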

## Doing it in R

I get bored repeating processes over and over again, so I sort of automated it in R. Let's have a look at the code…

```r
# ------------------------------
# Load the data however you want
# ------------------------------

# I called my data dsBase, here I copy the data to dsBase.iqr
# I wanted to keep a copy of the original data set
dsBase.iqr <- dsBase

# Create a variable/vector/collection of the column names you want to remove outliers on.
vars <- c("ColName1", "ColName2", "ColName3", "etc")

# Create a variable to store the row ids to be removed
Outliers <- c()

# Loop through the list of columns you specified
for (i in vars) {

  # Get the Min/Max values
  max <- quantile(dsBase.iqr[, i], 0.75, na.rm = TRUE) + (IQR(dsBase.iqr[, i], na.rm = TRUE) * 1.5)
  min <- quantile(dsBase.iqr[, i], 0.25, na.rm = TRUE) - (IQR(dsBase.iqr[, i], na.rm = TRUE) * 1.5)

  # Get the ids using which
  idx <- which(dsBase.iqr[, i] < min | dsBase.iqr[, i] > max)

  # Output the number of outliers in each variable
  print(paste(i, length(idx), sep = ": "))

  # Append the outliers list
  Outliers <- c(Outliers, idx)
}

# Sort and de-duplicate, since a row can be an outlier in more than one column
Outliers <- sort(unique(Outliers))

# Remove the outliers. Skip this when none were found, because
# indexing with an empty negative vector would drop every row
if (length(Outliers) > 0) {
  dsBase.iqr <- dsBase.iqr[-Outliers, ]
}
```

That is it, this script should remove all the outliers for you. Note that a row is dropped if it is an outlier in any of the columns you listed. Just be aware of the names of the datasets and make sure you spell the column names correctly.
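
If you find yourself reusing the script across datasets, one option is to wrap the same logic in a function. The sketch below is only one way to do it: the name `remove_iqr_outliers`, the `k` argument and the `toy` data frame are all hypothetical and made up for this example.

```r
# Hypothetical helper (not part of the original script): drops every row
# that falls outside the IQR fences in any of the given columns
remove_iqr_outliers <- function(data, vars, k = 1.5) {
  outliers <- integer(0)
  for (v in vars) {
    x <- data[[v]]
    q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
    fence <- k * (q[2] - q[1])
    outliers <- c(outliers, which(x < q[1] - fence | x > q[2] + fence))
  }
  outliers <- sort(unique(outliers))
  # Return the data untouched when nothing was flagged
  if (length(outliers) == 0) {
    return(data)
  }
  data[-outliers, , drop = FALSE]
}

# Made-up data: column a has one value beyond the upper fence, column b has none
toy <- data.frame(
  a = c(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.0, 2.5),
  b = c(10, 11, 12, 13, 14, 15, 16, 17)
)

cleaned   <- remove_iqr_outliers(toy, c("a", "b"))
untouched <- remove_iqr_outliers(toy, "b")

print(nrow(cleaned))
print(nrow(untouched))
```

Using `data[[v]]` rather than `data[, v]` means the sketch should also cope with tibbles, and the early return covers the case where no outliers are found.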

You’ve saved me a lot of time – and not a little heartache – with the above code. (I also downloaded your introductory guide as I’m bound to learn something useful from it as well.) Thanks again.

Thanks for the feedback, I am happy it is helping 🙂