Impin' ain't easy

Basic Imputation in R

Hello and welcome to another R Stats adventure (using the Stampy Longnose Voice). Today we’re looking at imputation: the guessing/estimation of missing values in data. Be warned, imputed values are only estimates, so once you impute data you risk biasing your findings. Motivation: in my main data set I started with 6,000 records, reduced these to 800 (selection criteria), and of these 400 had missing values. So my 6,000-record data set became worth only 400.

So let's get started. We will be using the UCI Hepatitis data set, which can be found here. You only need the *.data file. Please save this into a folder called data.

Let's load the data and add column names to the data frame.
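A minimal sketch of the loading step. The path data/hepatitis.data and the column names are my assumptions (the names follow the attribute list that ships with the data set, and only Bilirubin matters for this post). So the snippet runs even without the download, it falls back to two made-up rows in the same layout.

```r
# Column names are an assumption based on the data set's attribute list;
# rename them however you like, only "Bilirubin" is used later.
col_names <- c("Class", "Age", "Sex", "Steroid", "Antivirals", "Fatigue",
               "Malaise", "Anorexia", "LiverBig", "LiverFirm",
               "SpleenPalpable", "Spiders", "Ascites", "Varices",
               "Bilirubin", "AlkPhosphate", "SGOT", "Albumin",
               "Protime", "Histology")

path <- "data/hepatitis.data"  # assumed location of the downloaded file

if (file.exists(path)) {
  # The file has no header row. stringsAsFactors = TRUE reproduces the
  # factor import discussed in this post (R >= 4.0 defaults to FALSE).
  data <- read.csv(path, header = FALSE, stringsAsFactors = TRUE)
} else {
  # Fallback: two made-up rows, for illustration only
  toy <- "2,30,2,1,2,2,2,2,1,2,2,2,2,2,1.00,85,18,4.0,?,1
2,50,1,1,2,1,2,2,1,2,2,2,2,2,?,135,42,3.5,?,1"
  data <- read.csv(text = toy, header = FALSE, stringsAsFactors = TRUE)
}

names(data) <- col_names
head(data)
```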

Next we need to examine the data. We will be focusing on Bilirubin.
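Something along these lines does the examining. The values here are made up and simply stand in for the Bilirubin column as it comes out of the import (a factor with a ‘?’ level).

```r
# Made-up values standing in for data$Bilirubin as imported
bilirubin <- factor(c("1.00", "0.90", "?", "0.70", "2.30", "?", "1.00", "4.60"))

str(bilirubin)      # reports a Factor rather than num
summary(bilirubin)  # one count per level, including "?"
plot(bilirubin)     # plotting a factor draws a bar chart of the level counts
```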

The plot looks like this…

[Figure: Imputation NAs]

Notice the value ‘?’, which shouldn’t be here. In this data set it is how NAs (missing values) have been recorded, so we need to replace this character with a proper NA.
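A small sketch of that replacement, again on made-up values rather than the real column:

```r
# Made-up values standing in for data$Bilirubin
bilirubin <- factor(c("1.00", "0.90", "?", "0.70", "2.30", "?", "1.00", "4.60"))

# Overwrite every "?" entry with a real missing value
bilirubin[bilirubin == "?"] <- NA

summary(bilirubin)  # the "?" level is still listed, with a count of 0
levels(bilirubin)   # the level survives because the vector is still a factor
```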

The plot still shows the ? level, but it now has zero records. This is still a problem, because it shows the data was imported as a factor rather than as a number. You can see this by looking at the structure again.

So let's convert it. The conversion is not as simple as you’d think. You cannot convert a factor straight to a number, as this would only return the factor's internal level codes rather than the values themselves. So you need to convert to character first, then convert that to a number. Easily done in R.
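Here is the trap and the fix side by side, on a tiny made-up factor:

```r
# Made-up values; levels sort as "0.70", "0.90", "1.00", "2.30"
bilirubin <- factor(c("1.00", "0.90", NA, "0.70", "2.30"))

# Wrong: returns the position of each value in the level list
wrong <- as.numeric(bilirubin)
wrong  # 3 2 NA 1 4

# Right: go through character first to recover the actual numbers
right <- as.numeric(as.character(bilirubin))
right  # 1.0 0.9 NA 0.7 2.3
```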

Now let's have a look at the plots from the data. Note the first plot is the same code as the previous one, but because the column is now numeric we get a scatter plot.
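A sketch of a four-panel plot like this one. The data is simulated (a skewed stand-in for Bilirubin), and the Shapiro-Wilk test at the end is my assumption for the normality check, since any normality test that returns a p-value would fit here.

```r
set.seed(1)
# Simulated, skewed stand-in for the numeric Bilirubin column
bilirubin <- rlnorm(150, meanlog = 0, sdlog = 0.6)

par(mfrow = c(2, 2))                        # 2 x 2 grid of plots
plot(bilirubin, main = "Values")            # numeric input gives a scatter plot
plot(density(bilirubin, na.rm = TRUE), main = "Density")
plot(density(log10(bilirubin), na.rm = TRUE), main = "Density (log10)")
qqnorm(bilirubin)                           # QQ plot against a normal
qqline(bilirubin)
par(mfrow = c(1, 1))                        # reset the layout

# Assumed normality test: a small p-value suggests non-normal data
shapiro.test(bilirubin)
```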

This should output this…

[Figure: Imputation Test Plot]

This plot shows us a couple of things. Top Right is a density plot showing the data is right skewed (most values sit at the low end, with a long tail of high values) and not normally distributed. Bottom Left shows the same data transformed using Log10 (common for medical data). Bottom Right shows the QQ Plot. In short, the data is not normally distributed (p-value < 0.05). Oh well, at least we know.

So now let's copy the original data (as a backup), set 30% of the values to NA and add a marker so we know which values are imputed. I picked 30% because my supervisor recommended a general rule of not imputing more than 30%, so this seemed like an extreme value to test.
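One way to sketch this step. The column names Bilirubin_Original and Imputed are hypothetical, the ten values are made up, and set.seed() is my addition so the example is repeatable (leave it out to get a fresh random selection each run).

```r
set.seed(42)  # my addition, only for repeatability

# Made-up values standing in for the numeric Bilirubin column
data <- data.frame(Bilirubin = c(1.0, 0.9, 0.7, 2.3, 1.0, 4.6, 1.2, 0.8, 1.5, 0.6))

# Backup of the untouched values (hypothetical column name)
data$Bilirubin_Original <- data$Bilirubin

# Pick 30% of the rows at random
n_remove <- round(0.3 * nrow(data))
idx <- sample(nrow(data), n_remove)

# Marker column (hypothetical name): TRUE where a value will be imputed
data$Imputed <- FALSE
data$Imputed[idx] <- TRUE

# Blank out the selected values
data$Bilirubin[idx] <- NA
data
```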

Really Important Note: from this point on your plots and values will look different to mine. By using sample() we selected a random set of rows.

We can plot the density using the code below. Notice the extra bit, na.rm=TRUE. In a density plot you need to tell it to ignore NAs.
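A minimal example on made-up values with two NAs in them:

```r
# Made-up values with two missing entries
x <- c(1.0, 0.9, NA, 0.7, 2.3, NA, 1.0, 4.6, 1.2, 0.8)

# density(x) on its own stops with an error because of the NAs;
# na.rm = TRUE drops them before the estimate is made
d <- density(x, na.rm = TRUE)
plot(d, main = "Bilirubin with values removed")
```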

The result is shown below. We can see that removing some values has affected the distribution of the data.

[Figure: Imputation With Data Removed]

Imputation Time

To start with, you will need to download and install the Hmisc package. (I should write a quick post about this for the noobs.) Next you will need to load it…

Next we are going to create two new columns, one for Median Imputation (data$Bilirubin_Imp_Median) and another for Random Imputation (data$Bilirubin_Imp_Rand). These are populated using the impute() function as shown below…
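A sketch of the two impute() calls on a small made-up column. Wrapping the result in as.numeric() is my choice, so the new columns are plain numbers rather than Hmisc's "impute" class; set.seed() is also my addition.

```r
library(Hmisc)  # install once with install.packages("Hmisc")

set.seed(42)  # my addition; random imputation draws at random

# Made-up values with two missing entries
data <- data.frame(Bilirubin = c(1.0, 0.9, NA, 0.7, 2.3, NA, 1.0, 4.6, 1.2, 0.8))

# Median imputation: every NA becomes the median of the observed values
data$Bilirubin_Imp_Median <- as.numeric(impute(data$Bilirubin, fun = median))

# Random imputation: every NA becomes a randomly drawn observed value
data$Bilirubin_Imp_Rand <- as.numeric(impute(data$Bilirubin, fun = "random"))

data
```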

Note: all of the values in these columns have been imputed, so we need to put the original values back into these columns.

Congratulations, you have now imputed the missing values using both Random and Median.

See what has happened

We can now have a look at the distribution of the values and compare them to the original data.
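A rough sketch of such a comparison, overlaying the three densities on one plot. Everything here is simulated (a skewed stand-in for Bilirubin), and the colours and layout are my own choices.

```r
library(Hmisc)

set.seed(42)  # my addition, only for repeatability

# Simulated, skewed stand-in for the original Bilirubin values
original <- rlnorm(150, meanlog = 0, sdlog = 0.6)

# Remove 30% at random, then impute with both methods
missing <- original
missing[sample(length(missing), round(0.3 * length(missing)))] <- NA
imp_median <- as.numeric(impute(missing, fun = median))
imp_rand   <- as.numeric(impute(missing, fun = "random"))

# Density estimates for each version
d_orig <- density(original)
d_med  <- density(imp_median)
d_rand <- density(imp_rand)

# Shared y-axis so the tallest curve is not clipped
y_max <- max(d_orig$y, d_med$y, d_rand$y)
plot(d_orig, ylim = c(0, y_max), lwd = 2, main = "Original vs imputed")
lines(d_med, col = "red")
lines(d_rand, col = "blue")
legend("topright", legend = c("Original", "Median", "Random"),
       col = c("black", "red", "blue"), lty = 1)
```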

Remember, yours will look different to mine.

[Figure: Imputation Comparison]

We can see that when the distributions of the values (with 30% imputed) are compared, the median method shows a large change. However, the random imputation is very similar to the original.

When using a Mann-Whitney / Wilcoxon test both have p-values < 0.05.
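A sketch of how that test can be run with wilcox.test() from base R, comparing the original values with each imputed version. The data is simulated, so the p-values it prints will not match the ones quoted above.

```r
library(Hmisc)

set.seed(42)  # my addition, only for repeatability

# Simulated stand-in for Bilirubin, with 30% removed and then imputed
original <- rlnorm(150, meanlog = 0, sdlog = 0.6)
missing <- original
missing[sample(length(missing), round(0.3 * length(missing)))] <- NA
imp_median <- as.numeric(impute(missing, fun = median))
imp_rand   <- as.numeric(impute(missing, fun = "random"))

# Two-sample Mann-Whitney / Wilcoxon rank-sum tests
test_median <- wilcox.test(original, imp_median)
test_rand   <- wilcox.test(original, imp_rand)

test_median$p.value
test_rand$p.value
```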

So what now?

This is just a starting point for imputing missing values. Next I plan to do some tutorials on other methods as I would say Random and Median are really not the best methods to use on real data.

I hope you enjoyed this. If so, please follow me on Twitter @johnstamford for more updates.
