Hello and welcome to another R Stats adventure (using the Stampy Longnose Voice). Today we’re looking at Imputation, or the guessing/estimation of missing values in data. Be warned, once you impute data you bias your findings. **Motivation:** in my main data set I started with 6,000 records, reduced to 800 (selection criteria) and of these 400 had missing values. So my 6,000 record data set become only worth 400.

So lets get started. We will be using the UCI Hepatitis data set which can be found here. You only need the *.data file. Please save this into a folder call **data**.

Lets load the data and add column names to the data frame.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
# Source: https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/ # Load the data data <- read.csv("data//hepatitis.data") # Set the col names colNames <- c( "Class", "Age", "Sex", "Steroid_YN", "Antivirals_YN", "Fatigue_YN", "Malaise_YN", "Anorexia_YN", "Liver_Big_YN", "Liver_Frim_YN", "Spleen_Palpable_YN", "Spiders_YN", "Ascites_YN", "Varices_YN", "Bilirubin", "ALK_Phosphate", "SGOT", "Albumin", "Protime", "Histology_YN" ) colnames(data) <- colNames names(data) |

Next we need to examine the data, we will be focusing on Bilirubin.

1 2 3 4 5 |
# See the structure str(data) # Lets look at the data.... plot(data$Bilirubin) |

The plot looks like this…

Notice the value **‘?’**, this shouldn’t be here. In this data set it will be how NAs or Missing Values have been recorded. We need to remove this character.

1 2 3 4 5 6 7 8 |
# Get the row id of the ones with ? idx <- which(data$Bilirubin == "?") # Remove these rows (there was 6) data <- data[-idx,] # Lets look at the data.... plot(data$Bilirubin) |

The plot still shows the ? but it has zero records. This is still problem because this shows the data was imported as a factor rather than numerical. This can be seen when looking at the structure again.

1 2 |
# Look at the structure again, focus on the Bilirubin line str(data) |

So lets convert it. And the conversion is not as simple as you’d think. You can not convert it straight to a number, this would only bring the lookup values to the factor. So you need to convert to character first then convert this to a number. Easily done in R.

1 2 |
# Convert Bilirubin to Numeric (but first character otherwise it messes up) data$Bilirubin <- as.numeric(as.character(data$Bilirubin)) |

Now lets have a look at the plots from the data. Note the first plot is the same code as the previous one, but because it is now numeric we get a scatter plot.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# Change the plot area size par(mfrow=c(2,2)) # Plot the data plot(data$Bilirubin) # Lets plot the density, looking if it looks normally distributed. plot(density(data$Bilirubin)) # It isn't, so lets look if the log transformed data is any better plot(density(log10(data$Bilirubin))) # It is, slightly. Might as well check a QQ Plot qqnorm(log10(data$Bilirubin)) qqline(log10(data$Bilirubin)) |

This should output this…

This plot shows use a couple of things. Top Right is a density plot showing the data is left skewed and not normally distributed. Bottom Left shows the same but with the data transformed using Log10 (normal for medical data). Bottom Right shows the QQ Plot. In short the data is not normally distributed (p-value < 0.05). Oh well at least we know.

So now lets copy the original data (as a backup), set 30% of them to NAs and put a marker is so we know which values are imputed. I picked 30% because my supervisor recommended a general rule of not imputing more that 30%, so this seemed like an extreme value to test.

1 2 3 4 5 6 7 8 9 10 11 12 |
# Copy the values for later data$Bilirubin_Original <- data$Bilirubin # Lets remove 10% of the values (148 so remove 15) # Create 15 random numbers (row id's) idx <- sample(1:length(data$Bilirubin), 45, replace=FALSE) # Set these rows to NA data$Bilirubin[idx] <- NA # Put a marker is so we know which ones are the test records data$TestRow <- 0 data$TestRow[idx] <- 1 |

**Really Important Note:** from this point on your plots and values will look different to mine. By using *sample()* we selected a random range.

We can plot the density using the code below, notice the extra bit na.rm=TRUE. In a density plot you need to tell it to ignore NAs.

1 2 3 4 |
# View the distriubitions plot(density(data$Bilirubin, na.rm=TRUE)) lines(density(data$Bilirubin_Original, na.rm=TRUE), col="red") legend("topright", c("Data with Missing", "Original Data"), col=c("black", "red"), lty=1, inset=0.02) |

The Result is show below. We can see that removing some values it effected the distribution of the data.

## Imputation Time

To start with, you will need to download and install the **Hmisc** package. (I should write a quick post about this for the noobs). Next you will need to call it…

1 2 |
# Load the library library(Hmisc) |

Next we are going to create two new columns one for Median Imputation (data$Bilirubin_Imp_Median) and another for Random Imputation (data$Bilirubin_Imp_Rand). There are populated using the impute() function as shown below…

1 2 3 4 5 6 |
data$Bilirubin_Imp_Median <- with(data, impute(Bilirubin, median)) data$Bilirubin_Imp_Rand <- with(data, impute(Bilirubin, 'random')) # View the imputed values for the ones we set as missing. data$Bilirubin_Imp_Median[data$TestRow == 1] data$Bilirubin_Imp_Rand[data$TestRow == 1] |

Note: all of the values in these columns have been imputed, we need to put the original values into these columns.

1 2 3 4 5 6 |
# Cool, the only problem now is that all th # data in Bilirubin_Imp_Median and Bilirubin_Imp_Rand # in imputed vales, so this will effect testing. # Fix, use the non-imputed values (replace). data$Bilirubin_Imp_Median[data$TestRow == 0] <- data$Bilirubin[data$TestRow == 0] data$Bilirubin_Imp_Rand[data$TestRow == 0] <- data$Bilirubin[data$TestRow == 0] |

Congratulations, you have now imputed the missing values using both Random and Median.

## See what has happened

We can now have a look at the distribution of the values and compare them to the original data.

1 2 3 4 5 6 7 8 9 10 11 |
# Plot Original (Black) plot(density(data$Bilirubin_Original, na.rm=TRUE), ylim=c(0,1), main="Density of Bilirubin Values (30% Imputed)") # Add Random (Blue) lines(density(data$Bilirubin_Imp_Rand, na.rm=TRUE), col="blue") # Add Median (Green) lines(density(data$Bilirubin_Imp_Median, na.rm=TRUE), col="green") # Add a legend legend("topright", c("Original Data", "Imputed - Rand", "Imputed - Median"), col=c("black", "blue", "green"), lty=1, inset=0.02) |

Remember, yours will look different to mine.

We can see that when the distributions of the values (with 30% imputed) are compared the median method shown a large change. However random imputation is very similar.

When using a Mann-Whitney / Wilcoxon test both have p-values < 0.05.

1 2 3 |
wilcox.test(data$Bilirubin_Original, data$Bilirubin) wilcox.test(data$Bilirubin_Original, data$Bilirubin_Imp_Rand) wilcox.test(data$Bilirubin_Original, data$Bilirubin_Imp_Median) |

## So what now?

This is just a starting point for imputing missing values. Next I plan to do some tutorials on other methods as I would say Random and Median are really not the best methods to use on real data.

I hope you enjoyed this, if so please follow me on Twitter @johnstamford for more updates.