My previous post ‘Outlier removal in R using IQR rule‘ has been one of the most visited posts on here. So now lets have a look at it in Python. This time we’ll be using Pandas and NumPy, along with the Titanic dataset. We will also do a little extra thing – log transform the data.

If you are really interested in identifying Outliers (or Novelty detection) I would recommend this paper and a good starting point….

Pimentel, M.A.F., Clifton, D.A., Clifton, L., Tarassenko, L., 2014. A review of novelty detection. Signal Processing.

This is just a quick example to get you started.

## Load the packages and the data

Here I have thrown in an extra couple of lines, pd.set_options are set to display only 10 rows but all columns. I always have this set up like this. We’re also using Seaborn (external link) for some box plots. I always import Seaborn and use set_style and set_context, this makes the Matplotlib plots look better (publication quality).

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns MAX_ROWS = 10 pd.set_option('display.max_rows', MAX_ROWS) pd.set_option('display.max_columns', 200) sns.set_style("whitegrid") sns.set_context("paper") |

Next we load the data, as always mine is in saved in a sub folder called ‘Data’…

1 2 3 4 |
df = pd.read_csv('Data/Titanic.csv') # View df |

We should see the full dataset, **note:** it has 891 rows and 12 columns.

## Visualise the data

We are going to work with the **Fares** variable, so lets have a look at it…

1 2 3 4 5 6 7 8 9 10 11 |
i = 'Fare' plt.figure(figsize=(10,8)) plt.subplot(211) plt.xlim(df[i].min(), df[i].max()*1.1) ax = df[i].plot(kind='kde') plt.subplot(212) plt.xlim(df[i].min(), df[i].max()*1.1) sns.boxplot(x=df[i]) |

**Note:** here I am setting i as ‘Fare’. I do this sometimes to save me having to change it in many places if I want to view other variables. We are also setting the x axis to a min and max values based on the min and max of the variable.

You should see this…

Here we have two plots, the density plot and the box plot. This is a good way to view the data as we can see in the density plot (top) that there is some data points in the tails but it is difficult to see, however it is clear in the box plot (thank you Seaborn).

## Transform the data

Next we are going to butcher the data, I use the work butcher because I’m going to get rid of lots of rows just for this demonstration.

1 2 3 4 5 6 7 8 |
# Remove any zeros (otherwise we get (-inf) df.loc[df.Fare == 0, 'Fare'] = np.nan # Drop NA df.dropna(inplace=True) # Log Transform df['Log_' + i] = np.log(df[i]) |

So what have we done?

- We set all zero values in Fares to NaN. A zero can cause a problem when using a log transform
- We drop all the rows with a NaN, this is a bit extreme in this example (change df.dropna(inplace=True) to df.dropna(subset=[‘Fare’], inplace=True) to keep more data)
- We create a new variable called ‘Log_’ + i where i is ‘Fare’, so the new variable is Log_Fare

Plot the values as before changing i to ‘Log_Fare’…

1 2 3 4 5 6 7 8 9 10 11 |
i = 'Log_Fare' plt.figure(figsize=(10,8)) plt.subplot(211) plt.xlim(df[i].min(), df[i].max()*1.1) ax = df[i].plot(kind='kde') plt.subplot(212) plt.xlim(df[i].min(), df[i].max()*1.1) sns.boxplot(x=df[i]) |

And we get this….

## Determine the Min and Max

Next we need to determine the min and max cuttoffs for detecting the outliers. As discussed here we do this…

Step 1, get the Interquartile Range

Step 2, calculate the upper and lower values

1 2 3 4 5 |
q75, q25 = np.percentile(df.Log_Fare.dropna(), [75 ,25]) iqr = q75 - q25 min = q25 - (iqr*1.5) max = q75 + (iqr*1.5) |

We can visualise this using similar code as shown above by adding plt.axvline.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
i = 'Log_Fare' plt.figure(figsize=(10,8)) plt.subplot(211) plt.xlim(df[i].min(), df[i].max()*1.1) plt.axvline(x=min) plt.axvline(x=max) ax = df[i].plot(kind='kde') plt.subplot(212) plt.xlim(df[i].min(), df[i].max()*1.1) sns.boxplot(x=df[i]) plt.axvline(x=min) plt.axvline(x=max) |

## Finishing touches

Now lets identify the outliers. First we set a new variable in the dataframe called ‘Outlier’ defaulted to 0, then is a row is outside this range we set it to 1. **Note:** i should still be ‘Log_Fare’

1 2 3 4 |
df['Outlier'] = 0 df.loc[df[i] < min, 'Outlier'] = 1 df.loc[df[i] > max, 'Outlier'] = 1 |

Now we can plot the original data and the data without the outliers in (Clean Data).

## Summary

A quick breakdown of what we have done. We load the data into Python, remove any rows which has missing data. We then used a log transform to transform the data (ideally to a more Gaussian distribution). The we determined a min and max value and used that to identify which values are Outliers.