So linear regression seem to be a nice place to start which should lead nicely on to logistic regression. I’ve been given some tutorials/files to work through written for R, well based on my previous post (R vs Matlab vs Python) I decided to have a go at creating a Python version.

You can view the finished file here (iPython Notebook Version).

### NumPy or Pandas?

This was a big question to start with, I’ve done a bit with NumPy which means I’ve a little more experienced at handling NumPy data structures. However, Pandas seems to be getting more popular, I was only just reading for financial forecasting systems/developments are using it because of its time-series functionality. So Pandas it is, my moto: ‘always do it the hard way (because you’ll probably learn something new)’.

What is Linear Regression?

Have a look at this, I found it and skimmed through it. The headings and formula’s look good but let me know if its not.

### Let’s do this

First I just grabbed some alligator data, seem like a nice standardised example for linear regression. House prices seem pretty standard too, Justin Duke did a nice example of this (I even borrowed and modified his y axis calculation script, *mine is better*).

### The code

Import the bits we need in the script.

1 2 3 4 |
# Import what you need import numpy as np import pandas as pd import matplotlib.pyplot as plt |

Load the data into a Pandas DataFrame, its kind of an extended NumPy data structure. Also make a copy using the Log values.

1 2 3 4 5 6 7 8 |
# Set the data data = pd.DataFrame ({ 'length' : [94,74,147,58,86,94,63,86,69,72,128,85,82,86,88,72,74,61,90,89,68,76,114,90,78], 'weight' : [130,51,640,28,80,110,33,90,36,38,366,84,80,83,70,61,54,44,106,84,39,42,197,102,57] }) # create another data frame of log values data_log = np.log(data) |

Next we need to build the models, this uses np.polyfit and results in two coefficients. These are essentially a multiple and a bias.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# ======================== # Model for Original Data # ======================== # Get the linear models lm_original = np.polyfit(data.length, data.weight, 1) # calculate the y values based on the co-efficients from the model r_x, r_y = zip(*((i, i*lm_original[0] + lm_original[1]) for i in data.length)) # Put in to a data frame, to keep is all nice lm_original_plot = pd.DataFrame({ 'length' : r_x, 'weight' : r_y }) |

Below is the line I borrowed and modified from Justin Duke.

1 |
r_x, r_y = zip(*((i, i*lm_original[0] + lm_original[1]) for i in data.length)) |

It takes each x value (length) and calculates the y value (weight) based on the coefficients. It then returns the x value as r_x and the calculated y value as r_y. As it loops through the x length values (for i in data.length) it calculates the weight values (r_y) as shown…

r_x, r_y = (i, i x coefficient[0] + coefficient[1])

Where *i* is the current length.

We then do this log transformed values…

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
# ======================== # Model for Log Data # ======================== # Get the linear models lm_log = np.polyfit(data_log.length, data_log.weight, 1) # calculate the y values based on the co-efficients from the model r_x, r_y = zip(*((i, i*lm_log[0] + lm_log[1]) for i in data_log.length)) # Put in to a data frame, to keep is all nice lm_log_plot = pd.DataFrame({ 'length' : r_x, 'weight' : r_y }) |

### The Results

Finally we just plot the outputs. I did this using a subplot so both plots are on the same figure.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
# ======================== # Plot the data # ======================== fig, axes = plt.subplots(nrows=1, ncols=2) # Plot the original data and model data.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[0], title='Original Values') lm_original_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[0]) # Plot the log transformed data and model data_log.plot(kind='scatter', color='Blue', x='length', y='weight', ax=axes[1], title='Log Values') lm_log_plot.plot(kind='line', color='Red', x='length', y='weight', ax=axes[1]) plt.show() |

Note, don’t forget that the plots won’t show if you don’t include this line somewhere.

1 2 |
# Needed to show the plots inline %matplotlib inline |

I have Spyder installed so it just opens up my graphs in a window, pretty cool.

Finally you are left with the graph showing two different linear regression models.