
# Linear Regression using Pandas (Python)

Linear regression seems like a nice place to start, and it should lead nicely on to logistic regression. I’ve been given some tutorials/files to work through that were written for R, but based on my previous post (R vs Matlab vs Python) I decided to have a go at creating a Python version.

You can view the finished file here (iPython Notebook Version).

### NumPy or Pandas?

This was a big question to start with. I’ve done a bit with NumPy, which means I’m a little more experienced at handling NumPy data structures. However, Pandas seems to be getting more popular; I was only just reading that financial forecasting systems are built on it because of its time-series functionality. So Pandas it is. My motto: ‘always do it the hard way (because you’ll probably learn something new)’.

### What is Linear Regression?

Have a look at this introduction; I found it and skimmed through it. The headings and formulas look good, but let me know if it’s not accurate.

### Let’s do this

First I just grabbed some alligator data, which seems like a nice standardised example for linear regression. House prices are pretty standard too; Justin Duke did a nice example of this (I even borrowed and modified his y-axis calculation script, mine is better).

### The code

Import the bits we need in the script.
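The original snippet isn’t shown, but given the rest of the post (a Pandas DataFrame, np.polyfit, and matplotlib plots) the imports were presumably along these lines:

```python
import numpy as np                # np.polyfit for the model
import pandas as pd               # DataFrame to hold the data
import matplotlib.pyplot as plt   # plotting the results
```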

Load the data into a Pandas DataFrame, which is kind of an extended NumPy data structure. Also make a copy using the log values.

Next we need to build the models. This uses np.polyfit and results in two coefficients: essentially a slope (multiplier) and an intercept (bias).
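Fitting a first-degree polynomial with np.polyfit looks like this (the placeholder data is mine, not the post’s):

```python
import numpy as np
import pandas as pd

# Illustrative placeholder data, as above.
data = pd.DataFrame({
    'length': [1.2, 1.5, 1.9, 2.3, 2.8, 3.1, 3.6],
    'weight': [2.0, 3.4, 6.1, 10.2, 18.5, 24.0, 38.7],
})

# Degree-1 fit: returns the coefficients highest power first,
# i.e. [slope, intercept].
coefficients = np.polyfit(data.length, data.weight, 1)
slope, intercept = coefficients
```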

Below is the line I borrowed and modified from Justin Duke.

It takes each x value (length) and calculates the y value (weight) from the coefficients. It then returns the x values as r_x and the calculated y values as r_y. As it loops through the length values (for i in data.length) it calculates the weight values (r_y) as shown…

r_x, r_y = (i, i × slope + intercept)

Where i is the current length.
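I don’t have Justin Duke’s exact line to hand, but a sketch of what it does, runnable against the placeholder data from above, would be:

```python
import numpy as np
import pandas as pd

# Illustrative placeholder data, as above.
data = pd.DataFrame({
    'length': [1.2, 1.5, 1.9, 2.3, 2.8, 3.1, 3.6],
    'weight': [2.0, 3.4, 6.1, 10.2, 18.5, 24.0, 38.7],
})
coefficients = np.polyfit(data.length, data.weight, 1)

# For each length i, pair it with the fitted weight
# slope * i + intercept, then unzip into two sequences.
r_x, r_y = zip(*((i, coefficients[0] * i + coefficients[1])
                 for i in data.length))
```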

We then do the same for the log-transformed values…
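That is, repeat the fit and the line calculation on the log DataFrame (again using my placeholder values):

```python
import numpy as np
import pandas as pd

# Illustrative placeholder data, log-transformed as above.
data = pd.DataFrame({
    'length': [1.2, 1.5, 1.9, 2.3, 2.8, 3.1, 3.6],
    'weight': [2.0, 3.4, 6.1, 10.2, 18.5, 24.0, 38.7],
})
log_data = pd.DataFrame({
    'length': np.log(data.length),
    'weight': np.log(data.weight),
})

# Same two steps as before, on the log values.
log_coefficients = np.polyfit(log_data.length, log_data.weight, 1)
log_r_x, log_r_y = zip(*((i, log_coefficients[0] * i + log_coefficients[1])
                         for i in log_data.length))
```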

### The Results

Finally we just plot the outputs. I did this using a subplot so both plots are on the same figure.
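A sketch of that plotting step, assuming the `data`, `log_data`, and regression-line variables from the snippets above (titles and figure size are my choices):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative placeholder data and fits, as above.
data = pd.DataFrame({
    'length': [1.2, 1.5, 1.9, 2.3, 2.8, 3.1, 3.6],
    'weight': [2.0, 3.4, 6.1, 10.2, 18.5, 24.0, 38.7],
})
log_data = pd.DataFrame({
    'length': np.log(data.length),
    'weight': np.log(data.weight),
})
coefficients = np.polyfit(data.length, data.weight, 1)
r_x, r_y = zip(*((i, coefficients[0] * i + coefficients[1])
                 for i in data.length))
log_coefficients = np.polyfit(log_data.length, log_data.weight, 1)
log_r_x, log_r_y = zip(*((i, log_coefficients[0] * i + log_coefficients[1])
                         for i in log_data.length))

# Two subplots on one figure: raw data on the left, log data on the right.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(data.length, data.weight)
ax1.plot(r_x, r_y, 'r')
ax1.set_title('Raw values')

ax2.scatter(log_data.length, log_data.weight)
ax2.plot(log_r_x, log_r_y, 'r')
ax2.set_title('Log values')

plt.show()
```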

Note: don’t forget that the plots won’t show unless you call plt.show() somewhere.

I have Spyder installed so it just opens up my graphs in a window, pretty cool.

Finally you are left with a graph showing the two different linear regression models.