Now that you have seen linear regression in action using both Excel and Python, let’s try to figure out the maths behind it.
In simple terms, linear regression finds a linear relationship between variables. In our case, we used it to map out the relationship between the price (dependent variable) and the area of the house (independent variable). You can see this linear relationship in the upward-sloping trendline in the earlier charts. It basically says that more area means a higher price. This is all done with maths, some of which you might even recall from your school days.
Remember this slope formula?
y = mx + b
Yup, that’s right, that’s the formula of a line, and we are going to use it to find our best-fitting line! Here are its components.
- y is the dependent variable, the price
- m is the slope of the line
- x is the independent variable, the area
- b is the vertical offset from zero, called the y-intercept or bias
So given the above, the only known value right now is the x variable, the area. If we find out the m and b components for the above equation, we should be able to figure out the y, the price.
As a side note, you might come across the formula y = β0 + β1x. This is from the stats world and means the same thing: β0 represents the y-intercept and β1 represents the slope, sometimes referred to as the coefficient or slope coefficient.
Anyway, there are several approaches to figuring out the line. Let’s look at two.
Let’s start with the simplest approach. It’s really quick to do visually on a chart: you just draw a straight line from the point with the smallest x to the point with the biggest x. Something like this.
Now that we know what it looks like visually, we need to figure out a formula that we can use to predict the price, for any value of house area.
Finding the slope (m)
The simplest way to find m is as follows:
m = change in y / change in x = (y2 - y1) / (x2 - x1)
As you can see above, the formula measures the change along the x-axis and the y-axis, which means we need two values for both x and y. We should use the smallest and biggest x (area) values; fortunately, our data is already nicely sorted for us. So let’s try to figure out m with the first and the last house, which have the following values:
area = 326 sqm – price = $446,000
area = 500 sqm – price = $510,000
change in y = 510,000 – 446,000 = 64,000
change in x = 500 – 326 = 174
m = 64,000 / 174 ≈ 367.81 (truncated to two decimal places)
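The same slope calculation can be sketched in a few lines of Python. The function name `slope` is just an illustrative choice, and the two points are the first and last houses quoted above.

```python
def slope(x1, y1, x2, y2):
    """Slope of the straight line through two points: change in y over change in x."""
    return (y2 - y1) / (x2 - x1)

# First and last house from the post: (area in sqm, price in $).
m = slope(326, 446_000, 500, 510_000)
print(m)  # roughly 367.816; the post carries 367.81 forward
```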
Finding the y-intercept/bias (b)
The y-intercept is a fancy way of saying that we need to know how far the starting y is from zero. We can do this using the first house in our list and plug in the value of m in the line formula and work our way backwards.
So with the first house, x = 326 and y = 446,000.
446,000 = (367.81 * 326) + b
Using basic algebra, we get b = 446,000 - 119,906.06 = 326,093.94, which we can round to 326,094 for convenience. Putting it all together, here is what our line looks like.
y = 367.81x + 326094
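If you would rather let Python do the algebra, here is a minimal sketch. It rearranges y = mx + b into b = y - mx and plugs in the first house; the helper name `intercept` is made up for this example.

```python
def intercept(x, y, m):
    """Rearranging y = m*x + b gives b = y - m*x."""
    return y - m * x

# First house (326 sqm, $446,000) and the slope used in the post.
b = intercept(326, 446_000, 367.81)
print(round(b, 2))  # 326093.94
```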
Congratulations! Now that we know the values of m and b, we can plug in any value of x (house area) to find y (house price). Let’s predict the prices of the 475, 700 and 750 sqm houses.
475 sqm house
(367.81 * 475) + 326094 = $500,803.75
700 sqm house
(367.81 * 700) + 326094 = $583,561
750 sqm house
(367.81 * 750) + 326094 = $601,951.50
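Those three predictions are easy to reproduce in code. This is only a sketch: `predict_price` is a hypothetical helper, and the default m and b are the rounded values from the two-point approach above.

```python
def predict_price(area_sqm, m=367.81, b=326_094):
    """Apply y = m*x + b with the two-point slope and intercept."""
    return m * area_sqm + b

for area in (475, 700, 750):
    # Round to cents so floating-point noise does not clutter the output.
    print(f"{area} sqm -> ${round(predict_price(area), 2):,}")
```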
This is a good start but, as you saw in the chart initially, this approach takes only 2 data points into account. It also ignores outliers, so it would generally lead to significant errors between the real and predicted prices.
We need a better way to figure out a line that minimises the error across all the points, not just the first and last. We can do this in various ways, but let’s try a slightly more complex approach than the last one.
Least Squares Regression
As the name suggests, this method picks the line that makes the sum of the squared errors as small as possible, where an error is the vertical distance between a data point and the line. Here is the formula for calculating m with the least squares method.
Finding the slope again (m)
m = (n Σxy - Σx Σy) / (n Σx² - (Σx)²)
Here n is the number of data points; in our case, that is 11.
I have done the calculations for you, here they are.
Plugging all of those values in the formula gives us m = 273.32. Notice that it is significantly lower than our first value of m.
Finding the y-intercept/bias again (b)
As we are using all of the values, we will alter the original formula a little and use the mean. To do this, we take the sum of all x and the sum of all y and divide by the number of records. Something like this.
b = (Σy - m Σx) / n
Plugging the values in, we get b = 376,904.54, which we will round to 376,905. Notice that this is significantly higher than our previous value of 326,094.
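For anyone who wants to see the whole least squares calculation in one place, here is a minimal pure-Python sketch of the standard closed-form formulas for m and b. The function name `least_squares_fit` is an assumption, and the sample points are toy data that sit exactly on y = 2x + 1. They are NOT the 11 houses from this post, since the full dataset is not reproduced here.

```python
def least_squares_fit(xs, ys):
    """Return (m, b) for the line y = m*x + b that minimises the squared errors."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean of y minus m times mean of x.
    b = (sum_y - m * sum_x) / n
    return m, b

# Toy data only (not the housing data): every point lies on y = 2x + 1.
toy_x = [1, 2, 3, 4]
toy_y = [3, 5, 7, 9]
m, b = least_squares_fit(toy_x, toy_y)
print(m, b)  # 2.0 1.0
```

Feeding the real area and price columns into the same function should give values close to the m and b quoted above.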
Let’s test our new values of b and m with our first and last house.
(273.32 * 326) + 376905 = $466,007.32
(273.32 * 500) + 376905 = $513,565.00
Both predictions are higher than the actual prices ($446,000 and $510,000), but that is because the formula takes all of the houses into account, rather than just the first and last. This new model should result in lower errors across the board.
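To put a number on "a bit on the higher end", this small sketch compares the least squares predictions with the actual prices of the two houses we know. The helper `residual` is a made-up name, and only these two data points are used because the rest of the dataset is not listed in this post.

```python
M, B = 273.32, 376_905  # least squares slope and intercept from above

def residual(area_sqm, actual_price, m=M, b=B):
    """Predicted price minus actual price; positive means we over-predict."""
    return (m * area_sqm + b) - actual_price

# First and last house: (area in sqm, actual price in $).
for area, actual in [(326, 446_000), (500, 510_000)]:
    print(f"{area} sqm: off by ${round(residual(area, actual), 2):,}")
```

The two-point line has zero error on these two houses by construction, so this comparison alone cannot show the improvement; that only becomes visible when the errors are added up over all the houses.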
We can now try to predict the prices of 475, 700 and 750 sqm houses and see what happens.
475 sqm house
(273.32 * 475) + 376905 = $506,732
700 sqm house
(273.32 * 700) + 376905 = $568,229
750 sqm house
(273.32 * 750) + 376905 = $581,895
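As a quick side-by-side, the sketch below prints the predictions from both lines for the same three areas. The names `TWO_POINT`, `LEAST_SQUARES` and `predict` are illustrative, and the (m, b) pairs are the rounded values worked out in this post.

```python
TWO_POINT = (367.81, 326_094)      # (m, b) from the first/last house approach
LEAST_SQUARES = (273.32, 376_905)  # (m, b) from least squares

def predict(area_sqm, line):
    """Apply y = m*x + b for a given (m, b) pair."""
    m, b = line
    return m * area_sqm + b

for area in (475, 700, 750):
    simple = round(predict(area, TWO_POINT), 2)
    fitted = round(predict(area, LEAST_SQUARES), 2)
    print(f"{area} sqm: two-point ${simple:,} vs least squares ${fitted:,}")
```

The gap between the two lines grows as the area moves away from the range of the original data, which is where the steeper two-point slope has the most effect.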
You know what? Those are exactly the same as the ones we figured out using some Python code in the previous post! Here is what the line looks like.
I hope all this maths didn’t put you off. The point of this whole exercise was to show you that machine learning is no magic, it’s just maths.
Although linear regression is quite simplistic in nature and there are many other algorithms that can be used, it is a great starting point. If you have made it thus far, you have done a fantastic job!
So far, we have just used one independent variable (house area). Next, we will see how to deal with situations where we have more than one independent variable (more features) available to us.