A simple example

Let’s continue with our example of house prices and see whether we can make the case for machine learning. As an example, we will try to predict the price of a 475 sqm house.

Here is my sample data:

Land size (sqm)    Price
326                $446,000
346                $451,500
371                $463,500
400                $530,000
407                $526,000
448                $496,000
452                $560,000
462                $586,000
466                $350,000
498                $505,000
500                $510,000

If we look at the data above, we can see that something weird is going on in the 460 to 500 sqm range: the 466 sqm house sold for only $350,000, well below the smaller houses around it, so it’s hard to ballpark a price for 475 sqm. A more involved but still simple approach would be to calculate the price per square metre of each house and then use the median to figure out the price. Let’s see what that looks like.

Land size (sqm)    Price       Price per sqm    Median per sqm * sqm
326                $446,000    $1,368           $407,280
346                $451,500    $1,305           $432,267
371                $463,500    $1,249           $463,500
400                $530,000    $1,325           $499,730
407                $526,000    $1,292           $508,476
448                $496,000    $1,107           $559,698
452                $560,000    $1,239           $564,695
462                $586,000    $1,268           $577,189
466                $350,000    $751             $582,186
498                $505,000    $1,014           $622,164
500                $510,000    $1,020           $624,663

I have added a couple of new columns in there. Here is what I have done:

  • 3rd column shows the price per sqm
  • I took the median of the 3rd column, which is $1,249
  • Just for fun, I multiplied the (unrounded) median value by the 1st column to get a new price for each house, shown in the 4th column

As you can see from the 4th column above, the newly calculated prices are significantly off from the original sale prices in column 2, especially for the larger houses. Our median approach is not a good one.
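For anyone who would rather not do this in a spreadsheet, the median steps above can be sketched in a few lines of Python. This is only an illustration of the same arithmetic; the variable names (sales, price_per_sqm, median_rate) are my own and the figures are copied from the table.

```python
from statistics import median

# Sample data from the table: (land size in sqm, sale price in dollars)
sales = [
    (326, 446_000), (346, 451_500), (371, 463_500), (400, 530_000),
    (407, 526_000), (448, 496_000), (452, 560_000), (462, 586_000),
    (466, 350_000), (498, 505_000), (500, 510_000),
]

# 3rd column: price per sqm for each house
price_per_sqm = [price / sqm for sqm, price in sales]

# Median of the 3rd column (kept unrounded, as in the 4th column of the table)
median_rate = median(price_per_sqm)

# 4th column: median price per sqm multiplied by the land size
for (sqm, price), rate in zip(sales, price_per_sqm):
    estimate = median_rate * sqm
    print(f"{sqm:>4} sqm  sold ${price:>9,}  ${rate:>7,.0f}/sqm  estimate ${estimate:>9,.0f}")

print(f"Median price per sqm: ${median_rate:,.0f}")
```

Running it reproduces the columns above, which makes it easy to swap in your own sales data.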

Enter Linear Regression

I don’t really like Excel but it is plenty useful at times, so I quickly visualised my data in a chart. Here it is below.

The orange line is the original price and the blue line is the newly calculated price using our median price per sqm. Excel convinces me of its usefulness when it shows me the option to add a “Trendline”, which is the green line shown above. By default this trendline is drawn using linear regression, although that can be changed of course.

Even just looking at the chart, I can see that the green line seems more realistic than the blue line. According to the blue line (our median price per sqm method), the predicted price for our 475 sqm house is around $600k ($1,249 x 475 = $593,275 to be precise). The green line, however, tells a different story: around $500k.
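If you want to check the trendline figure without Excel, here is a minimal sketch of a straight-line fit in plain Python. It assumes the line is fitted with ordinary least squares against land size (as a scatter chart would do); a chart that treats the horizontal axis as evenly spaced categories may give slightly different numbers. The function name fit_line and the variable names are my own.

```python
# Land sizes (sqm) and sale prices (dollars) from the sample data
sizes = [326, 346, 371, 400, 407, 448, 452, 462, 466, 498, 500]
prices = [446_000, 451_500, 463_500, 530_000, 526_000, 496_000,
          560_000, 586_000, 350_000, 505_000, 510_000]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)

target_sqm = 475
trend_estimate = slope * target_sqm + intercept  # straight-line prediction
median_estimate = 1249 * target_sqm              # median price per sqm method

print(f"Trendline estimate: ${trend_estimate:,.0f}")
print(f"Median estimate:    ${median_estimate:,.0f}")
```

On this data the fitted line puts the 475 sqm house at roughly $507k, in line with what the green line shows, while the median method gives $593,275.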

It might be hard to believe but by using a trendline, we have just made use of machine learning for the first time! So pat yourself on the back, as did I!

Great but how does linear regression work?

In simple terms, the linear regression algorithm tries to draw the straight line that sits as close as possible to all of the real price points on the chart. You can see this by comparing the green and orange lines in the chart above. The vertical distance between a point on the orange line and the green line at the same land size is called the error, and the standard least-squares form of linear regression picks the line that makes the sum of the squared errors as small as possible. We will look into this in detail next.
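To make the idea of the error a little more concrete, here is a small sketch that adds up the squared errors for two candidate lines: a least-squares line and our earlier median price per sqm line. It assumes the usual least-squares definition of linear regression, and the helper name sum_squared_errors is my own.

```python
from statistics import median

sizes = [326, 346, 371, 400, 407, 448, 452, 462, 466, 498, 500]
prices = [446_000, 451_500, 463_500, 530_000, 526_000, 496_000,
          560_000, 586_000, 350_000, 505_000, 510_000]


def sum_squared_errors(slope, intercept, xs, ys):
    """Add up the squared gaps between real prices and the line's prices."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Least-squares line: slope = covariance(x, y) / variance(x)
n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x

# Median method: a line through zero whose slope is the median price per sqm
median_rate = median(p / s for s, p in zip(sizes, prices))

sse_fit = sum_squared_errors(slope, intercept, sizes, prices)
sse_median = sum_squared_errors(median_rate, 0, sizes, prices)
# Nudging the fitted slope away from its best value makes the total error grow
sse_nudged = sum_squared_errors(slope + 10, intercept, sizes, prices)

print(f"Least-squares line: {sse_fit:,.0f}")
print(f"Median line:        {sse_median:,.0f}")
print(f"Nudged line:        {sse_nudged:,.0f}")
```

The least-squares line should come out with the smallest total of the three, which is exactly what "minimising the error" means here.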
