Multiple linear regression

Now that we have seen simple linear regression in action, along with its maths, it’s time to move on to something a bit more advanced. In the real world, you will usually have more than one independent variable to deal with. When we use linear regression with multiple independent variables, we call it multiple linear regression.

Let’s see how we can use Python to do multiple linear regression and learn a few new things while we are at it.

Dataset

Continuing with our house prices example, I have added a couple of new columns (new features) for you: the number of bedrooms and the number of bathrooms. With these 2 new features added, we now have 3 features in total.

Our dataset is tiny. This initially helped me eyeball the data manually and make some predictions in my head. It also helped me understand the errors and how the algorithms work in the background.

Bedrooms  Bathrooms  Land size sqm   Price
       3          2            326  446000
       4          2            346  451500
       4          3            371  463500
       6          3            400  530000
       5          3            407  526000
       3          2            448  496000
       4          2            452  560000
       6          3            462  586000
       2          1            466  350000
       4          2            498  505000
       5          3            500  510000

Some new things

In our previous examples, I used all of the data to train the model and then 3 separate area values to do the predictions. Moving forward, we are going to stop being naughty and properly split our data into training and test sets, as discussed a while ago. In this case, we are going to reserve 20% of the data for the test set. 20% of 11 rows is 2.2, which gets rounded up, so the test set ends up with 3 rows out of the 11.
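To make the 3-out-of-11 arithmetic concrete, here is a rough pure-Python sketch of what a shuffled split does. The helper name simple_split is made up for illustration and is not how scikit-learn implements train_test_split. It only mimics the idea: shuffle the row numbers, round the test count up, and carve off that many rows.

```python
import math
import random


def simple_split(n_rows, test_size, seed=0):
    """Hypothetical helper: return (train_indices, test_indices)."""
    # round the test count up, so 20% of 11 rows (2.2) becomes 3
    n_test = math.ceil(n_rows * test_size)
    indices = list(range(n_rows))
    # shuffle with a fixed seed so the split is repeatable
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(n_rows=11, test_size=0.2)
print(len(train_idx), len(test_idx))  # prints: 8 3
```

The fixed seed plays the same role as random_state=0 in the real code: you get the same split every time you run it, although the rows picked here will not necessarily match the ones scikit-learn picks.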

We are also going to discover some fancy Python functions that help us understand the data a bit better. And, unlike before, we are going to load the data from a CSV file using pandas.

Now that we have more than one independent variable, some of them will have more sway over the price than others. This brings us to the idea of weights, also known as coefficients.

Lastly, we will get Python to spit out different types of errors between the actual prices and the predictions. Don’t worry about them too much yet; we will talk about them later.

You can copy the data and the code manually into Spyder or just head to the GitHub repo to download the CSV and the Python code if you please.

# load up the libraries we will need
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# load the dataset from the csv file
dataset = pd.read_csv('house-prices.csv')

# shows you the dimensionality of the data (rows, columns)
print(dataset.shape)

# spits out some interesting details about the data
print(dataset.describe())

# by convention, the independent variables are called X
# these are our features
X = dataset[['Bedrooms', 'Bathrooms', 'Land size sqm']]

# the dependent variable is called y
y = dataset['Price']

# lets split our data into training and test sets
# the size is given as a number between 0 and 1
# here we have reserved 0.2 (20%) for our test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# use regression on our training data
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

# get and show us the coefficients/weights
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  
print(coeff_df)

# do some predictions on the test dataset
y_pred = regressor.predict(X_test)

# show us the differences in actual and predicted prices
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(25)

# plot them in a bar chart for us please
df1.plot(kind='bar',figsize=(8,4))
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.show()

# show some fancy metrics
# Mean absolute error or MAE
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  

# Mean squared error or MSE
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  

# Root mean squared error or RMSE
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Outputs

Firstly, you will see (11, 4), which is the output from print(dataset.shape). The shape attribute tells us how many rows (11) and columns (4) there are in our dataset.

Next, you will see a table like the one below as a result of print(dataset.describe()). As you can see, this table gives us some nice stats about our data.

        Bedrooms  Bathrooms  Land size sqm          Price
count  11.000000  11.000000      11.000000      11.000000
mean    4.181818   2.363636     425.090909  493090.909091
std     1.250454   0.674200      59.303380   64256.057373
min     2.000000   1.000000     326.000000  350000.000000
25%     3.500000   2.000000     385.500000  457500.000000
50%     4.000000   2.000000     448.000000  505000.000000
75%     5.000000   3.000000     464.000000  528000.000000
max     6.000000   3.000000     500.000000  586000.000000
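If you are curious where these numbers come from, here is a small sketch that recomputes the Bedrooms column of the stats using only Python’s built-in statistics module. This is not what pandas does internally, just a hand check. It assumes Python 3.8 or newer (for statistics.quantiles), and the 'inclusive' method is my choice because it lines up with the quartiles in this table.

```python
import statistics

# the Bedrooms column from our dataset
bedrooms = [3, 4, 4, 6, 5, 3, 4, 6, 2, 4, 5]

count = len(bedrooms)
mean = statistics.mean(bedrooms)
# sample standard deviation (divides by n - 1)
std = statistics.stdev(bedrooms)
# the 'inclusive' method interpolates between the sorted data points
q25, q50, q75 = statistics.quantiles(bedrooms, n=4, method='inclusive')

print(count, round(mean, 6), round(std, 6))
print(min(bedrooms), q25, q50, q75, max(bedrooms))
```

Note that std here is the sample standard deviation, which is why it divides by 10 rather than 11.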

After the stats, you will see a small table with coefficients. This is the result of print(coeff_df). These are pretty cool: the algorithm has figured out the weight of each feature. Simplistically, each coefficient tells us that if I increase the value of that feature by one, while keeping the other features the same, this is how much it would add to the predicted price.

                Coefficient
Bedrooms       22568.132571
Bathrooms      38456.632683
Land size sqm    145.475886
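Here is a tiny sketch of that "increase by one" idea, using the coefficients printed above. The prediction is just the intercept plus each feature multiplied by its weight. The intercept is not part of the output shown here (scikit-learn keeps it in regressor.intercept_), so the 0.0 below is only a placeholder, and the example house and the predict_price helper are made up for illustration. The placeholder does not matter for the point being made, because the intercept cancels out when we compare two houses.

```python
# coefficients as printed by print(coeff_df)
coefficients = {
    'Bedrooms': 22568.132571,
    'Bathrooms': 38456.632683,
    'Land size sqm': 145.475886,
}
# ASSUMPTION: placeholder only, the real value lives in regressor.intercept_
intercept = 0.0


def predict_price(house):
    """Hypothetical helper: intercept plus the weighted sum of the features."""
    return intercept + sum(coefficients[name] * value for name, value in house.items())


# a made-up house, and the same house with one extra bedroom
house = {'Bedrooms': 4, 'Bathrooms': 2, 'Land size sqm': 450}
bigger_house = dict(house, Bedrooms=5)

difference = predict_price(bigger_house) - predict_price(house)
print(round(difference, 2))
```

The difference comes out equal to the Bedrooms coefficient, which is exactly what "weight" means here.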

Then you come to a bar chart showing the actual house prices next to the predictions for the 3 houses in the test set. This is Pyplot at work.

And finally, you will see the errors I was talking about earlier. You can see the lines responsible for these at the end of the code. Here, the simplest one is the Mean Absolute Error (MAE), which tells us how far off our predictions are from the actual prices on average. Different error measures are used in different scenarios to measure accuracy, but we will explore them more later.

Mean Absolute Error: 24213.23990858841
Mean Squared Error: 823755495.4591683
Root Mean Squared Error: 28701.141013192635
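If you want a feel for how these three numbers relate to each other before we get to them properly, here is a minimal sketch that computes them by hand in plain Python. The prices in it are made up for illustration and are not the 3 houses from our test set.

```python
import math

# ASSUMPTION: made-up prices for illustration, not the real test set
actual = [500000, 450000, 600000]
predicted = [510000, 430000, 630000]

# how far off each prediction is
errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)
# MSE: average of the squared errors
mse = sum(e ** 2 for e in errors) / len(errors)
# RMSE: square root of the MSE, back in the same units as the price
rmse = math.sqrt(mse)

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)

# the RMSE in our output is simply the square root of the MSE above it
print(math.sqrt(823755495.4591683))
```

Because the errors get squared, MSE punishes big misses much harder than MAE does, and RMSE brings that back to dollars so it is easier to read.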

Not too bad, was it? I hope you feel as good about being able to do this as I do. Try it out, and use lots of data once you get the hang of it. Conveniently, there are many sites that provide datasets. For instance, check out Kaggle.
