Now that we have seen simple linear regression in action, along with its maths, it’s time to move on to something a bit more advanced. In the real world, you will usually have more than one independent variable to deal with. When we use linear regression with multiple independent variables, we call it multiple linear regression.
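To make that concrete, here is how the model could be written for three independent variables. The symbols are my own choice for illustration (they are not taken from the earlier posts): \hat{y} is the predicted value, x_1 to x_3 are the independent variables, b_0 is the intercept and b_1 to b_3 are the weights the algorithm has to work out.

$$
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3
$$

With only one independent variable, this collapses back to the straight line from simple linear regression, \hat{y} = b_0 + b_1 x_1.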
Let’s see how we can use Python to do multiple linear regression and learn a few new things while we are at it.
Dataset
Continuing with our house prices example, I have added a couple of new columns (new features) for you: the number of bedrooms and the number of bathrooms. With these 2 new features added, we now have 3 features in total.
Our dataset is tiny. This initially helped me eyeball the data manually and make some predictions in my head. It also helped me understand the errors and how the algorithms work in the background.
Bedrooms | Bathrooms | Land size sqm | Price |
3 | 2 | 326 | 446000 |
4 | 2 | 346 | 451500 |
4 | 3 | 371 | 463500 |
6 | 3 | 400 | 530000 |
5 | 3 | 407 | 526000 |
3 | 2 | 448 | 496000 |
4 | 2 | 452 | 560000 |
6 | 3 | 462 | 586000 |
2 | 1 | 466 | 350000 |
4 | 2 | 498 | 505000 |
5 | 3 | 500 | 510000 |
Some new things
In our previous examples, I used all of the data to train the model and then 3 separate area values to do the predictions. Moving forward, we are going to stop being naughty and properly split our data into training and test sets, as discussed a while ago. In this case, we are going to reserve 20% of the data for the test set. With 11 rows, 20% works out to 2.2 rows, which gets rounded up to 3 rows out of the 11.
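If you are wondering how 20% of 11 rows ends up as 3 rows, here is a minimal sketch of the arithmetic. It assumes the test count is rounded up and whatever is left goes to training, which is how scikit-learn's train_test_split treats a fractional test_size when no train size is given. The variable names are mine.

```python
import math

n_rows = 11        # rows in our house price dataset
test_size = 0.2    # fraction reserved for the test set

# Assumption: the test count is rounded up and the remainder is used
# for training, as train_test_split does for a fractional test_size.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print("test rows:", n_test)
print("training rows:", n_train)
```

So the model learns from 8 houses and is checked against the remaining 3.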
We are going to discover some fancy Python functions as well that help us understand the data a bit better. Also, unlike before, we are going to load the data from a CSV file and use pandas for it.
Now that we have more than one independent variable, some of them will have more sway in determining the price than others. This brings us to the idea of weights, also known as coefficients.
Lastly, we will get Python to spit out different types of errors between the actual prices and the predictions. Don’t worry about them too much yet, we will talk about them later.
You can copy the data and the code manually into Spyder or just head to the GitHub repo to download the CSV and the Python code if you please.
# load up the libraries we will need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# load the dataset from the csv file
dataset = pd.read_csv('house-prices.csv')

# shows you the dimensionality of the data (rows, columns)
print(dataset.shape)

# spits out some interesting details about the data
print(dataset.describe())

# by convention, the independent variables are called X
# these are our features
X = dataset[['Bedrooms', 'Bathrooms', 'Land size sqm']]

# the dependent variable is called y
y = dataset['Price']

# let's split our data into training and test sets
# the size is given as a number between 0 and 1
# here we have reserved 0.2 (20%) for our test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# use regression on our training data
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# get and show us the coefficients/weights
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
print(coeff_df)

# do some predictions on the test dataset
y_pred = regressor.predict(X_test)

# show us the differences in actual and predicted prices
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(25)

# plot them in a bar chart for us please
df1.plot(kind='bar', figsize=(8, 4))
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.show()

# show some fancy metrics
# Mean absolute error or MAE
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
# Mean squared error or MSE
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
# Root mean squared error or RMSE
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Outputs
Firstly, you will see (11, 4), which is the output from print(dataset.shape). This attribute tells us how many rows (11) and columns (4) there are in our dataset.
Next, you will see a table like the one below as a result of print(dataset.describe()). As you can see, this table gives us some nice stats about our data.
        Bedrooms  Bathrooms  Land size sqm          Price
count  11.000000  11.000000      11.000000      11.000000
mean    4.181818   2.363636     425.090909  493090.909091
std     1.250454   0.674200      59.303380   64256.057373
min     2.000000   1.000000     326.000000  350000.000000
25%     3.500000   2.000000     385.500000  457500.000000
50%     4.000000   2.000000     448.000000  505000.000000
75%     5.000000   3.000000     464.000000  528000.000000
max     6.000000   3.000000     500.000000  586000.000000
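If you want to convince yourself that there is no magic in that table, here is a rough sketch that recomputes a few of the rows for two of the columns using only Python's built-in statistics module. It is not how pandas does it internally; the lists are simply the Bedrooms and Price columns from our dataset typed in by hand. Note that the std row is the sample standard deviation (dividing by n - 1), which is what statistics.stdev gives us too.

```python
import statistics

# The Bedrooms and Price columns from our dataset, typed in by hand
bedrooms = [3, 4, 4, 6, 5, 3, 4, 6, 2, 4, 5]
prices = [446000, 451500, 463500, 530000, 526000, 496000,
          560000, 586000, 350000, 505000, 510000]

for name, values in [("Bedrooms", bedrooms), ("Price", prices)]:
    print(name)
    print("  count:", len(values))
    print("  mean: ", statistics.mean(values))
    print("  std:  ", statistics.stdev(values))   # sample std, divides by n - 1
    print("  min:  ", min(values))
    print("  50%:  ", statistics.median(values))  # the 50% row is the median
    print("  max:  ", max(values))
```

The numbers should line up with the Bedrooms and Price columns in the table above.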
After the stats, you will see a small table with coefficients. This is the result of print(coeff_df). These are pretty cool: the algorithm has figured out the weight of each feature. Simplistically, what each one tells us is this: if I increase the value of this feature by one and leave the other features as they are, this is how much it would add to the predicted price.
                Coefficient
Bedrooms       22568.132571
Bathrooms      38456.632683
Land size sqm    145.475886
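Here is a small sketch of what "increase a feature by one" means in practice. The coefficients are copied from the output above. The intercept is not printed anywhere in this post (the fitted model keeps it in regressor.intercept_), so I have used a placeholder of 0.0, which means the individual predictions here are not real prices. The difference between the two predictions does not depend on the intercept though, and that is the part we care about. The example house and the predict_price helper are made up for illustration.

```python
# Coefficients copied from the output above
coefficients = {
    "Bedrooms": 22568.132571,
    "Bathrooms": 38456.632683,
    "Land size sqm": 145.475886,
}

# Hypothetical placeholder: the real intercept lives in regressor.intercept_
# and is not shown in this post, so these predictions are not real prices.
INTERCEPT = 0.0


def predict_price(house):
    """Intercept plus each feature value multiplied by its coefficient."""
    return INTERCEPT + sum(coefficients[name] * value
                           for name, value in house.items())


# A made-up house, and the same house with one extra bedroom
house = {"Bedrooms": 3, "Bathrooms": 2, "Land size sqm": 400}
bigger_house = {**house, "Bedrooms": 4}

difference = predict_price(bigger_house) - predict_price(house)
print("Extra bedroom adds:", round(difference, 2))
```

The extra bedroom adds exactly the Bedrooms coefficient to the prediction, because nothing else about the house changed.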
Then you come to a pretty bar chart showing you the difference between the actual house prices and the predictions for the 3 houses in the test set. This is Matplotlib's pyplot at work.

And finally, you will see the errors I was talking about earlier. You can see the lines responsible for these at the end of the code. Here, the simplest one is the Mean Absolute Error (MAE), which tells us how far off our predictions are from the actual prices on average. Different error measures are used in different scenarios to measure accuracy, but we will explore them more later.
Mean Absolute Error: 24213.23990858841
Mean Squared Error: 823755495.4591683
Root Mean Squared Error: 28701.141013192635
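If you are curious what these three numbers actually are, here is a rough sketch that works them out by hand in plain Python. The actual and predicted values are made-up numbers chosen to keep the arithmetic easy; they are not the houses from our test set.

```python
import math

# Hypothetical numbers, chosen only to keep the arithmetic easy to follow
actual = [100, 200, 300]
predicted = [110, 190, 330]

# how far off each prediction is
errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)
# MSE: average of the squared errors
mse = sum(e ** 2 for e in errors) / len(errors)
# RMSE: square root of the MSE, back in the same units as the prices
rmse = math.sqrt(mse)

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
```

You can spot the same relationship in our real output: the Root Mean Squared Error of about 28701.14 is simply the square root of the Mean Squared Error, 823755495.4591683.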
Not too bad, was it? I hope you feel as good about being able to do this as I do. Try it out, and use lots of data once you get the hang of it. Conveniently, there are many sites that provide datasets. For instance, check out Kaggle.