Linear regression is great for predicting continuous dependent variables, but what if we wanted to classify the data into categories or classes instead? When we come across this type of situation, we can look at logistic regression, which limits the output to between 0 and 1, helping us assign the inputs to specific classes.
Let’s continue to work with our house prices example. I kept thinking about how to utilise logistic regression with the house prices data so I came up with a rather simple example. All I want to determine is if the house will sell above $1.5m or not. So just 2 classes, 0 and 1 (this is called binary logistic regression).
For this example, I am using a bit more data than before. Here is a small chunk.
| Bedrooms | Bathrooms | Land size sqm | Price | Above1.5m |
You might be thinking, what’s hard about that, the price column is just there! Well, I decided to keep that column in there for pure convenience, we won’t, however, use it for our modelling.
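To make the relationship between the two columns concrete, here is a minimal sketch of how a label like Above1.5m could be derived from Price. The rows are made up for illustration and are not taken from house-prices.csv, and treating exactly $1.5m as "not above" is my assumption.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from house-prices.csv.
sample = pd.DataFrame({
    "Bedrooms": [3, 4, 2],
    "Bathrooms": [1, 2, 1],
    "Land size sqm": [450, 700, 300],
    "Price": [1_200_000, 1_850_000, 1_500_000],
})

# 1 when the price is strictly above $1.5m, otherwise 0 (assumed rule).
sample["Above1.5m"] = (sample["Price"] > 1_500_000).astype(int)

# The features leave out Price, so the answer cannot leak into the model.
features = sample[["Bedrooms", "Bathrooms", "Land size sqm"]]
labels = sample["Above1.5m"]

print(sample)
```

Leaving Price out of the features is the important part: if the model could see the price, the classification would be trivial.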
Some new things
We are using LogisticRegression(solver='lbfgs'). Don’t worry, a “solver” is just the algorithm used to fit the model. There are a few options; check out the scikit-learn documentation for the others. Looking at the documentation can be overwhelming at times, but it’s the best source of truth for the library in question.
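If you are curious how swapping the solver looks in practice, here is a rough sketch. It uses synthetic data from make_classification rather than our house prices, and the choice of the three solvers and max_iter=1000 is mine, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with three features (a stand-in, not the house data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

scores = {}
for solver in ("lbfgs", "liblinear", "newton-cg"):
    # Same model, different optimisation algorithm under the hood.
    model = LogisticRegression(solver=solver, max_iter=1000)
    model.fit(X_demo, y_demo)
    # score() returns the accuracy on the data passed in.
    scores[solver] = model.score(X_demo, y_demo)

for solver, accuracy in scores.items():
    print(solver, round(accuracy, 3))
```

On a small, well-behaved problem like this, the solvers should land on very similar results; the differences matter more for large datasets or particular penalty settings.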
Let’s briefly talk about the activation function. An activation function is basically like a switch: above a certain value it triggers, and below it stays off. In our case, logistic regression uses the sigmoid function, an S-shaped curve whose output ranges between 0 and 1 and can be read as the probability of belonging to the ‘1’ category. In simple terms, when that probability is above 0.5 (50%), the input is assigned to the ‘1’ category; otherwise it is pushed into the ‘0’ category.
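Here is a small sketch of that switch using nothing but the standard library. The helper names sigmoid and classify are my own, and the input z stands for the weighted sum of the features that the model computes internally.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a real number into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 when the sigmoid output is above the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+.1f}  probability={sigmoid(z):.3f}  class={classify(z)}")
```

Notice that z = 0 gives a probability of exactly 0.5, which does not count as "above" the threshold, so it lands in the ‘0’ category.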
I have also introduced the confusion matrix and some new metrics in there. We will discuss them in a bit of detail later.
    # load up the libraries we will need
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    # load the dataset from the csv file
    dataset = pd.read_csv('house-prices.csv')

    # print the first few lines of the data
    print(dataset.head())

    # shows you the dimensionality of the data (rows, columns)
    print(dataset.shape)

    # spits out some interesting details about the data
    print(dataset.describe())

    # we always call the independent variables X
    # these are our features
    X = dataset[['Bedrooms', 'Bathrooms', 'Land size sqm']]

    # the dependent variable is called y
    y = dataset['Above1.5m']

    # let's split our data into training and test sets
    # the size is given as a number between 0 and 1
    # here we have reserved 0.2 (20%) for our test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # instantiate the model (using the default parameters)
    regressor = LogisticRegression(solver='lbfgs')

    # fit the model with data
    regressor.fit(X_train, y_train)

    # let's make some predictions
    y_pred = regressor.predict(X_test)

    # show me the confusion matrix, takes the format
    # (TrueNegative, FalsePositive, FalseNegative, TruePositive)
    confusion_matrix = metrics.confusion_matrix(y_test, y_pred).ravel()
    print(confusion_matrix)

    # print out some metrics
    # how often is the prediction correct?
    print('Accuracy [ (TP + TN)/Total ] :', metrics.accuracy_score(y_test, y_pred))

    # what's the ratio of true positives to all the positives returned
    print('Precision [ TP/(TP + FP) ] :', metrics.precision_score(y_test, y_pred))

    # ratio of true positives to how many should have been true positives
    print('Recall [ TP/(TP + FN) ]:', metrics.recall_score(y_test, y_pred))

    # find and show me area under the curve - AUC - ROC
    y_pred_proba = regressor.predict_proba(X_test)[:, 1]
    false_positive_rate, true_positive_rate, _ = metrics.roc_curve(y_test, y_pred_proba)
    auc = metrics.roc_auc_score(y_test, y_pred_proba)
    plt.plot(false_positive_rate, true_positive_rate, label='auc=' + str(auc))
    plt.legend(loc=4)
    plt.show()
I have used print(dataset.head()) in there, which shows the first few rows of our data file. It is useful for quickly checking what’s in the file.
Next comes the shape of the dataset. This time it’s (215, 5), meaning it has 215 rows and 5 columns.
After the shape, you will see some familiar stats, as before, about the data in the CSV file.
Here the script also prints out the confusion matrix, which tells us the number of True Negatives, False Positives, False Negatives and True Positives, in that order. We will talk about it more later.
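To see how those four numbers come out of the confusion matrix, here is a tiny sketch. The labels below are hypothetical and hand-made so the counts are easy to check by eye; they are not our test set.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_hat = [0, 1, 1, 1, 0, 0, 1, 1]

# For binary labels, ravel() flattens the 2x2 matrix in the order
# (TrueNegative, FalsePositive, FalseNegative, TruePositive).
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_hat).ravel())

print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

The ordering is easy to mix up, so unpacking into named variables like this tends to be safer than indexing into the flattened array.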
Next, some new interesting metrics with values ranging between 0 and 1.
    Accuracy [ (TP + TN)/Total ] : 0.8372093023255814
    Precision [ TP/(TP + FP) ] : 0.8857142857142857
    Recall [ TP/(TP + FN) ]: 0.9117647058823529
For our case, this means ~84% of the time, the predictions are correct (accuracy). Precision tells us that out of the houses predicted to sell above $1.5m, ~89% were correct (so ~11% of them were not >$1.5m houses). Recall, on the other hand, tells us that out of all the >$1.5m houses, the model correctly categorised ~91% of them and failed to find ~9% of those houses.
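The formulas in the labels can be checked by hand. In the sketch below, the four counts are my own back-calculation from the three reported figures, assuming the test set has 43 rows (20% of 215); they are not copied from the script's printed confusion matrix.

```python
# Counts inferred from the reported metrics, assuming a 43-row test set.
# They are an assumption for illustration, not the script's actual output.
tp, fp, fn, tn = 31, 4, 3, 5

total = tp + fp + fn + tn

# How often is the prediction correct?
accuracy = (tp + tn) / total

# Of the houses predicted as >$1.5m, how many really were?
precision = tp / (tp + fp)

# Of the houses that really were >$1.5m, how many did we find?
recall = tp / (tp + fn)

print("Accuracy :", accuracy)
print("Precision:", precision)
print("Recall   :", recall)
```

Under that assumption, the model would have flagged 35 houses as >$1.5m, of which 4 were wrong, and missed 3 of the 34 houses that really were above the mark.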
In the end, we see the ROC (Receiver Operating Characteristic) curve, along with the area under the curve (AUC).
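Generally speaking, the more area below the blue line the better. An AUC of 1.0 means the model ranks every >$1.5m house above every other house, while 0.5 is no better than random guessing. In our case, it’s ~0.82 (82%), as shown by the AUC figure in the chart, which is pretty good. For now, that’s probably enough; we will talk more about this soon.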
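To get a feel for what the AUC number means, here is a small sketch with hand-made scores. The labels and probabilities are hypothetical; the point is only to show the two extremes and something in between.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels: two houses below the mark, two above.
y_true = [0, 0, 1, 1]

# Every positive gets a higher score than every negative.
perfect_scores = [0.1, 0.2, 0.8, 0.9]
# One positive is scored lower than one negative.
mixed_scores = [0.1, 0.4, 0.35, 0.8]
# Every positive gets a lower score than every negative.
reversed_scores = [0.9, 0.8, 0.2, 0.1]

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_mixed = roc_auc_score(y_true, mixed_scores)
auc_reversed = roc_auc_score(y_true, reversed_scores)

print("perfect :", auc_perfect)
print("mixed   :", auc_mixed)
print("reversed:", auc_reversed)
```

AUC only cares about how the scores rank the positives against the negatives, not about the scores themselves, which is why it is computed from predict_proba rather than from the 0/1 predictions.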