Linear Regression from scratch in Python

Yong kang Chia
9 min read · Dec 10, 2021


It's so easy to call the Scikit-Learn library, or any other machine learning library, and use the linear regression model out of the box. But what if we want to understand what is going on behind the scenes when our model is optimized?

What is Linear Regression?

Linear regression is a supervised machine learning algorithm that deals with continuous data. It is a useful tool for predicting a quantitative response.

In linear regression, we are interested in whether we can model the relationship between two variables. The linear regression algorithm will try to model the relationship between these two variables as a straight line equation.

Equation of a straight line
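In case the equation image above does not render, the straight line hypothesis can be written in LaTeX notation as follows, using the same β₀ (intercept) and β₁ (gradient) constants that the learning algorithm will search for later in the article:

\hat{y} = \beta_0 + \beta_1 x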

In this article, I will be going through how we can build everything from scratch (i.e. no Scikit-Learn), with examples:

  • Data Preprocessing
  • Cost Function
  • Gradient Descent
  • Predictor Function
  • Evaluation Metrics
  • Polynomial Transformation and Regression
Overall Pipeline that we will be going through

Set-Up

In order to follow along with the tutorial, we first have to set up a Python environment and install a few packages.

You can do so as follows:

pip3 install virtualenv

# Create a virtual environment
virtualenv env

# Activate the virtual environment
# On Mac or Linux
source env/bin/activate
# On Windows
env\Scripts\activate

# Install the packages used in this tutorial (matplotlib and seaborn are imported later)
pip install jupyter pandas numpy matplotlib seaborn

# Open Jupyter Notebook
jupyter notebook

Jupyter Notebook should open up in your browser from the link shown in the terminal. Alternatively, you can go to http://localhost:8888/tree.

First, we are going to import our libraries and our dataset

# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Downloading the housing dataset
file_url = 'https://www.dropbox.com/s/jz8ck0obu9u1rng/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv?raw=1'
df = pd.read_csv(file_url)

1. Data Preprocessing

For data preprocessing, we have to:

  1. Check if there is a linear correlation with the dataset
  2. Normalize the Dataset
  3. Split into training and test dataset

1. Checking for linear correlation between the predictor variables and the target variable

Before we even begin using the linear regression model, we would have to see if the linear regression model is suitable. We can do this with a scatter plot.

In the code below, we read the dataset, select the resale prices from TAMPINES, and plot the relationship between the resale price and the floor area.

df_tampines = df.loc[df['town'] == 'TAMPINES',:]
sns.scatterplot(y='resale_price', x='floor_area_sqm', data=df_tampines)
Scatter plot of resale price against floor area for Tampines

Notice that the resale price increases as the floor area increases. So we can make a hypothesis by creating a straight line equation that predicts the resale price given the floor area data. The figure below shows the plot of a straight line and the existing data together.

2. Normalize Dataset

Many times, we will need to normalize the data, both the features and the target. The reason is that each column in the dataset may have a different scale. There are two different normalizations that we can use:

Z Normalization

Standardization or Z normalization

This is also called standardization. In this transformation, we move the mean of the data distribution to 0 and its standard deviation to 1. The equation is given as follows.
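In case the equation image does not render, the transformation written out in LaTeX notation (this is exactly what the normalize_z code below computes, column by column) is:

x' = \frac{x - \mu}{\sigma}

where μ is the column mean and σ is the column standard deviation.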

Code :

def normalize_z(dfin):
    # Shift each column's mean to 0 and scale its standard deviation to 1
    dfout = (dfin - dfin.mean(axis=0)) / dfin.std(axis=0)
    return dfout

Min-Max Normalization

Min-Max Normalization

In this transformation, we scale the data in such a way that the maximum value in the distribution is 1 and its minimum value is 0. We can scale it using the equation above.
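In case the equation image does not render, the min-max transformation written out in LaTeX notation (matching the normalize_minmax code below) is:

x' = \frac{x - x_{min}}{x_{max} - x_{min}}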

Code :

def normalize_minmax(dfin):
    # Scale each column so that its minimum becomes 0 and its maximum becomes 1
    dfout = (dfin - dfin.min(axis=0)) / (dfin.max(axis=0) - dfin.min(axis=0))
    return dfout

3. Splitting Dataset into training and test

One common pre-processing operation that we normally do in machine learning is to split the data into:
  • training dataset
  • test dataset

The idea behind splitting the dataset is simple: we should NOT verify the model with the same data we used to train it. Let's illustrate this using our HDB resale price dataset, which has 95,858 entries. In our machine learning process, we would like to use this data to do the following:

  1. Train the model using the dataset
  2. Verify the accuracy of the model

If we only have one dataset, we cannot verify the accuracy using the same data we used to train the model. This bias would obviously inflate the accuracy. The analogy is a teacher giving exactly the same questions in the exam as the ones used during the practice session.

To overcome this, we should split the data into two. One set is used to train the model while the other one is used to verify the model. Coming back to the analogy of the teacher giving the exam, if the teacher has a bank of questions, he or she should separate the questions into two. Some questions can be used for practice before the exam, while the rest can be used as exam questions.

One important note is that the split must be done randomly. This is to avoid systematic bias in the split of the dataset. For example, one may have entered the data according to the flat type, so flats with fewer rooms and smaller floor areas end up in the top rows while the larger flats are somewhere near the end. If we do not split the data randomly, we may not have any larger flats in our training set and would train only on the smaller flats. The same can happen with any other column, such as the block or the town area. A minimal sketch of such a random split is shown below.
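The full splitting code used in this article lives in the gists at the end, but a minimal sketch of a random train-test split could look like the following. The helper name split_data, the 70/30 ratio, and the random_state argument are illustrative choices, not taken from the original code:

def split_data(df_feature, df_target, random_state=None, test_size=0.3):
    # Choose a random subset of row labels to form the test set
    if random_state is not None:
        np.random.seed(random_state)
    indexes = df_feature.index
    k = int(test_size * len(indexes))
    test_indexes = np.random.choice(indexes, k, replace=False)
    train_indexes = indexes.difference(test_indexes)

    # Slice the feature and target DataFrames using those labels
    df_feature_train = df_feature.loc[train_indexes, :]
    df_feature_test = df_feature.loc[test_indexes, :]
    df_target_train = df_target.loc[train_indexes, :]
    df_target_test = df_target.loc[test_indexes, :]
    return df_feature_train, df_feature_test, df_target_train, df_target_test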

There are times in machine learning when we need to experiment with different parameters and find the optimum ones. In these cases, the dataset is usually split into three:

  • training dataset, which is used to build the model
  • validation dataset, which is used to evaluate the model for various parameters and to choose the optimum parameter
  • test dataset, which is used to evaluate the model built with the optimum parameter found previously

2. Cost Function

In order to find the values of β₀ and β₁, we will apply an optimization algorithm that minimizes the error. The error caused by the difference between our predicted value ŷ and the actual data y is captured in a cost function. Let’s find our cost function for this linear regression model.

We can get the error by taking the difference between the actual value and our hypothesis and squaring it. The square is to avoid cancellation between positive and negative differences, so the errors accumulate in absolute terms. For one particular data point i, we can write the squared error as follows.

absolute residual error

Assuming we have m data points, we can then sum over all data points to get the Residual Sum of Squares (RSS) of the errors.

Residual Sum Square (RSS)

We can then choose the following equation as our cost function.

Cost Function

The division by m is to get an average over all data points. The constant 2 in the denominator makes the derivative easier to calculate.
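In case the equation images above do not render, here they are written out in LaTeX notation (consistent with the compute_cost code below), where ŷᵢ is the prediction for data point i:

e_i^2 = (\hat{y}_i - y_i)^2

RSS = \sum_{i=1}^{m} (\hat{y}_i - y_i)^2

J(\beta_0, \beta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2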

The learning algorithm will then try to obtain the constants β₀ and β₁ that minimize the cost function.

Code for Cost Function

def compute_cost(X, y, beta):
    m = X.shape[0]
    y_pred = np.matmul(X, beta)
    error = y_pred - y

    # Matrix multiplication does both the squaring and the summing
    J = (1/(2*m)) * np.matmul(error.T, error)

    return J[0][0]  # Extract the scalar value from the 1 x 1 matrix

3. Gradient Descent

One of the algorithms that we can use to find the constants by minimizing the cost function is called gradient descent. The algorithm starts with some initial guess of the constants and uses the gradient of the cost function to decide where to go next in order to reach the bottom, or minimum, of the function. In this way, starting with some initial values of β₀ and β₁, we can calculate their next values using the following equation.

Updating the weights (beta) by moving in the direction that minimizes the cost function
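In case the equation image does not render, the update rule, in the vectorized form that the gradient_descent code below applies, is:

\beta := \beta - \frac{\alpha}{m} X^T (X\beta - y)

where α is the learning rate and m is the number of data points.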

Code:

# 1. Guess an initial beta
# 2. Use the gradient to decide which direction to move in
# 3. Update beta and repeat
def gradient_descent(X, y, beta, alpha, num_iters):
    m = X.shape[0]
    J_storage = np.zeros(num_iters)

    for i in range(num_iters):
        yp = np.matmul(X, beta)
        error = yp - y
        beta = beta - (alpha/m) * np.matmul(X.T, error)
        # Record the cost at every iteration so we can check that it decreases
        cost = compute_cost(X, y, beta)
        J_storage[i] = cost

    return beta, J_storage

4. Predictor Function

So after we have trained our model, what do we need to do?

We use the model to make predictions on the test dataset.

def predict(df_feature, beta):
    df_feature = normalize_z(df_feature)
    X = prepare_feature(df_feature)
    return predict_norm(X, beta)

def predict_norm(X, beta):
    return np.matmul(X, beta)
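Note that predict calls prepare_feature, which is defined in the full gist rather than in this article. A minimal sketch of what it needs to do, assuming the features arrive as a pandas DataFrame, is:

def prepare_feature(df_feature):
    # Convert the feature DataFrame into a numpy array and prepend a
    # column of ones so that beta_0 acts as the intercept term
    n = df_feature.shape[0]
    feature = df_feature.to_numpy().reshape(n, -1)
    ones = np.ones((n, 1))
    return np.concatenate((ones, feature), axis=1)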

5. Evaluation Metrics

After we build our model, we usually want to evaluate how good our model is. We use metrics to evaluate our model or hypothesis. To do this, we should split the data into two:

  • training data set
  • test data set

The training data set is used to build the model or the hypothesis. The test data set, on the other hand, is used to evaluate the model by computing some metrics.

Mean Square Error

One metric we can use here is called the mean squared error. The mean squared error is computed as follows.

MSE

where n is the number of predicted data points in the test data set, yi is the actual value in the test data set, and y^i is the predicted value obtained using the hypothesis and the independent variable xi in the test data set.
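In case the equation image does not render, the metric written out in LaTeX notation is:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2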

def mean_squared_error(target, pred):
    # target and pred are assumed to be n x 1 column vectors
    n = target.shape[0]
    error = target - pred
    mse = np.matmul(error.T, error) / n

    return mse

R2 Coefficient of Determination

Another metric is called the r² coefficient, or the coefficient of determination. It measures how well the model explains the variation in the data by comparing the residual sum of squares against the total sum of squares, where yᵢ is the actual target value, ŷᵢ is the predicted target value, ȳ is the mean of the actual target values, and n is the number of target values.
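Written out in LaTeX notation (matching the r2_score code that follows):

r^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \qquad SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2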

This coefficient tells you how close the data is to the fitted straight line. The closer the fit, the closer the value is to 1.0, which means there is a strong correlation between the independent variable and the dependent variable. When there is no correlation, the value will be close to 0. A negative value means the selected model does not follow the trend of the data, leading to a worse fit than a horizontal line.

Code

def r2_score(y, ypred):
    # Residual sum of squares and total sum of squares
    rss = np.sum((ypred - y) ** 2)
    tss = np.sum((y - y.mean()) ** 2)

    r2 = 1 - (rss / tss)

    return r2

6. Polynomial Transformation and Regression

There are times when, even though there is only one feature, we may want a hypothesis that is not a straight line. An example would be a model that is a quadratic equation. We can use multiple linear regression to create hypotheses beyond a straight line.

Recall that in multiple linear regression, the hypothesis is written as follows.
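In case the equation image does not render, the multiple linear regression hypothesis with n features can be written in LaTeX notation as:

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n

For a quadratic fit with a single original feature x, we simply take x₁ = x and x₂ = x².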

The code for multiple linear regression is the same as before; only the feature matrix changes, as sketched below.
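For example, a quadratic hypothesis can be obtained by adding a squared copy of the feature column and then reusing the exact same pipeline. The helper below is a minimal sketch; the function name transform_features and the column names are illustrative, not taken from the original gist:

def transform_features(df_feature, colname, colname_transformed, degree=2):
    # Add a new column containing the original feature raised to a power,
    # so that ordinary (multiple) linear regression can fit a polynomial curve
    df_feature = df_feature.copy()
    df_feature[colname_transformed] = df_feature[colname] ** degree
    return df_feature

# e.g. add a floor_area_sqm squared column before preparing the feature matrix
# df_features = transform_features(df_features, 'floor_area_sqm', 'floor_area_sqm2')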

Bringing it all together

So now that we have all the components of the linear regression model, it's time to apply it to our dataset.

Since pasting all of the finished code on Medium would make the article drag on too long, I have posted the complete code below as gists.

Overall Code Template:

Linear Regression model (also found on github gist)

Usage of the model with dataset
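In case the embedded gists do not render, here is a minimal sketch of how the functions defined in this article fit together on the Tampines data. The feature and target column choices, the learning rate, the iteration count, and the random_state are illustrative, and the sketch reuses the split_data and prepare_feature helpers sketched earlier rather than the exact gist code:

# Select feature and target columns (illustrative choice)
df_features = df_tampines[['floor_area_sqm']]
df_target = df_tampines[['resale_price']]

# Split into training and test sets using the split_data sketch above
df_features_train, df_features_test, df_target_train, df_target_test = \
    split_data(df_features, df_target, random_state=100, test_size=0.3)

# Normalize the training features and prepare the matrices
X = prepare_feature(normalize_z(df_features_train))
target = df_target_train.to_numpy()

# Run gradient descent (learning rate and iteration count are illustrative)
beta = np.zeros((2, 1))
beta, J_storage = gradient_descent(X, target, beta, alpha=0.01, num_iters=1500)

# Predict on the test set and evaluate the model
pred = predict(df_features_test, beta)
print(mean_squared_error(df_target_test.to_numpy(), pred))
print(r2_score(df_target_test.to_numpy(), pred))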

Credits to the SUTD Team as this article is based on their notes.
