Linear Regression: Detailed Explanations

Linear regression is an elementary but important topic in machine learning. Many real-life problems can be modeled effectively with linear regression. Before going into the details of linear regression, we first need to understand what regression means.

By regression, we mean estimating the value of a numeric dependent variable (output) from one or more numeric independent variables (inputs) in such a way that the error between the estimated values and the actual values is minimized over all the data points.

If the relationship between the input and output variables is linear in nature, it is called linear regression. By linear, we mean that the (input, output) data points can be fitted with a straight-line equation. For example, let us assume that the salary of an employee is linearly related to the number of hours worked. Here, the number of hours is the independent variable and the salary is the dependent variable. We wish to estimate the salary for a given number of hours worked. The straight-line equation can be written as

salary = c0 + c1 * hours

Here c1 is the multiplicative coefficient, also known as the slope of the straight line: the higher the slope, the faster the salary grows with hours worked. c0 is the bias or intercept coefficient, which can be considered the base salary when the number of hours worked is zero. This, however, is not a problem that machine learning deals with, because the coefficients can be calculated exactly using basic algebra.
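For instance, suppose (hypothetically) that 10 hours of work pays 700 and 20 hours pays 1200. Then basic algebra gives \( c_1 = \frac{1200 - 700}{20 - 10} = 50 \) and \( c_0 = 700 - 50 \times 10 = 200 \), with no learning required.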

Now let's take another example. We want to predict students' scores in an exam based on the number of hours studied. This type of problem cannot be solved with an exact mathematical equation, because for the same number of hours studied one student may get a higher score while another gets a lower score. So, plotting the scores produces a scatter plot as shown in the figure below. Still, the plot makes it evident that the trend of the scores depends linearly on the number of hours studied. If we can find a straight-line equation that captures this trend, we can predict the score for a number of hours studied that was not in the dataset. Let the straight-line equation describing the students' scores be represented as

score = c0 + c1 * hours

In a regression problem, we want to estimate the values of the coefficients [c0, c1]. Since the relationship is linear, this is called a linear regression problem. The value we get from the above equation is the predicted score; the actual score may be higher or lower than the predicted one. So, it is convenient to rewrite the above equation (abbreviating hours as hrs) as

ypred = c0 + c1 * hrs

Another way: \( y_{\text{pred}} = y_{\text{true}} + \varepsilon \)

where \( \varepsilon \) is the error term, which can be positive or negative. Our objective is to minimize this error term over the whole dataset, which can be expressed as

\( J = \frac{1}{n} \sum_{i=1}^{n}\left(y_{\text{pred},i} - y_{\text{true},i}\right)^2 \)

or, \( J = \frac{1}{n} \sum_{i=1}^{n}\left(c_0 + c_1\,\mathit{hrs}_i - y_{\text{true},i}\right)^2 \)

The above equation is known as the cost function. Here we take the difference between the predicted and actual scores for each of the n data points, square it to make it positive, and average over all n samples. In a linear regression problem the cost function is convex, so this is a convex optimization problem that can be solved with the gradient descent method.
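As a quick sketch, this cost function takes only a few lines of NumPy; the toy arrays and trial coefficients below are made up purely for illustration:

import numpy as np

def cost(c0, c1, hrs, ytrue):
    # Mean squared error J for the line ypred = c0 + c1 * hrs
    ypred = c0 + c1 * hrs
    return np.mean((ypred - ytrue) ** 2)

# Toy data: hours studied and exam scores (invented for the example)
hrs = np.array([1, 2, 3, 4, 5])
ytrue = np.array([52, 57, 61, 68, 71])
print(cost(48.0, 5.0, hrs, ytrue))   # J for the trial line 48 + 5 * hrs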

Gradient descent is an optimization technique. It uses an iterative approach to update the values of unknown parameters in the direction that decreases the cost, i.e., opposite to the gradient. From calculus, we know that at a minimum (or maximum) of a function, the first derivative is zero. The values of c0 and c1 can be determined using gradient descent as follows:

\( c_0^{(\text{updated})} = c_0^{(\text{previous})} - \alpha \frac{\partial J}{\partial c_0} \)

\( c_1^{(\text{updated})} = c_1^{(\text{previous})} - \alpha \frac{\partial J}{\partial c_1} \)

where \( \alpha \) is known as the learning rate. A smaller learning rate moves in smaller steps and thus takes longer to reach the minimum. On the other hand, a larger learning rate moves in bigger steps but may overshoot and miss the minimum, as shown in the figure below.
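The effect of the learning rate is easy to demonstrate on a one-parameter cost such as \( J(c) = c^2 \); the rates 0.1 and 1.1 below are arbitrary small and large choices for the demo:

# Gradient descent on J(c) = c^2, whose gradient is 2c
for alpha in (0.1, 1.1):
    c = 5.0                       # starting point
    for _ in range(20):
        c = c - alpha * 2 * c     # update step
    print(alpha, c)               # 0.1 settles near 0; 1.1 oscillates away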

Taking the derivatives of the cost function, the update equations can be written as:

\( c_0^{(\text{updated})} = c_0^{(\text{previous})} - \alpha \frac{2}{n} \sum_{i=1}^{n}\left(c_0 + c_1\,\mathit{hrs}_i - y_{\text{true},i}\right) \)

or, \( c_0^{(\text{updated})} = c_0^{(\text{previous})} - \alpha \frac{2}{n} \sum_{i=1}^{n}\left(y_{\text{pred},i} - y_{\text{true},i}\right) \)

\( c_1^{(\text{updated})} = c_1^{(\text{previous})} - \alpha \frac{2}{n} \sum_{i=1}^{n}\left(c_0 + c_1\,\mathit{hrs}_i - y_{\text{true},i}\right)\mathit{hrs}_i \)

or, \( c_1^{(\text{updated})} = c_1^{(\text{previous})} - \alpha \frac{2}{n} \sum_{i=1}^{n}\left(y_{\text{pred},i} - y_{\text{true},i}\right)\mathit{hrs}_i \)
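To make these update rules concrete, here is a minimal from-scratch sketch in NumPy. The synthetic data, the learning rate of 0.01, and the 5000 iterations are illustrative choices rather than values from the text:

import numpy as np

np.random.seed(0)
hrs = np.random.rand(50) * 10                     # hours studied (synthetic)
ytrue = 5 * hrs + 40 + 3 * np.random.randn(50)    # noisy scores around 40 + 5 * hrs

c0, c1 = 0.0, 0.0        # initial guesses
alpha = 0.01             # learning rate
n = len(hrs)

for _ in range(5000):
    ypred = c0 + c1 * hrs
    # Gradients of J with respect to c0 and c1, as in the equations above
    c0 = c0 - alpha * (2 / n) * np.sum(ypred - ytrue)
    c1 = c1 - alpha * (2 / n) * np.sum((ypred - ytrue) * hrs)

print(c0, c1)            # should end up close to 40 and 5

Because the cost is convex, any reasonable starting point reaches the same minimum; only the number of steps changes.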

In the above examples, we considered a single independent variable: hours worked or hours studied. But there are situations where the outcome depends on more than one variable. For example, a house price depends on square footage, number of bedrooms, number of stories, and so on. This type of regression is known as multiple (or multivariate) linear regression. The general form of linear regression with m input variables can thus be written as:

\( y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_m x_m \)

The coefficients of the above equation can be calculated in the same way using the gradient descent method.
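As a small sketch of this multi-variable case, the same sklearn estimator used in the implementation below accepts several input columns at once; the synthetic house-price data and its coefficients here are invented for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, 100)          # square footage (synthetic)
beds = rng.integers(1, 6, 100)              # number of bedrooms (synthetic)
price = 150 * sqft + 10000 * beds + 20000 + rng.normal(0, 5000, 100)

X = np.column_stack([sqft, beds])           # one column per feature
reg = LinearRegression().fit(X, price)
print(reg.intercept_, reg.coef_)            # should be near 20000 and [150, 10000]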

Implementation:

In this example, we will use the sklearn package for linear regression. We will generate a synthetic dataset using the linear equation \( y = 8x + 7 + \varepsilon \),

where \( \varepsilon \) is an error term generated from a Gaussian distribution.

# Import necessary library functions

import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Generate Dataset

np.random.seed(100)
x = np.round(np.random.rand(30) * 10, 1)

y = 8 * x + 7 + (5 + 3 * np.random.randn(30))   # epsilon drawn from N(5, 3^2)
plt.scatter(x, y)

x = x.reshape(-1,1)
y = y.reshape(-1,1)

# Split the dataset as train and test set

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=100)

# Normalize the dataset

scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Perform linear regression and predict the result on test set

reg = LinearRegression()
reg.fit(x_train, y_train)

y_pred = reg.predict(x_test)

# Evaluation metric

print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
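As a final sanity check against the generating equation, the fitted coefficients can be mapped back to the original (unscaled) x. This sketch reuses the reg and scaler objects from above; note that with the noise mean of 5 in the data generation, the effective intercept is 7 + 5 = 12:

# Undo the standardization: y = b + w * (x - mean) / scale
slope = reg.coef_.ravel() / scaler.scale_
intercept = reg.intercept_ - reg.coef_.ravel() * scaler.mean_ / scaler.scale_
print('slope:', slope)           # should be close to 8
print('intercept:', intercept)   # should be close to 12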
