Linear Regression Model
Linear regression can be very useful in many business situations. This article walks you through how to create a linear regression model.
Linear regression is a machine learning technique used to establish a relationship between a scalar response and one or more explanatory variables. The scalar response is called the target or dependent variable, while the explanatory variables are known as independent variables. When more than one independent variable is used in the model, we call it multiple linear regression.
Independent variables are known as explanatory variables because they explain the factors that control the dependent variable, along with the degree of their impact, which is quantified through 'parameter estimates' or 'coefficients'.
Coefficients are tested for statistical significance with the help of confidence intervals built around them, which also supports model robustness. Elasticities are derived from the coefficients and describe the extent to which a given factor explains the dependent variable. In addition, a negative coefficient indicates an inverse relationship with the dependent variable, while a positive coefficient indicates a positive influence.
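As a sketch of how coefficient estimates and the confidence intervals around them arise, here is a minimal NumPy example on synthetic data (the true intercept 2.0, slope 3.0, sample size, and noise level are illustrative assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)  # simulated data

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]     # OLS coefficient estimates

resid = y - X @ beta
dof = n - X.shape[1]                            # degrees of freedom
sigma2 = resid @ resid / dof                    # residual variance estimate
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))  # standard errors

# Approximate 95% confidence intervals (normal critical value 1.96)
lower, upper = beta - 1.96 * se, beta + 1.96 * se
```

If a coefficient's interval excludes zero, the corresponding variable is commonly treated as statistically significant at that level.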
Application of Linear Regression
- Insights on consumer behavior
- Understanding business factors influencing profitability
- Estimates or forecasts
- Measure marketing effectiveness, pricing, and promotion strategies
- Assess and minimize risk in portfolios
- Understand important factors leading to customer default
A very broad overview of the important steps needed to build a linear regression model could be:
- Feasibility of application of regression technique on the data set
- Preparing the data to make it ready for regression
- Build the regression model and test its accuracy
- Save the model for future prediction
- Deploy and maintain/monitor the model
Techniques to Build a Linear Regression Model
We can build a linear regression model using either of the below techniques:
- Gradient Descent
- Ordinary Least Squares (OLS)
The process of minimizing a function by following the gradients of the cost function is called gradient descent (GD). Understanding the form of the cost function and its derivative is important in order to know the path to follow toward the minimum. In machine learning parlance, a closely related variant known as stochastic gradient descent (SGD) is commonly used; it minimizes the error of the model on the training data.
The model makes a prediction for each training instance it is shown, the error of that prediction is calculated, and the model's parameters are updated so that the error on subsequent predictions is reduced.
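The per-instance update just described can be sketched in plain NumPy on synthetic data (the learning rate, epoch count, and true parameters 4.0 and 2.5 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(-1, 1, size=n)
y = 4.0 + 2.5 * x + rng.normal(scale=0.1, size=n)  # simulated data

w, b = 0.0, 0.0          # slope and intercept, initialised at zero
lr = 0.05                # learning rate
for epoch in range(50):
    for i in rng.permutation(n):       # visit instances in random order
        err = (w * x[i] + b) - y[i]    # prediction error on one instance
        w -= lr * err * x[i]           # gradient of squared error w.r.t. w
        b -= lr * err                  # gradient of squared error w.r.t. b
```

After training, `w` and `b` should be close to the true slope and intercept used to generate the data.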
The ordinary least squares method is a technique for estimating the unknown parameters in a linear regression model. It aims to minimize the sum of the squares of the differences between the observed and the predicted values.
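The OLS minimizer has a closed form, the normal equations β = (XᵀX)⁻¹Xᵀy; a minimal NumPy sketch on synthetic data (the design matrix, true coefficients, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
# Design matrix: an intercept column plus two random features
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# Normal equations: solve (X'X) beta = X'y, which minimises the
# residual sum of squares ||y - X beta||^2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice `np.linalg.lstsq` or scikit-learn's `LinearRegression` is preferred over forming XᵀX explicitly, for numerical stability.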
The applicability of the OLS technique is based on certain assumptions, so it is good practice to check them before using OLS to build a linear regression model. The assumptions of OLS are listed below:
- Linear relationship - the relationship between the independent and dependent variables is linear
- Multivariate normality - the variables are multivariate normal
- No or little multicollinearity - there is little or no multicollinearity in the data. Multicollinearity is the phenomenon when the independent variables are highly correlated with each other
- No auto-correlation - There is little or no autocorrelation in the data. Autocorrelation can be experienced when the residuals are not independent of each other
- Homoscedasticity - Residuals exhibit homoscedasticity. Homoscedasticity describes a situation in which the error term is the same across all values of the independent variable
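One common way to screen for the multicollinearity mentioned above is the variance inflation factor (VIF), computed by regressing each independent variable on the others; a minimal NumPy sketch on synthetic data (the rule of thumb that VIF above 10 signals a problem, and the data itself, are assumptions for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the feature matrix X."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])   # intercept + other columns
        beta = np.linalg.lstsq(A, X[:, j], rcond=None)[0]
        resid = X[:, j] - A @ beta
        r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))                     # VIF_j = 1 / (1 - R_j^2)
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.01 * rng.normal(size=300)   # nearly collinear with a
X = np.column_stack([a, b, c])
vif_vals = vif(X)                     # large for a and c, near 1 for b
```

Columns `a` and `c` produce very large VIFs because each is almost perfectly predicted by the other, while the independent column `b` stays near 1.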
Build a Linear Regression Model from Scratch
Let's build a linear regression model from scratch on a publicly available data set, using both the OLS and SGD techniques in Python.
There are many ways/libraries to build a linear regression model in Python, but we will mainly use scikit-learn, pandas, and NumPy.
Screenshots from the data wrangling, model building, and model saving activities are shared below.
1. Importing the necessary Python libraries
# Enabling print for all lines
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Importing the necessary libraries
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline
import seaborn as sns
2. Loading the Data Set
Boston Housing Data
Description:
This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston), and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small in size, with only 506 cases.
The data was originally published by Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
Dataset Naming:
The name for this dataset is simply boston. MEDV is the variable we are trying to predict.
Miscellaneous Details:
- Origin- The origin of the boston housing data is Natural
- Usage- This dataset may be used for Assessment
- Number of Cases- The dataset contains a total of 506 cases
- Order- The order of the cases is mysterious
- Variables- There are 14 attributes in each case of the dataset. They are:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per 10,000 dollars
- PTRATIO - pupil teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner occupied homes in 1000s
# Loading the data set from scikit-learn
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
from sklearn.datasets import load_boston
boston = load_boston()
type(boston)
Output:
sklearn.utils.Bunch
# Converting the inbuilt data (independent variables) into a data frame
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Converting the target into a series
df['target'] = pd.Series(boston.target)
3. Having a look at the different variables from the data set
Input:
# Having a look at the first few observations in the data and basic data properties
df.head()
df.shape
df.info()
Output:
4. Insights from the data set
Input:
# Checking the statistical properties of the variables
df.describe().T
Output:
Input:
# Check for column wise missing values
df.isnull().sum()
Output:
Input:
# Building the box plot
rcParams['figure.figsize'] = 20,5
df.boxplot(color=dict(boxes='r', whiskers='r', medians='r', caps='r'))
# sns.set(rc={'figure.figsize':(6,6)})
# sns.set_style("whitegrid")
# sns.boxplot(data=df, orient="h", palette="Set2")
Output:
Input:
# Finding and plotting the correlation for the independent variables
sns.set(rc={'figure.figsize':(14,5)})
ind_var = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# df[ind_var].corr()
sns.heatmap(df[ind_var].corr(), cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15));
Output:
Input:
# Building a pair plot to understand the relationship between independent and dependent variables
sns.pairplot(df, x_vars = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM'], y_vars = ['target']);
sns.pairplot(df, x_vars = ['AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], y_vars = ['target']);
Output:
From the above pair plot, we can conclude:
- RM and MEDV have a shape like that of a normally distributed graph
- AGE is skewed to the left and LSTAT is skewed to the right
- TAX has a large concentration of values around 700
5. Building the model using scikit learn
Input:
# Separating out the independent and target data
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
Input:
# Splitting the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=113)
# Checking the shape of the four parts of the data after split
X_train.shape, X_test.shape
y_train.shape, y_test.shape
Output:
((404, 13), (102, 13))
((404,), (102,))
Input:
# Building the linear regression model using sklearn
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
# Fitting the train data to the model
model = regressor.fit(X_train, y_train)
6. Prediction using the model and scoring the model
Input:
# Making predictions on test data using the above model
y_pred = model.predict(X_test)
# Let's check the model accuracy/score on the training data
model.score(X_train, y_train)
# Let's check the model accuracy/score on the test data
model.score(X_test, y_test)
Output:
0.7345219949558541
0.7505954479592696
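The score returned by LinearRegression.score is the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that computation (the small arrays here are made up for illustration, not taken from the Boston data):

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
r2 = r2_score_manual(y_true, y_pred)   # ~0.9486
```

An R² of 0.75, as above, means the model explains about 75% of the variance in the target.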
Input:
from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score
import math
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
mv = explained_variance_score(y_test, y_pred)
print("Mean Absolute Error is", round(mae,1))
print("Root Mean Squared Error is", round(rmse,1))
print("Variance explained by model", round(mv*100,1), "%")
Output:
Mean Absolute Error is 3.4 Root Mean Squared Error is 5.0 Variance explained by model 75.1 %
Input:
# Linear regression coefficients and intercept. We can build the regression equation using the parameters below
model.coef_
model.intercept_
Output:
array([-1.14331280e-01, 3.30399664e-02, 2.19911151e-02, 1.93047806e+00, -1.53459876e+01, 4.11678898e+00, -5.20475977e-03, -1.26111638e+00, 3.52665352e-01, -1.37375084e-02, -1.01521476e+00, 9.98692962e-03, -4.90950094e-01])
33.54862614612348
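With these parameters, a prediction is simply the intercept plus the dot product of the feature values and the coefficients. A tiny sketch with made-up numbers (not the Boston coefficients above):

```python
import numpy as np

coef = np.array([2.0, -1.0, 0.5])   # hypothetical coefficients
intercept = 10.0                    # hypothetical intercept
x_new = np.array([1.0, 2.0, 4.0])   # one new observation

# Regression equation: y_hat = intercept + x1*c1 + x2*c2 + x3*c3
manual_pred = intercept + x_new @ coef   # 10 + 2 - 2 + 2 = 12
```

This is exactly what `model.predict` computes for each row of the input, using `model.intercept_` and `model.coef_`.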
7. Saving the model locally
Input:
# Saving the model
import pickle
filename = 'boston_regression_model.sav'
pickle.dump(model, open(filename, 'wb'))
# Loading the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
result = round(loaded_model.score(X_test, y_test)*100,2)
print("Model accuracy is", result, "%")
Output:
Model accuracy is 75.06 %
Conclusion:
To summarize, linear regression can be very useful in many business situations; however, it has limited applicability in certain scenarios, as it works only when the dependent variable is continuous in nature.
Opinions expressed by DZone contributors are their own.