Linear Regression Model
Linear regression can be very useful in many business situations. This article walks you through how to create a linear regression model.
Linear regression is a machine learning technique used to establish a relationship between a scalar response and one or more explanatory variables. The scalar response is called the target or dependent variable, while the explanatory variables are known as independent variables. When more than one independent variable is used in the model, we call it multiple linear regression.
Independent variables are known as explanatory variables because they explain the factors that control the dependent variable, along with the degree of their impact, which is quantified through 'parameter estimates' or 'coefficients'.
Coefficients are tested for statistical significance with the help of confidence intervals built around them, which also supports model robustness. Elasticities are derived from the coefficients and describe the extent to which a given factor explains the dependent variable. In addition, a negative coefficient indicates an inverse relationship with the dependent variable, while a positive coefficient indicates a positive influence.
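As a sketch of how coefficient estimates and the confidence intervals around them arise, here is a minimal NumPy example on synthetic data (the true intercept 2.0, slope 3.0, sample size, and noise level are illustrative assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)  # simulated data

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]     # OLS coefficient estimates

resid = y - X @ beta
dof = n - X.shape[1]                            # degrees of freedom
sigma2 = resid @ resid / dof                    # residual variance estimate
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))  # standard errors

# Approximate 95% confidence intervals (normal critical value 1.96)
lower, upper = beta - 1.96 * se, beta + 1.96 * se
```

If a coefficient's interval excludes zero, the corresponding variable is commonly treated as statistically significant at that level.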
Application of Linear Regression
- Insights on consumer behavior
- Understanding business factors influencing profitability
- Estimates or forecasts
- Measure marketing effectiveness, pricing, and promotion strategies
- Assess and minimize risk in portfolios
- Understand important factors leading to customer default
A very broad overview of the important steps needed to build a linear regression model could be:
- Feasibility of application of regression technique on the data set
- Preparing the data to make it ready for regression
- Build the regression model and test its accuracy
- Save the model for future prediction
- Deploy and maintain/monitor the model
Techniques to Build a Linear Regression Model
We can build a linear regression model using either of the below techniques:
- Gradient Descent
- Ordinary Least Squares (OLS)
The process of minimizing a function by following the gradients of the cost function is called gradient descent (GD). Understanding the form of the cost function and its derivative is important in order to know the path to follow toward the minimum. In machine learning parlance, a closely related variant known as stochastic gradient descent (SGD) is commonly used; it minimizes the error of the model on the training data.
The model makes a prediction for each training instance it is shown, the error of that prediction is calculated, and the model's parameters are updated so that the error on subsequent predictions is reduced.
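The per-instance update just described can be sketched in plain NumPy on synthetic data (the learning rate, epoch count, and true parameters 4.0 and 2.5 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(-1, 1, size=n)
y = 4.0 + 2.5 * x + rng.normal(scale=0.1, size=n)  # simulated data

w, b = 0.0, 0.0          # slope and intercept, initialised at zero
lr = 0.05                # learning rate
for epoch in range(50):
    for i in rng.permutation(n):       # visit instances in random order
        err = (w * x[i] + b) - y[i]    # prediction error on one instance
        w -= lr * err * x[i]           # gradient of squared error w.r.t. w
        b -= lr * err                  # gradient of squared error w.r.t. b
```

After training, `w` and `b` should be close to the true slope and intercept used to generate the data.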
The ordinary least squares method is a technique for estimating the unknown parameters in a linear regression model. It aims to minimize the sum of the squares of the differences between the observed and the predicted values.
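The OLS minimizer has a closed form, the normal equations β = (XᵀX)⁻¹Xᵀy; a minimal NumPy sketch on synthetic data (the design matrix, true coefficients, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
# Design matrix: an intercept column plus two random features
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# Normal equations: solve (X'X) beta = X'y, which minimises the
# residual sum of squares ||y - X beta||^2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice `np.linalg.lstsq` or scikit-learn's `LinearRegression` is preferred over forming XᵀX explicitly, for numerical stability.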
The applicability of the OLS technique is based on certain assumptions, so it is good practice to check them before using OLS to build a linear regression model. The assumptions of OLS are listed below:
- Linear relationship - the relationship between the independent and dependent variables is linear
- Multivariate normality - the variables are multivariate normal
- No or little multicollinearity - there is little or no multicollinearity in the data. Multicollinearity is the phenomenon when the independent variables are highly correlated with each other
- No auto-correlation - There is little or no autocorrelation in the data. Autocorrelation can be experienced when the residuals are not independent of each other
- Homoscedasticity - Residuals exhibit homoscedasticity. Homoscedasticity describes a situation in which the error term is the same across all values of the independent variable
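One common way to screen for the multicollinearity mentioned above is the variance inflation factor (VIF), computed by regressing each independent variable on the others; a minimal NumPy sketch on synthetic data (the rule of thumb that VIF above 10 signals a problem, and the data itself, are assumptions for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the feature matrix X."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])   # intercept + other columns
        beta = np.linalg.lstsq(A, X[:, j], rcond=None)[0]
        resid = X[:, j] - A @ beta
        r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))                     # VIF_j = 1 / (1 - R_j^2)
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.01 * rng.normal(size=300)   # nearly collinear with a
X = np.column_stack([a, b, c])
vif_vals = vif(X)                     # large for a and c, near 1 for b
```

Columns `a` and `c` produce very large VIFs because each is almost perfectly predicted by the other, while the independent column `b` stays near 1.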
Build a Linear Regression Model from Scratch
Let's build a linear regression model from scratch on a publicly available data set, using both the OLS and SGD techniques in Python.
There are many ways/libraries to build a linear regression model in Python, but we will mainly use scikit-learn, pandas, and NumPy.
Screenshots from the data wrangling, model building, and model saving activities are shared below.
1. Importing the necessary Python libraries
# Enabling print for all lines
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Importing the necessary libraries
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline
import seaborn as sns
2. Loading the Data Set
Boston Housing Data
Description:
This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston), and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small in size, with only 506 cases.
The data was originally published by Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
Dataset Naming:
The name for this dataset is simply boston. MEDV is the variable we are trying to predict.
Miscellaneous Details:
- Origin- The origin of the boston housing data is Natural
- Usage- This dataset may be used for Assessment
- Number of Cases- The dataset contains a total of 506 cases
- Order- The order of the cases is mysterious
- Variables- There are 14 attributes in each case of the dataset. They are:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per 10,000 dollars
- PTRATIO - pupil teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner occupied homes in 1000s
# Loading the data set from scikit-learn
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
from sklearn.datasets import load_boston
boston = load_boston()
type(boston)
Output:
sklearn.utils.Bunch
# Converting the inbuilt data (independent variables) into a data frame
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Converting the target into a series
df['target'] = pd.Series(boston.target)
3. Having a look at the different variables from the data set
Input:
# Having a look at the first few observations in the data and basic data properties
df.head()
df.shape
df.info()
Output:
4. Insights from the data set
Input:
# Checking the statistical properties of the variables
df.describe().T
Output:
Input:
# Check for column wise missing values
df.isnull().sum()
Output:
Input:
# Building the box plot
rcParams['figure.figsize'] = 20,5
df.boxplot(color=dict(boxes='r', whiskers='r', medians='r', caps='r'))
# sns.set(rc={'figure.figsize':(6,6)})
# sns.set_style("whitegrid")
# sns.boxplot(data=df, orient="h", palette="Set2")
Output:
Input:
# Finding and plotting the correlation for the independent variables
sns.set(rc={'figure.figsize':(14,5)})
ind_var = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# df[ind_var].corr()
sns.heatmap(df[ind_var].corr(), cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15));
Output:
Input:
# Building a pair plot to understand the relationship between independent and dependent variables
sns.pairplot(df, x_vars = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM'], y_vars = ['target']);
sns.pairplot(df, x_vars = ['AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], y_vars = ['target']);
Output:
From the above pair plot, we can conclude:
- RM and MEDV have a shape like that of a normally distributed graph
- AGE is skewed to the left and LSTAT is skewed to the right
- TAX has a large concentration of values around 700
5. Building the model using scikit learn
Input:
# Separating out the independent and target data
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
Input:
# Splitting the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=113)
# Checking the shape of the four parts of the data after split
X_train.shape, X_test.shape
y_train.shape, y_test.shape
Output:
((404, 13), (102, 13))
((404,), (102,))
Input:
# Building the linear regression model using sklearn
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
# Fitting the train data to the model
model = regressor.fit(X_train, y_train)
6. Prediction using the model and scoring the model
Input:
# Making predictions on test data using the above model
y_pred = model.predict(X_test)
# Let's check the model accuracy/score on the training data
model.score(X_train, y_train)
# Let's check the model accuracy/score on the test data
model.score(X_test, y_test)
Output:
0.7345219949558541
0.7505954479592696
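The score returned by LinearRegression.score is the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that computation (the small arrays here are made up for illustration, not taken from the Boston data):

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
r2 = r2_score_manual(y_true, y_pred)   # ~0.9486
```

An R² of 0.75, as above, means the model explains about 75% of the variance in the target.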
Input:
from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score
import math
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
mv = explained_variance_score(y_test, y_pred)
print("Mean Absolute Error is", round(mae,1))
print("Root Mean Squared Error is", round(rmse,1))
print("Variance explained by model", round(mv*100,1), "%")
Output:
Mean Absolute Error is 3.4 Root Mean Squared Error is 5.0 Variance explained by model 75.1 %
Input:
# Linear regression coefficients and intercept. We can build the regression equation using the parameters below
model.coef_
model.intercept_
Output:
array([-1.14331280e-01, 3.30399664e-02, 2.19911151e-02, 1.93047806e+00, -1.53459876e+01, 4.11678898e+00, -5.20475977e-03, -1.26111638e+00, 3.52665352e-01, -1.37375084e-02, -1.01521476e+00, 9.98692962e-03, -4.90950094e-01])
33.54862614612348
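With these parameters, a prediction is simply the intercept plus the dot product of the feature values and the coefficients. A tiny sketch with made-up numbers (not the Boston coefficients above):

```python
import numpy as np

coef = np.array([2.0, -1.0, 0.5])   # hypothetical coefficients
intercept = 10.0                    # hypothetical intercept
x_new = np.array([1.0, 2.0, 4.0])   # one new observation

# Regression equation: y_hat = intercept + x1*c1 + x2*c2 + x3*c3
manual_pred = intercept + x_new @ coef   # 10 + 2 - 2 + 2 = 12
```

This is exactly what `model.predict` computes for each row of the input, using `model.intercept_` and `model.coef_`.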
7. Saving the model locally
Input:
# Saving the model
import pickle
filename = 'boston_regression_model.sav'
pickle.dump(model, open(filename, 'wb'))
# Loading the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
result = round(loaded_model.score(X_test, y_test)*100,2)
print("Model accuracy is", result, "%")
Output:
Model accuracy is 75.06 %
Conclusion:
To summarize, linear regression can be very useful in many business situations; however, it has limited applicability in certain scenarios, as it works only when the dependent variable is continuous in nature.
Opinions expressed by DZone contributors are their own.