Linear regression is one of the most widely used supervised learning algorithms. It predicts the value of a variable based on input data, for example forecasting sales in upcoming months based on marketing expenditure, or predicting the stock price range in the upcoming year.
In other words, linear regression is a type of regression analysis used to predict the unknown value of a dependent variable from the known value of an independent variable, under the assumption that the two variables are linearly related.
Linear regression establishes the relationship between two variables by fitting a linear equation to observed data. For example, if increasing the marketing budget tends to increase product sales, there is a positive relationship between the independent variable (marketing budget) and the dependent variable (product sales).
Before getting into more details of linear regression, let us first understand what regression analysis is.
What is regression analysis?
If you have variables that are correlated, you can establish an equation between them. This equation can then predict the value of one variable when the values of the other variables are known.
Regression analysis is the statistical process of establishing a relationship between a dependent variable (unknown) and one or more independent variables (known).
- Dependent Variable (y): It is the variable whose value we are trying to predict. Since its value depends upon other variables, it is called the dependent variable. For example, the stock price is a dependent variable because its future value depends upon many other factors or variables.
- Independent Variable (x): It is a variable whose value is already known and that can affect the value of the dependent variable. For example, the stock price varies over time, so time is an independent variable here.
- Regression curve:
- It is obtained by plotting all the data points (dependent and independent variable values) on a graph and drawing the curve that best fits them.
- This curve best describes how the dependent variable changes as the independent variable changes, giving us an idea of the trend in the data.
- The regression curve can be a curve or a straight line depending upon the type of regression analysis. For linear regression it is a straight line, whereas for logistic regression it is a sigmoid curve.
Product sales largely depend on the marketing of the product. Therefore “Product Sales” is the dependent variable, and “Marketing Cost”, which is treated here as not depending on any other factor, is the independent variable.
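The idea of turning correlated variables into a predictive equation can be sketched in a few lines. The figures below are hypothetical and invented only for illustration, and NumPy's `corrcoef` and `polyfit` are used as one convenient way to measure the correlation and fit a straight line; treat this as a sketch rather than a complete analysis.

```python
import numpy as np

# Hypothetical monthly figures, invented for illustration only.
marketing_cost = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # independent variable
product_sales = np.array([24.0, 46.0, 64.0, 87.0, 104.0])   # dependent variable

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship.
correlation = np.corrcoef(marketing_cost, product_sales)[0, 1]

# Fit a degree-1 polynomial (a straight line); polyfit returns the slope first.
slope, intercept = np.polyfit(marketing_cost, product_sales, deg=1)

# Use the fitted equation to predict sales for a new marketing budget.
predicted_sales = intercept + slope * 60.0

print(f"correlation = {correlation:.3f}")
print(f"sales = {intercept:.2f} + {slope:.2f} * marketing_cost")
print(f"predicted sales at cost 60: {predicted_sales:.2f}")
```

The fitted line is the regression curve for this data, and evaluating it at a new marketing cost gives the predicted value of the dependent variable.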
Classification of regression analysis:
Regression analysis can be classified based on different factors, such as the number of independent variables, the shape of the regression line, and the type of dependent variable.
1. Based on Number of Independent Variables:
In all the examples we have seen until now, we have considered only one independent variable.
However, the value of the dependent variable can depend on more than one independent variable. For example, product sales may depend upon marketing cost as well as R&D cost.
- Simple linear regression: When the dependent variable is calculated from only one independent variable, it is known as simple linear regression, e.g. calculating crop yield based only on total rainfall.
- Multiple linear regression: When the dependent variable is calculated from two or more independent variables, it is known as multiple linear regression, e.g. calculating crop yield based upon total rainfall as well as temperature.
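The difference between the two cases comes down to how many feature columns the model is given. The sketch below uses hypothetical crop data and a small helper, `fit_linear`, which is not from any library: it is an illustrative wrapper around NumPy's least-squares solver. The yield values are generated from a known formula so that the recovered coefficients can be checked.

```python
import numpy as np

# Hypothetical crop data, invented for illustration only.
rainfall = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 100.0])     # mm
temperature = np.array([20.0, 24.0, 21.0, 27.0, 23.0, 26.0])   # degrees Celsius
# Yield built from a known formula (no noise) so the fit can be verified.
crop_yield = 1.0 + 0.05 * rainfall + 0.2 * temperature


def fit_linear(features, target):
    """Least-squares fit of target = intercept + features @ coefficients.

    Illustrative helper: returns (intercept, coefficients).
    """
    # Prepend a column of ones so the solver also estimates the intercept.
    design = np.column_stack([np.ones(len(target)), features])
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[0], solution[1:]


# Simple linear regression: one independent variable (rainfall).
simple_intercept, simple_coefs = fit_linear(rainfall.reshape(-1, 1), crop_yield)

# Multiple linear regression: two independent variables (rainfall, temperature).
multi_features = np.column_stack([rainfall, temperature])
multi_intercept, multi_coefs = fit_linear(multi_features, crop_yield)

print("simple:  ", simple_intercept, simple_coefs)
print("multiple:", multi_intercept, multi_coefs)
```

The simple model returns a single slope, while the multiple model returns one coefficient per independent variable.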
2. Based on Shape of the Regression line:
When we plot all the data points (values of the dependent and independent variables) on a graph and draw the regression curve, some regression techniques produce a straight line whereas others produce a curve. Based upon the shape of the regression curve, regression analysis can be classified into two types.
- Linear: In this regression model the dependent and independent variables vary linearly. If we plot the dependent variable against the independent variable, the points fall roughly along a straight line rather than a curve, and the regression line is a straight line.
- Non-Linear: When there is a non-linear relationship between the dependent and independent variables, the regression line comes out to be a curve and not a straight line. For example, in a logistic regression problem the regression curve is a sigmoid curve.
3. Based on data type of Dependent Variable:
In regression analysis, the dependent variable can be a continuous quantity (any real number within some range) or a categorical value (for example, a binary value that is either 0 or 1). Based on the data type of the dependent variable, regression analysis can be classified into two types:
- Continuous Value: Here the dependent variable is a continuous quantity, i.e. it can take any real value within a range, not only integers. For example, the stock price is a continuous quantity. Linear regression algorithms are used to solve these problems.
- Categorical Value: Here the dependent variable is categorical or binary in nature, i.e. Yes/No, True/False, 0/1, etc. Logistic regression is used for solving such classification problems, for example predicting whether an incoming email is spam or not, or whether it will rain today based on weather conditions.
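The contrast between a straight-line output and a sigmoid output can be made concrete with a small sketch. The coefficients `b0` and `b1` below are hypothetical, chosen only to make the numbers easy to follow, and the 0.5 threshold used to turn a probability into a class is a common convention rather than something stated in this article.

```python
import math


def linear_output(x, b0, b1):
    """Straight-line model: the output can be any real number."""
    return b0 + b1 * x


def sigmoid(z):
    """Sigmoid curve: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -4.0, 0.8

results = []
for x in [0.0, 5.0, 10.0]:
    z = linear_output(x, b0, b1)             # continuous value (linear regression style)
    probability = sigmoid(z)                 # value between 0 and 1 (logistic regression style)
    label = 1 if probability >= 0.5 else 0   # assumed threshold for a binary class
    results.append((x, z, probability, label))
    print(f"x={x:4.1f}  linear output={z:5.1f}  sigmoid={probability:.3f}  class={label}")
```

The linear output grows without bound as x increases, while the sigmoid output flattens out near 0 and 1, which is what makes it suitable for categorical targets.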
Simple Linear regression:
It is a type of regression analysis in which the values of the dependent and independent variables vary linearly, which is why this model is called a linear regression model. For example, the price of a house increases with an increase in its area or size.
If we plot the dependent variable against the independent variable, the points fall roughly along a straight line rather than a curve.
In simple linear regression, the value of the dependent variable is calculated from a single independent variable.
The equation for simple linear regression is Y = B0 + B1 * X, where:
- “Y” is the dependent variable, plotted on the Y-axis, whose value we are trying to predict. It is a continuous quantity, i.e. it can take any real value.
- “X” is the independent variable whose value we already know.
- “B0” is the “Y” intercept, i.e. the value of “Y” when “X” is zero.
- “B1” is the slope of the regression line.
- The green line in the plot is the regression line.
- The distance between the predicted value and the actual value is known as the error in prediction. Our aim while solving a linear regression problem is to fit the regression line so that this error is as small as possible.
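For a single independent variable, the line with the smallest squared error has a well-known closed form: B1 is the covariance of X and Y divided by the variance of X, and B0 follows from the means. The sketch below applies it to hypothetical house data (invented for illustration); the function names are illustrative and not part of any library.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) minimizing the sum of squared prediction errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx                # slope of the regression line
    b0 = mean_y - b1 * mean_x     # Y intercept
    return b0, b1


def residual_sum_of_squares(xs, ys, b0, b1):
    """Sum of squared distances between actual and predicted values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical house data: area (square metres) and price (thousands).
area = [50.0, 70.0, 90.0, 110.0, 130.0]
price = [152.0, 208.0, 272.0, 329.0, 389.0]

b0, b1 = fit_simple_linear_regression(area, price)
rss_best = residual_sum_of_squares(area, price, b0, b1)

# Nudging the slope or the intercept away from the fitted values increases the error.
rss_other_slope = residual_sum_of_squares(area, price, b0, b1 + 0.1)
rss_other_intercept = residual_sum_of_squares(area, price, b0 + 1.0, b1)

print(f"Y = {b0:.3f} + {b1:.3f} * X, error = {rss_best:.2f}")
print(f"error with a different slope:     {rss_other_slope:.2f}")
print(f"error with a different intercept: {rss_other_intercept:.2f}")
```

Comparing the three error values shows what "best fit" means here: any other choice of B0 or B1 gives a larger sum of squared errors on the same data.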
Simple Linear regression example:
Here we use the “Diabetes” dataset that ships with the sklearn library. The straight line in the plot shows how linear regression fits the line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.
The coefficients, the mean squared error and the coefficient of determination are also calculated.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
- Coefficients: [938.23786125]
- Mean squared error: 2548.07
- Coefficient of determination: 0.47
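To make the two reported metrics less opaque, here is a sketch that computes them by hand using the standard definitions: mean squared error is the average squared difference between actual and predicted values, and the coefficient of determination is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sample values and the function names are hypothetical; they are not taken from the diabetes dataset.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

mse = mse_by_hand(y_true, y_pred)
r2 = r2_by_hand(y_true, y_pred)

print(f"Mean squared error: {mse:.4f}")
print(f"Coefficient of determination: {r2:.4f}")
```

A coefficient of determination of 1 means perfect prediction, so the 0.47 reported above indicates that a single feature explains only part of the variation in the target.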
Applications of linear regression:
Linear regression can be used for prediction in many fields. A few of them are listed here:
- Risk analysis in Insurance Domain:
- A car insurance company can use linear regression to build a suggested premium table from the predicted ratio of claims to Insured Declared Value. The risk can be assessed based on the attributes of the car, driver information, or demographics. The results of such an analysis might guide important business decisions.
- Calculating Marketing effectiveness:
- Companies can estimate the effectiveness of a marketing campaign by analyzing the money invested in marketing against the increase in sales. Linear regression also makes it possible to capture the isolated impact of each marketing campaign while controlling for other factors that could influence sales. In real-life scenarios, multiple advertising campaigns often run during the same time period. Supposing two campaigns are run on TV and Radio in parallel, a linear regression can capture the isolated as well as the combined impact of running these ads together.
- Optimizing business processes:
- In the food processing industry, linear regression can be used to build a model of the relationship between oven temperature and the shelf life of the cookies baked in those ovens. This relationship helps to adjust the oven temperature to its optimum, which in turn results in cost savings.
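The TV and Radio scenario from the marketing application can be sketched as a multiple linear regression. One common way to model the "combined impact" is an interaction term (TV spend multiplied by Radio spend); that reading is an assumption, not something this article specifies. The spend figures are hypothetical, and sales are generated from known effects so the recovered coefficients can be checked.

```python
import numpy as np

# Hypothetical weekly ad spend, invented for illustration only.
tv = np.array([0.0, 1.0, 0.0, 1.0, 2.0, 2.0, 0.0, 1.0])
radio = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 2.0, 2.0])

# Sales built from known effects (no noise) so the fit can be verified:
# baseline 10, TV effect 3, Radio effect 2, combined (interaction) effect 1.5.
sales = 10.0 + 3.0 * tv + 2.0 * radio + 1.5 * tv * radio

# Design matrix columns: intercept, TV, Radio, and the assumed TV*Radio interaction.
design = np.column_stack([np.ones_like(tv), tv, radio, tv * radio])
coefficients, *_ = np.linalg.lstsq(design, sales, rcond=None)
baseline, tv_effect, radio_effect, combined_effect = coefficients

print(f"baseline sales:  {baseline:.2f}")
print(f"TV effect:       {tv_effect:.2f}")
print(f"Radio effect:    {radio_effect:.2f}")
print(f"combined effect: {combined_effect:.2f}")
```

The TV and Radio coefficients estimate each campaign's isolated impact, while the interaction coefficient estimates the extra sales attributed to running both together.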
Summary:
- Regression analysis is a statistical process of establishing a relationship between a dependent variable (unknown) and independent variables (known).
- Linear regression is a type of regression analysis in which the variables are in a linear relationship.
- Regression analysis can be classified based on factors such as the number of independent variables, the shape of the regression line, and the type of dependent variable.
- The regression curve is obtained by drawing the best-fitting curve through the data points, i.e. the curve with the minimum error between predicted and actual values.
- In linear regression the regression curve is a straight line, whereas in other regression techniques it can be non-linear.