Linear regression is also known as ordinary least squares (OLS) or linear least squares, and it opens the door to the regression world. It is one of the most widely known modeling techniques and is usually among the first topics people master when learning predictive modeling. We distinguish between simple and multiple linear regression, and in this article we're going to focus on these two.
What are simple and multiple linear regression?
So, simple linear regression is one of the most basic and commonly used regression techniques, but what are some real-life examples of when we can use it? Businesses often use linear regression to evaluate the relationship between advertising spend and revenue; scientists use it to understand the relationship between a drug's dosage and, say, patients' blood sugar, or to evaluate the effect of fertilizer on crop yields. In short, you can use simple linear regression when you have one dependent and one independent variable, both variables are continuous, and the relationship between them can be represented by a straight line.
So, when do we use linear regression? We use this modeling technique to model the relationship between a dependent variable (Y) and one (simple linear regression) or more (multiple linear regression) independent variables (X) using a best-fit straight line (also known as a regression line). Despite the term “linear model,” this type of regression can also model curvature, as we'll see below. We use linear regression when:
- the dependent variable is continuous,
- the independent variable(s) can be continuous or discrete,
- the nature of a regression line is linear.
Linear regression equation
The equation for linear regression that can be used to predict the value of the dependent variable (Y) based on the given independent variable (X) is:
Y = a + b*X + e
- Y is the value of the dependent variable, a value that is being predicted or explained
- a or alpha is the value of Y when X = 0 – it stands for the intercept
- b or beta is the slope of the regression line, i.e. the coefficient of X – it tells us how much Y changes, on average, for each one-unit change in X. In other words, we use linear regression to understand the mean change in the dependent variable (Y) given a one-unit change in each independent variable (X).
- X is the value of the independent variable, the value that predicts or explains the value of Y
- e is the error term; the error in predicting the value of Y, given the value of X
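To make the equation concrete, here is a minimal sketch in Python. The variable names and numbers are purely hypothetical (think of X as advertising spend and Y as revenue), and the error term is left out because we are only computing predictions.

```python
import numpy as np

# Hypothetical coefficients: a is the intercept, b is the slope.
a = 2.0   # predicted Y when X = 0
b = 0.5   # change in Y for each one-unit increase in X

X = np.array([10.0, 20.0, 30.0])  # independent variable values (made up)
Y_hat = a + b * X                 # predicted values of the dependent variable

print(Y_hat)  # [ 7. 12. 17.]
```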
How do we fit the best linear regression line?
The regression line is defined by the values of a and b. We calculate the line for the available data with the least squares method (this is why simple regression is also called the linear least squares technique). There are other methods, too, but this is probably the most common way to fit the best-fit regression line. The idea is to minimize the sum of the squares of the vertical deviations from each data point to the line. The deviations are squared first so that, when added up, positive and negative values do not cancel each other out.
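To make the least squares method concrete, here is a small sketch in Python with made-up data points. It computes a and b with the standard closed-form formulas and then checks the result against numpy's own least-squares polynomial fit.

```python
import numpy as np

# Toy data (hypothetical): one independent variable X, one dependent variable Y.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates that minimize the sum of squared
# vertical deviations from each point to the line:
#   b = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
#   a = mean(Y) - b * mean(X)
x_dev = X - X.mean()
y_dev = Y - Y.mean()
b = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
a = Y.mean() - b * X.mean()
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")

# Sanity check against numpy's built-in least-squares line fit.
b_np, a_np = np.polyfit(X, Y, deg=1)
print(f"np.polyfit: intercept = {a_np:.3f}, slope = {b_np:.3f}")
```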
Linear models are one of the most common forms of regression, and when we have a continuous dependent variable, linear regression is probably the first type of regression model we should consider. However, despite the term “linear model”, we can add polynomial terms to give the regression line curvature, and we can also include interaction effects between variables.
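Here is a quick sketch of that idea with made-up, roughly quadratic data: the model remains linear in its coefficients, but adding an X² column to the design matrix lets the fitted line curve.

```python
import numpy as np

# Hypothetical curved data: Y grows roughly with the square of X.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([1.2, 4.1, 8.8, 16.3, 24.9, 36.2])

# The model is still linear in the coefficients; we simply add an X^2 term
# as an extra column and solve the same least-squares problem.
design = np.column_stack([np.ones_like(X), X, X ** 2])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)

print("intercept, X, X^2 coefficients:", np.round(coef, 3))
```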
What about outliers in linear regression?
In regression, we call outliers all points that fall far away from the “cloud” of other points. These points can be especially important because they can have a strong influence on the least-squares line. However, not all outliers have the same impact – some are more influential than others. For example, it is very important to watch points that fall horizontally away from the center of the cloud. They tend to have a strong influence on the slope of the least-squares line – they pull harder on the line – so we call them points with high leverage. Moreover, when a high leverage point has such an impact on the slope that, had we fitted the line without it, that point would have been incredibly far away from the least-squares line, we call it an influential point, because it influences the slope of the least-squares line.
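To see why such points matter, here is a small sketch with made-up numbers: it fits the least-squares line with and without a single high-leverage point and compares the two slopes.

```python
import numpy as np

# Hypothetical data: a tight "cloud" of points plus one high-leverage point
# far to the right that does not follow the trend of the cloud.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
Y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 5.0])

# Slope of the least-squares line with and without the last point.
slope_all, _ = np.polyfit(X, Y, deg=1)
slope_wo, _ = np.polyfit(X[:-1], Y[:-1], deg=1)

print(f"slope with the high-leverage point:    {slope_all:.2f}")
print(f"slope without the high-leverage point: {slope_wo:.2f}")
```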
Should you simply remove the outliers? This seems like a very inviting thing to do, but don't throw out data for no reason other than that it looks bad. As a rule of thumb, final models that simply ignore exceptions usually perform badly. Exceptions are usually there for a reason, and whatever final model you create should be capable of accommodating outliers.
Another thing to be careful about is using a categorical predictor when one of its levels has only a small number of points, because those points can become influential.
Linear regression and the key takeaway points
- The relationship between the independent and dependent variables must be linear, which means the line of best fit through the data points is a straight line rather than a curve or some sort of grouping factor.
- The dependent variable is continuous; the independent variable(s) can be continuous or discrete.
- Outliers – observations that fall far from the “cloud” of points – can significantly affect the least squares line and the forecasted values.
- Simple linear regression belongs to the group of parametric tests. For this reason, assumptions such as homoscedasticity (the size of the error stays more or less the same across all values of the independent variable) and normality must be met by the data. If your data does not meet the assumptions of homoscedasticity and normality, it is better to use a nonparametric test instead, such as the Spearman rank test.
- Multiple regression is sensitive to multicollinearity, autocorrelation, and heteroskedasticity; in the presence of multicollinearity in particular, the coefficient estimates become unstable and sensitive to minor changes in the data.
- When working with multiple independent variables, we can select the most significant ones with forward selection, backward elimination, or stepwise regression (a simple sketch of forward selection follows below).
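As a rough illustration of one of these strategies, here is a minimal forward-selection sketch in Python. The data and the selection criterion (residual sum of squares) are my own simplifications for the example; in practice you would typically rely on p-values, AIC/BIC, or a library routine.

```python
import numpy as np

def forward_selection(X, y, n_select):
    """Greedy forward selection (simplified): at each step, add the predictor
    that most reduces the residual sum of squares of an OLS fit."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(n_select):
        best_rss, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            design = np.column_stack([np.ones(n), X[:, cols]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = np.sum((y - design @ coef) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Hypothetical data: y depends mainly on columns 0 and 2; column 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=100)

print("selected predictors (by column index):", forward_selection(X, y, n_select=2))
```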