In this article, we will look at one of the most significant predictive analytics tools for machine learning and big data – regression. We will define it, explain why and in which cases we use it, and then examine seven types of regression analysis, noting which kinds of variables each technique works with and the key factors associated with it.
What is regression?
Regression is one of the leading predictive analytics tools for machine learning and big data. Data scientists are usually familiar with linear and logistic regression because they are so commonly used in the real world. But in fact there are more than a dozen types of regression algorithms, each designed for a different type of analysis and each carrying its own importance. Beyond these, you can experiment with formulas and come up with your own, new regression algorithm.
Regression analysis is a predictive modeling technique that investigates the relationship between variables. Its primary goal is to understand the relationship between two variables – a dependent variable, which represents the outcome or target, and an independent variable, or predictor, which drives it. The second goal of regression analysis, no less important, is to understand the strength of the relationship between the dependent and independent variables.
Correlation analysis as a prerequisite to regression
Regression analysis has a cousin – correlation analysis, which is also supported by a scatter plot and a regression line, and which is often carried out together with regression analysis. Correlation analysis does not assume that one variable is independent and the other dependent, as regression analysis does; it focuses only on the strength and direction of the relationship between two or more variables. Regression analysis, on the other hand, hypothesizes that one or more variables are independent and have a causal relationship with the dependent variable.
Why do we use regression analysis?
Three major uses for regression analysis are determining the strength of predictors, forecasting an effect, and trend forecasting.
Regression analysis is used for:
- forecasting
- time series modeling, and
- finding and evaluating connections between variables.
As the name suggests, forecasting is about predicting an outcome – be it weather, consumer behavior, or crime levels. Forecasting models are often used in sales. There are several forecasting methods, and one of them is time series modeling.
In time series modeling we deal with serially correlated data that is – you guessed it – indexed by time: years, months, days, even hours or minutes. We use time series modeling as a forecasting model so we can understand the data, uncover hidden insights, and make informed decisions.
With regression analysis, we want to understand a relationship between dependent and independent variables and evaluate the strength of such a relationship.
We’ve said it before – we use regression analysis to understand the relationship between two or more variables. Maybe an example could illustrate what we’re talking about.
Let’s say you’ve been gaining weight over the last few years. If you don’t plan to change your eating and exercise habits and keep putting on weight at the same rate, a simple linear regression can predict how much weight you’ll gain over the next five years.
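As a rough sketch of what that prediction could look like in code – the yearly weight measurements below are made up for illustration:

```python
import numpy as np

# Hypothetical yearly weight measurements (kg) over the last five years
years = np.array([0, 1, 2, 3, 4])          # years since the first measurement
weight = np.array([70.0, 71.5, 73.2, 74.1, 76.0])

# Fit a straight line: weight = slope * year + intercept
slope, intercept = np.polyfit(years, weight, deg=1)

# Extrapolate five years beyond the last measurement
future_year = 9
predicted = slope * future_year + intercept
print(f"Predicted weight in five years: {predicted:.1f} kg")
```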
Or let’s say you want to evaluate how your product sales will grow in the current economy. Based on data from past years, you can forecast future sales, provided economic conditions and your approach to business stay the same.
Or suppose you want to research whether socioeconomic status affects educational achievement, whether IQ influences earnings, or whether exercise affects your weight. These are all examples where we can use regression analysis – we evaluate the impact of an independent variable on a dependent one and estimate the strength of that relationship.
Types of regression techniques
We usually frame a problem as regression when the output variable is a real or continuous value, such as sales, salary, or even weight. But is it that simple? Which type of regression should we choose when trying to predict future sales, salary, or life expectancy with a specific disease? As we said at the beginning, there is much more to regression than linear and logistic regression – there are more than a dozen regression techniques, each with its own regression equation and regression coefficients. In general, each industry has a regression analysis that is most common for its problems. In medical research, for example, Cox regression is often used because it applies to survival times, life expectancy, or “time to event” data. Beyond that, simple or multiple linear regression and logistic regression are definitely the most commonly used.
But, how do we know when to use a specific type of regression technique to make a prediction? Regression techniques are driven by three metrics:
- the number and type of independent variables
- the number and type of dependent variables
- the shape of the regression line
So, there are various kinds of regression techniques available to make predictions, and these techniques are driven by the three metrics mentioned above. Of course, you can even create new regression models, but let’s put this aside for now and focus on the regression techniques that are most commonly used:
- We use simple linear regression when we have one dependent and one independent variable, both continuous and normally distributed. The relationship between the variables is linear.
Linear regression is the most basic and most commonly used regression technique. An example of when we could use simple linear regression would be asking: does the risk of developing high blood pressure increase with a person’s age? Both variables are continuous (we are dealing with quantitative data) and the line that defines the relationship between them is straight.
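As a sketch of how such an analysis might look in code – with made-up age and blood-pressure values – statsmodels reports both the fitted slope and its significance, i.e. the strength of the relationship:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical sample: ages and systolic blood pressures
age = rng.uniform(20, 70, size=100)
blood_pressure = 100 + 0.5 * age + rng.normal(0, 8, size=100)

# Ordinary least squares: blood_pressure ~ age
X = sm.add_constant(age)             # adds the intercept term
model = sm.OLS(blood_pressure, X).fit()

print(model.params)                  # intercept and slope
print(model.pvalues)                 # significance of each coefficient
print(model.rsquared)                # strength of the linear relationship
```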
- We use multiple linear regression when we have one continuous dependent variable and two or more independent variables that can be either continuous or categorical. The relationship between the variables is linear.
Multiple linear regression is similar to simple linear regression – in both cases we are dealing with one continuous dependent variable, but in multiple linear regression we work with at least two independent variables that can be either quantitative or qualitative. If we extend the example from simple linear regression, the research question could be: if a person has both high blood pressure and diabetes, how likely is the patient to develop heart disease? Here we are dealing with two independent variables that can be expressed either with continuous data (numeric values for blood pressure and blood sugar) or with qualitative data, where we record the presence or absence of high blood pressure and diabetes. In any case, to apply multiple linear regression there must be a linear relationship between the dependent and independent variables.
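A sketch of such a model using the statsmodels formula interface – the patient data and the continuous risk_score outcome below are invented for illustration, with C(...) marking diabetes as a categorical predictor:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical patient data: one continuous and one categorical predictor
df = pd.DataFrame({
    "blood_pressure": [120, 135, 150, 160, 145, 130, 170, 155],
    "diabetes":       ["no", "yes", "yes", "yes", "no", "no", "yes", "no"],
    "risk_score":     [2.1, 4.5, 5.8, 6.9, 3.2, 2.7, 7.5, 3.9],
})

# Multiple linear regression with a continuous and a categorical predictor
model = smf.ols("risk_score ~ blood_pressure + C(diabetes)", data=df).fit()
print(model.summary())
```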
- We use logistic regression when we have one binary dependent variable and two or more independent variables that can be either continuous or categorical in nature. The relationship between variables can be linear, but this linearity is not a condition for using logistic regression.
Binary means the variable has exactly two possible outcomes – yes or no, 0 or 1, dead or alive. For instance, if we are interested in whether a person will survive surgery, the possible outcomes are dead or alive, so we are dealing with a binary dependent variable. There are usually several predictor variables that influence that binary outcome, and they can be either quantitative or qualitative – for example, the person’s age and sex, or the presence of other diseases, which can be recorded as present or absent or, when present, expressed with disease-specific numeric values.
Logistic or logit regression does not require a linear relationship between the dependent and the independent (predictor) variables. A linear relationship is possible, but it is not obligatory.
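A minimal sketch with scikit-learn, using a made-up handful of patients with age, sex, and presence of another disease as predictors and a binary survived/died outcome:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical predictors: [age, sex (0/1), other disease present (0/1)]
X = np.array([
    [45, 0, 0],
    [62, 1, 1],
    [71, 0, 1],
    [38, 1, 0],
    [55, 0, 1],
    [80, 1, 1],
])
# Binary outcome: 1 = survived, 0 = died
y = np.array([1, 0, 0, 1, 1, 0])

clf = LogisticRegression().fit(X, y)

# Predicted probability of survival for a new 60-year-old patient
print(clf.predict_proba([[60, 0, 1]])[0, 1])
```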
It is also important to point out that if we have several independent variables, they should not correlate with each other, because this can give us inaccurate results. A good example: will the person develop sepsis after surgery (a binary dependent variable – sepsis yes or no) if we take into account several independent variables, such as the duration of the surgery, how much blood the patient lost, and the patient’s hemoglobin levels one week after surgery? Anyone with medical training will notice that these independent variables are closely related, and if we include all of them we will get a false result. In this case we must settle on a single independent variable, obviously the one with the biggest impact – arguably the duration of the surgery.
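One common way to screen for such correlated predictors before fitting the model is the variance inflation factor (VIF); here is a rough sketch with statsmodels on made-up surgery data (values above roughly 5–10 are usually read as problematic):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical, deliberately correlated surgery predictors
df = pd.DataFrame({
    "surgery_minutes": [90, 120, 180, 150, 200, 110],
    "blood_loss_ml":   [200, 350, 600, 480, 700, 300],
    "hemoglobin_g_dl": [13.5, 12.1, 10.2, 11.0, 9.5, 12.8],
})

# VIF for each predictor: how strongly it is explained by the others
X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```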
Another factor that should be taken into consideration with logistic regression is the sample size required to attain adequate statistical power. Some authors describe maximum likelihood estimation, including logistic regression, with fewer than 100 cases as risky, suggest that 500 cases is generally adequate for most experiments, and recommend at least 10 cases per predictor.
- We use multinomial regression when we are working with a non-binary dependent variable – one with more than two possible categories – and two or more independent variables that are either continuous or categorical in nature. Polynomial regression, by contrast, models a curved relationship between a continuous dependent variable and the independent variables by including higher powers of the predictors. As with logistic regression, a linear relationship between the variables is not a condition for using these techniques.
In other words, we apply multinomial regression when the dependent variable has more than two possible outcomes, and polynomial regression when a straight line cannot adequately describe the relationship between the variables.
An important aspect of polynomial regression is over-fitting: there may be a temptation to fit a higher-degree polynomial to get a lower error, which is why it is advisable to plot the relationships, inspect the fit, and make sure the curve matches the nature of the problem.
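A short sketch of that trade-off using numpy on made-up data – a quadratic captures the underlying curve, while a much higher degree merely chases the noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data with a genuinely curved (quadratic) relationship
x = np.linspace(0, 10, 30)
y = 2 + 1.5 * x - 0.2 * x**2 + rng.normal(0, 1, size=x.size)

for degree in (1, 2, 9):
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    print(degree, np.mean(residuals**2))  # training error always shrinks...

# ...but the degree-9 fit behaves wildly outside the observed range,
# which is why plotting the curve and checking the ends matters.
```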
In addition, be cautious about the curve towards the ends and check whether those shapes and trends make sense, because higher-degree polynomials can produce strange results when extrapolating.
- We use stepwise regression when we deal with multiple independent variables and the selection of independent variables is done without human intervention. The process is automatic: significant variables are detected by observing statistical values such as R-squared, t-statistics and the AIC metric, and covariates are added or dropped one at a time based on a specified criterion. The most commonly used stepwise methods are standard stepwise regression (which adds and removes predictors as needed at each step), forward selection (which starts with the most significant predictor and adds a variable at each step), and backward elimination (which works the opposite way: it starts with all predictors in the model and removes the least significant variable at each step).
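As an illustration, scikit-learn’s SequentialFeatureSelector performs this kind of automatic one-at-a-time selection – it scores candidates by cross-validation rather than by t-statistics or AIC, but the add/drop idea is the same. A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 candidate predictors, only 3 truly informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)

# Forward selection: start empty, add the best predictor at each step
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected predictors
```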
- We use ridge regression when the independent variables are highly correlated. When your data suffers from multicollinearity, the least-squares (OLS) estimates are unbiased, but their variances are large, so the observed values can deviate far from the true values. The biggest advantage of ridge regression is that it reduces these standard errors by adding a degree of bias to the regression estimates.
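To make that concrete, here is a small sketch on made-up, deliberately collinear data – ordinary least squares produces unstable coefficient estimates, while ridge’s penalty (the alpha parameter in scikit-learn) shrinks them toward stable values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)

# Two almost identical (highly correlated) predictors
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=100)

# OLS coefficients are unstable here and can swing to large opposite values
print(LinearRegression().fit(X, y).coef_)

# The ridge penalty shrinks them toward stable, similar values
print(Ridge(alpha=1.0).fit(X, y).coef_)
```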
- Cox or proportional hazards regression is often used in medical research, where the dependent variable represents the time to an event (aka survival time), and we have two or more predictors or independent variables that may be continuous or categorical in nature. The relationship between variables is rarely linear.
With this regression, we investigate the effect of several independent variables on the time a specified event takes to happen – in our case, death. The method is not truly nonparametric: it assumes that the effects of the predictor variables on survival are constant over time and additive on one scale.
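A minimal sketch of a Cox model using the lifelines library (a third-party package, assumed installed via pip install lifelines; the survival data below is made up):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical survival data: follow-up time in months,
# event = 1 if death was observed, 0 if the patient was censored
df = pd.DataFrame({
    "time":     [5, 12, 9, 24, 18, 30, 7, 15],
    "event":    [1, 1, 0, 0, 1, 0, 1, 1],
    "age":      [70, 62, 55, 48, 75, 50, 68, 60],
    "diabetes": [1, 0, 0, 0, 1, 0, 1, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()  # hazard ratios for age and diabetes
```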
So, the big question – how to choose the right regression model?
Yes, we know – so many options, and imagine if we had listed all the regression models that exist. Some people follow the logic: if the outcome is continuous, use linear regression; if the outcome is binary, use logistic regression. But is it really that simple?
The first thing you should do when choosing the right type of regression model is explore your data – identify the relationships between and impact of the variables, the types of the independent and dependent variables, the dimensionality, and other essential characteristics of the data.
Another important approach that can reveal the appropriateness of a regression model is analyzing different metrics. Three error metrics are commonly used for evaluating and reporting the performance of a regression model – Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). The appropriateness of a regression model can also be measured with R-squared, adjusted R-squared, AIC, BIC, the error term, Mallow’s Cp criterion, and so on.
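As a quick sketch, the three error metrics can be computed with scikit-learn on any pair of observed and predicted value arrays (the numbers below are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical observed vs predicted values from some regression model
observed  = np.array([3.0, 5.0, 7.5, 9.0])
predicted = np.array([2.8, 5.4, 7.0, 9.6])

mse = mean_squared_error(observed, predicted)
rmse = mse ** 0.5                    # RMSE is the square root of MSE
mae = mean_absolute_error(observed, predicted)
print(mse, rmse, mae)
```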
Another excellent approach to evaluating regression models is cross-validation. With this approach, you divide your data set into two groups – a training group and a validation group. You then take the mean squared difference between the observed and the predicted values on the validation group, which gives you a measure of prediction accuracy.
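A minimal sketch of that train/validation procedure on synthetic data, using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Made-up data set with three predictors
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Hold out a validation group, fit on the training group
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Mean squared difference between observed and predicted values
print(mean_squared_error(y_val, model.predict(X_val)))
```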
Another thing to keep in mind: if your data set has multiple confounding variables, you should not choose an automatic model selection method, because you do not want to put those variables into a model at the same time. Confounding variables affect other variables in ways that produce spurious or distorted associations between two variables, and therefore lead to false results.
In addition to all of the above, the selection of the best-fitting regression model also depends on your objective – a less powerful model is easier to implement than a highly statistically significant but more complex one.