In this article, we take a look at logistic (or logit) regression and learn the differences between binary, multinomial, and ordinal logistic regression.
What is logistic regression?
Like other regression analyses (simple and multiple linear, polynomial, and so on), logistic regression is a form of predictive analysis.
We use logistic regression when we want to explain the relationship between a dependent variable that is binary or dichotomous (the possible outcomes are yes or no, present or absent, 0 or 1) and one or more independent variables, which can be nominal, ordinal, interval, or ratio-level.
The main assumptions when using logistic regression are:
- the dependent variable is categorical: either A or B (binary logistic regression) or a finite set of options such as A, B, C, or D (multinomial logistic regression).
- the model relates the dependent variable to one or more independent variables by estimating probabilities with the logistic regression equation (a code sketch of this, and of the checks below, follows the list).
- the data should be free of outliers: the continuous predictors are converted to standardized scores, and values beyond a chosen threshold (a cutoff of roughly three standard deviations is a common choice) are removed.
- there should be no multicollinearity, i.e. no high correlations among the predictors; this can be checked with a correlation matrix, and to satisfy the assumption the correlation coefficients between independent variables should be below 0.90.
- be careful with overfitting: adding independent variables to a logistic regression model always increases the amount of variance explained in the log odds, but adding more and more variables can reduce how well the model generalizes beyond the data it was fit on. Goodness of fit can be evaluated with tests such as the Hosmer-Lemeshow test, which is based on the chi-square statistic.
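To make these points concrete, here is a minimal sketch in Python of what the checks and the probability estimation might look like. The data is synthetic and the column names (age, visits, bought) are invented for illustration; the checks use pandas and the model fit uses statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic example data with made-up predictors (age, visits) and a binary outcome.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "visits": rng.poisson(3, 500),
})
true_logit = -4 + 0.05 * df["age"] + 0.4 * df["visits"]
df["bought"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Multicollinearity check: correlations among predictors should stay below ~0.90.
print(df[["age", "visits"]].corr())

# Outlier screen: standardize the continuous predictors and drop extreme values
# (a cutoff of roughly |z| > 3 is a common, though not universal, choice).
z = (df[["age", "visits"]] - df[["age", "visits"]].mean()) / df[["age", "visits"]].std()
df = df[(z.abs() <= 3).all(axis=1)].copy()

# Fit the logistic regression; the model estimates
# P(bought = 1) = 1 / (1 + exp(-(b0 + b1*age + b2*visits))).
X = sm.add_constant(df[["age", "visits"]])
model = sm.Logit(df["bought"], X).fit(disp=False)
print(model.summary())

# Predicted probabilities for each observation.
df["p_hat"] = model.predict(X)
```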
What are some real-life examples of when we can use logistic regression?
As already said, we use logistic regression when we want to predict the probability of a specific event happening.
- Let’s say we want to estimate the probability of a customer buying a product on our website. The act of buying or not buying is the dependent variable – there are two options, the customer bought or didn’t buy, so yes or no. The characteristics of the customer – repeat visits, repeat spending on our website, behavior on the website, which site the customer came from, even age and sex – act as the independent variables in this predictive analysis. Determining which type of customer is likely to buy a product is the goal of the logistic regression.
- In manufacturing, we can estimate the probability of part failure in a machine based on how long the parts were held in inventory. With the information from this analysis, a manufacturer can adjust delivery schedules and installation times to reduce future failures.
- In the banking sector, a loan officer wants to know whether a customer is likely to stop making payments on a loan. A binary analysis like this can help assess the risk of extending credit to a particular customer.
- In medicine, logistic regression analysis can be used to predict the likelihood of disease or illness for a given population, for example (a short sketch of how such changes are quantified follows these questions):
– How does the probability of getting lung cancer change for every additional pack of cigarettes smoked per day?
– How does the probability of getting prostate cancer change for every pound a man is overweight?
– Does body weight have an influence on the probability of having a heart attack?
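To make the "for every additional pack" phrasing concrete: in a fitted logistic regression, exponentiating a coefficient gives the multiplicative change in the odds of the outcome for a one-unit increase in that predictor. Here is a minimal sketch with a made-up coefficient – the value 0.7 and the intercept are purely illustrative, not estimates from any real study.

```python
import numpy as np

# Hypothetical fitted coefficient for "packs of cigarettes smoked per day";
# the value 0.7 is purely illustrative, not an estimate from any real study.
beta_packs = 0.7

# exp(beta) is the odds ratio: the factor by which the odds of the outcome
# are multiplied for each additional pack smoked per day.
odds_ratio = np.exp(beta_packs)
print(f"Each extra pack multiplies the odds by about {odds_ratio:.2f}")

# Turning odds into a probability for a specific person requires the full
# linear predictor; for example, with an invented intercept of -3 and 2 packs/day:
intercept = -3.0
linear_predictor = intercept + beta_packs * 2
probability = 1 / (1 + np.exp(-linear_predictor))
print(f"Predicted probability at 2 packs per day: {probability:.2f}")
```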
Types of logistic regression
In general, there are three types of logistic regression:
- 1. Binary logistic regression
We use binary logistic regression when we want to model the event probability for a categorical response variable with two possible outcomes. As mentioned above, in the banking sector the estimation of whether a person will default on a credit card payment helps evaluate the risk of extending credit to a customer. So the main assumption when we use binary logistic regression is that the target or dependent variable is binary, meaning it can take only two values. Although the outcomes are constrained to two, this type of predictive analysis is used in sports statistics, medicine, the evaluation of landslide hazard, and even in handwriting analysis.
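As a sketch of the credit-default example – the data below is entirely synthetic and the feature names (income, debt) are assumptions for illustration – a binary logistic regression with scikit-learn might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer records: income, existing debt, and whether
# the customer defaulted (1) or not (0). Feature names are purely illustrative.
rng = np.random.default_rng(1)
income = rng.normal(50_000, 15_000, 1_000)
debt = rng.normal(10_000, 5_000, 1_000)
true_logit = -1 + 0.00008 * debt - 0.00003 * income
default = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = np.column_stack([income, debt])
X_train, X_test, y_train, y_test = train_test_split(X, default, random_state=0)

# Binary logistic regression: the target takes only two values (default / no default).
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Predicted default probabilities can feed a credit-risk decision rule.
probs = clf.predict_proba(X_test)[:, 1]
print(f"Mean predicted default probability: {probs.mean():.3f}")
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```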
- 2. Multinomial logistic regression
Multinomial logistic regression is a straightforward extension of binary logistic regression: instead of two, the dependent or outcome variable has three or more categories without any ordering, so multinomial logistic regression can be used to classify subjects into more than two groups or profiles and to predict behavior. As with binary logistic regression, it is important to evaluate the sample size – the recommended standard is at least 10 cases per independent variable – and to examine multicollinearity and outliers. Multicollinearity, for example, can be assessed with simple correlations among the independent variables, while a standard multiple regression can be run beforehand to screen out outliers and especially influential cases.
As an example, participants select one of several foods (A, B, or C) as their favorite, and based on that we can build profiles of the people most likely to be interested in each type of food – non-vegetarian, vegetarian, or vegan. With these profiles we can then plan a marketing strategy aimed at the people most likely to be interested in a specific product.
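A minimal sketch of the favourite-food example in Python follows; the data is synthetic and the predictors (age, weekly grocery spend) are assumptions for illustration. With more than two classes, recent versions of scikit-learn fit a multinomial (softmax) model by default with the lbfgs solver.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic respondents: age and weekly grocery spend as made-up predictors,
# favourite food coded as an unordered category A, B or C.
rng = np.random.default_rng(2)
n = 300
age = rng.integers(18, 70, n)
spend = rng.normal(80, 25, n)
X = np.column_stack([age, spend])
y = rng.choice(["A", "B", "C"], size=n)  # nominal: no ordering among categories

# With more than two classes, the default (lbfgs) solver in recent scikit-learn
# versions fits a multinomial logistic regression: one set of coefficients per category.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted class probabilities for a new, hypothetical respondent.
new_person = np.array([[35, 90.0]])
print(dict(zip(clf.classes_, clf.predict_proba(new_person)[0].round(3))))
```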
- 3. Ordinal logistic regression
Ordinal logistic regression evaluates the relationship between an ordinal dependent variable and one or more independent variables. The dependent or response variable must be ordinal, which is similar to a categorical variable; the difference, however, is that an ordinal variable's categories have a clear ordering. The independent or explanatory variables, on the other hand, can be either continuous or categorical.
An important assumption in ordinal logistic regression is that the odds are proportional: the effect of the independent or explanatory variables is constant across every increase in the level of the response. In this sense, ordinal logistic regression is an extension of binary logistic regression in which the log odds of a cumulative binary response (being at or below a given category) are linearly related to the independent variables.
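Written out, the standard cumulative-logit (proportional odds) form of the model is below. Here j indexes the ordered categories, each cut-point gets its own intercept α_j, and a single set of slopes β1, …, βk is shared across all cut-points – which is exactly the proportional odds assumption (note that the sign convention for the slopes varies between textbooks and software packages):

\[
\log \frac{P(Y \le j)}{P(Y > j)} = \alpha_j + \beta_1 x_1 + \dots + \beta_k x_k, \qquad j = 1, \dots, J - 1
\]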
An example of where we can use ordinal logistic regression is when we want to investigate, for instance, how comfortable newly engineered car seats are. We can use a scale from 1 to 5, with 1 being the most uncomfortable and 5 the most comfortable. The independent or explanatory variables of interest could be the gender and age of the participants.
Another example could come from the world of sports: which factors influence whether a junior skier wins a bronze, silver, or gold medal? We assume that the relevant predictors include the age at which the skier started training, the number of training hours, diet, the proximity of slopes, the popularity of skiing in the home country, and the number of hours spent on slopes during the summer.
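As a rough sketch of the seat-comfort example in Python – the data below is entirely synthetic and the predictor names (age, female) are invented for illustration – the OrderedModel class in statsmodels (version 0.12 or later) fits a cumulative-logit model:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Synthetic survey data: comfort rating on an ordered 1-5 scale, with the
# respondent's age and gender (female = 1) as hypothetical predictors.
rng = np.random.default_rng(3)
n = 400
age = rng.integers(18, 75, n)
female = rng.integers(0, 2, n)
latent = 0.02 * age + 0.3 * female + rng.logistic(size=n)
rating = pd.cut(latent, bins=[-np.inf, 0.5, 1.0, 1.5, 2.0, np.inf],
                labels=[1, 2, 3, 4, 5])  # pd.cut yields an ordered categorical
df = pd.DataFrame({"age": age, "female": female, "rating": rating})

# Proportional-odds (cumulative logit) model: a single slope per predictor,
# shared across all cut-points of the ordered outcome.
model = OrderedModel(df["rating"], df[["age", "female"]], distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())

# Predicted probability of each rating category for the first few respondents.
probs = result.predict(df[["age", "female"]])
print(np.asarray(probs)[:5].round(3))
```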
When should we use multinomial and when ordinal logistic regression?
On some occasions, deciding between multinomial and ordinal logistic regression can be tricky in practice, because both models are used for categorical outcomes with more than two categories. The main decision criterion is that with the ordinal model the categories have some sort of order (for example bronze, silver, and gold medals, or uncomfortable, comfortable, and very comfortable car seats), while with the multinomial model the outcome is nominal and the categories have no order.
However, sometimes things aren’t as simple as that.
We should point out that there are quite a few models for ordinal outcomes, which describe the ordering of the outcome categories in different ways, but only one logistic regression model suitable for nominal outcomes. Luckily, most software packages offer just one model each for nominal and ordinal outcomes.
Another issue we should point out regarding ordinal outcomes is the proportional odds model which, as the name suggests, rests on an assumption that is rarely met in real-life data: the odds are proportional (equivalently, the regression lines are parallel), meaning that each predictor's effect on the likelihood of shifting to a higher category along the scale is the same at every point on that scale.
To sum up: if you are dealing with a nominal outcome, be careful not to run an ordinal model – that one is obvious. If you are dealing with an ordinal outcome and the proportional odds assumption holds, we recommend running the cumulative logit version of ordinal logistic regression. If the proportional odds assumption isn't met, you can run a different ordinal model or, perhaps surprisingly, still try a nominal (multinomial) model if it answers your research question.
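Finally, one informal way to get a feel for whether the proportional odds assumption is plausible – this is only a rough sketch, not a substitute for a formal check such as the Brant test – is to fit both the cumulative-logit ordinal model and a multinomial model on the same data and compare their log-likelihoods. The multinomial model always fits at least as well because it has more parameters, but a very large gap suggests the shared-slope assumption may be too restrictive. Reusing the synthetic seat-comfort data from the earlier sketch:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Recreate the synthetic seat-comfort data from the ordinal sketch above.
rng = np.random.default_rng(3)
n = 400
age = rng.integers(18, 75, n)
female = rng.integers(0, 2, n)
latent = 0.02 * age + 0.3 * female + rng.logistic(size=n)
rating = pd.cut(latent, bins=[-np.inf, 0.5, 1.0, 1.5, 2.0, np.inf],
                labels=[1, 2, 3, 4, 5])
df = pd.DataFrame({"age": age, "female": female, "rating": rating})

# Cumulative-logit (proportional odds) model: one slope per predictor.
ordinal_fit = OrderedModel(df["rating"], df[["age", "female"]],
                           distr="logit").fit(method="bfgs", disp=False)

# Multinomial model: a separate set of slopes per outcome category, ordering ignored.
exog = sm.add_constant(df[["age", "female"]])
multinomial_fit = sm.MNLogit(df["rating"].cat.codes, exog).fit(disp=False)

# The multinomial model has more parameters and always fits at least as well;
# a large log-likelihood gap hints that proportional odds may not hold here.
print("Ordinal log-likelihood:    ", round(ordinal_fit.llf, 2))
print("Multinomial log-likelihood:", round(multinomial_fit.llf, 2))
print("Extra parameters in the multinomial model:",
      multinomial_fit.params.size - ordinal_fit.params.size)
```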