What is a variable, and what can we do with one? These are some of the questions we are going to answer in this article. They are the very basics of data science, but they are important to understand before you take a leap into topics that are much more complex. So, let’s start.
1. Every data scientist should at least approximately know a definition of a variable
A variable is any data item, any kind of attribute or characteristic, that can be measured or counted and can assume different values. A variable can be a place of birth, a level of education, income, or a type of housing; it can describe a place, a thing, a person, or even an idea. Variables are at the core of research and should therefore be clearly identified.
2. Variables are classified into two main categories
The two main variable categories are:
- 1. Qualitative or categorical
- 2. Quantitative or numerical
Qualitative or categorical variables have descriptive values. They only take values in the form of names or labels, such as eye or hair color, a person’s name, nationality, type of housing, or how much we like or dislike something.
Quantitative or numerical variables, on the other hand, are something we can count or measure in numbers, such as a person’s height, weight, age, number of children, or income.
3. Categorical or qualitative variables can be either nominal or ordinal
As already explained above, categorical variables have descriptive values, and their characteristics cannot be quantified: they cannot be counted or measured. In addition to that, categorical or qualitative variables have two subcategories:
- nominal variables
- ordinal variables
A nominal variable describes a name or a category that does not have a natural order. Good examples of nominal variables are sex, nationality, eye and hair color, or type of housing.
Compared to a nominal variable, an ordinal variable has a natural order. In other words, the values of an ordinal variable are defined by an order relation between the different categories. For example, when asking people what they think about a new car design, we can offer several answers that have a natural ordering, such as “I think the design is excellent”, “very good”, “good”, “bad”, or “awful”. It is natural to all of us that a design rated “excellent” is valued higher than one rated “very good” or “good”. However, these answers have limitations, because we don’t really know whether “excellent” means the same thing to one respondent as “very good” means to another.
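To make the difference concrete, here is a minimal Python sketch using the car design answers from above. The names `RATING_ORDER` and `rank`, as well as the sample data, are our own assumptions for illustration and not part of any library.

```python
from collections import Counter

# Ordered categories for the car design question, from worst to best.
# (Hypothetical constant; the order is what makes the variable ordinal.)
RATING_ORDER = ["awful", "bad", "good", "very good", "excellent"]

def rank(rating: str) -> int:
    """Return the position of a rating on the ordered scale (0 = worst)."""
    return RATING_ORDER.index(rating)

answers = ["good", "excellent", "bad", "very good", "good"]

# Ordinal values can be compared and sorted by their rank.
is_better = rank("excellent") > rank("very good")
sorted_answers = sorted(answers, key=rank)

# Nominal values have no order: we can only check equality and count.
eye_colors = ["blue", "brown", "brown", "green"]
color_counts = Counter(eye_colors)
```

Note that the sketch never subtracts two ranks: the order is meaningful, but the distance between “good” and “very good” is not defined.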
4. We can “translate” categorical or qualitative variables into numbers
As we said, categorical variables are not quantifiable: we can’t measure or count them. However, during data coding we can “translate” them into numbers in a way that keeps a clear correspondence between the numbers and the categories. To illustrate what we’re talking about, let’s go back to the example of a new car design. The answers we offered (“I think the design is excellent”, “very good”, “good”, “bad”, “awful”) could easily be translated to numbers, in terms of “rate the design of a car on a scale from one to five, where one stands for completely dissatisfied or the worst and five stands for very satisfied or excellent.” This coding is self-explanatory with ordinal variables, where the order comes naturally. This way of presenting closed-ended questions is also known as a Likert scale. On the other hand, we can even assign numbers to nominal variables, which don’t have a natural order, but we need access to the metadata and we have to define the code set for every categorical variable. At the end of the day, transforming categorical variables into numbers is far from obligatory, and if you decide to do that, we strongly suggest that you give each value a label so that you or anyone else looking at the data understands what each value represents.
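The coding step could be sketched in Python roughly as follows. The code sets `DESIGN_CODES` and `HOUSING_CODES`, the housing categories, and the helpers `encode` and `decode` are hypothetical names we made up for this example.

```python
# Hypothetical code sets: every numeric code keeps a human-readable label.
DESIGN_CODES = {1: "awful", 2: "bad", 3: "good", 4: "very good", 5: "excellent"}
# For a nominal variable the numbers are arbitrary identifiers, not ranks.
HOUSING_CODES = {1: "apartment", 2: "house", 3: "other"}

def encode(labels, code_set):
    """Translate labels into numeric codes using the given code set."""
    label_to_code = {label: code for code, label in code_set.items()}
    return [label_to_code[label] for label in labels]

def decode(codes, code_set):
    """Translate numeric codes back into their labels."""
    return [code_set[code] for code in codes]

answers = ["good", "excellent", "bad"]
coded_answers = encode(answers, DESIGN_CODES)        # [3, 5, 2]
restored_answers = decode(coded_answers, DESIGN_CODES)

housing = ["house", "apartment", "house"]
coded_housing = encode(housing, HOUSING_CODES)       # [2, 1, 2]
```

Keeping the code set next to the data is what lets anyone decode the numbers later. For the nominal housing codes, comparing or averaging the numbers would be meaningless.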
In addition to that, let’s clarify that categorical variables can also contain numbers (without transformation as explained above), they do not always contain labels or strings. An example of such a case would be the class of a train in which you are traveling (first class, second class…). In this precise case, we’re talking about an ordinal categorical variable.
Another interesting aspect of categorical variables is how to treat date and time values. They are numerical variables, one would say. No, a date is a categorical variable, says another. Well, it depends on the context.
Let’s take dates for example. On the surface, the day of the month is just a number between 1 and 31, but dates can be viewed in different ways. What if you have only a few distinct days? Or what if the day of the week is what matters? Let’s say you want to know the best day of the week to sell a car. You would convert each date into a day of the week (Monday, Tuesday, etc.), and that would be a nominal variable that makes sense. You could also treat dates as ordinal, for example when listing car models according to the date they were released. The most general approach is to treat dates as a continuous numerical variable on an interval scale, because the starting point is arbitrary, the units are fixed, and there is no true zero. However, you might well convert dates into, let’s say, days since a particular event, and in this case dates would become a ratio variable. This is a common problem (and beauty) of working with statistics: if you apply rules blindly, you will quickly run into trouble, so yes, you have to think about what you are doing.
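Here is a short Python sketch of these different views of the same dates, using only the standard `datetime` module. The sale dates and the launch date are made-up sample values, and the variable names are ours.

```python
from datetime import date

# Hypothetical sample data: dates on which cars were sold.
sale_dates = [date(2023, 3, 10), date(2023, 3, 6), date(2023, 3, 11)]

# Nominal view: keep only the day of the week.
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
weekdays = [WEEKDAY_NAMES[d.weekday()] for d in sale_dates]

# Ordinal view: dates can be put in order.
ordered_dates = sorted(sale_dates)

# Interval view: the difference between two dates is meaningful.
gap_in_days = (ordered_dates[1] - ordered_dates[0]).days

# Ratio view: days since a particular event (a hypothetical launch date)
# have a true zero, so "twice as long after launch" makes sense.
launch = date(2023, 3, 1)
days_since_launch = [(d - launch).days for d in ordered_dates]
```

Which view is appropriate depends on the question you are asking of the data, not on the column itself.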
5. Quantitative or numerical variables can be additionally classified as discrete or continuous
As already explained above, quantitative variables have numeric values, so their characteristics are quantifiable: they can be counted or measured. In addition to that, quantitative or numerical variables have two subcategories:
- continuous variables
- discrete variables
A continuous variable can take any value within an interval. For example, if we take the speed of a car, the value can’t be negative and it can’t be higher than, let’s say, 250 km/h. But between 0 and 250 km/h, the number of possible values is theoretically infinite. In practice, however, the accuracy of the measurement instrument restricts the precision of the variable, so the reported speed is rounded.
As opposed to a continuous variable, which can theoretically take any value within an interval, a discrete variable can assume only a finite number of values within a given interval. Usually, these are whole numbers. The number of gears or the number of people that can sit in a car are discrete variables. The size of a car’s trunk in liters is often treated as discrete as well when it is reported in whole liters, although volume itself is continuous.
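One practical consequence shows up as soon as we try to tabulate the data. The following Python sketch uses made-up sample values, and the bin width of 25 km/h is an arbitrary choice for illustration.

```python
from collections import Counter

# Hypothetical sample: number of gears (discrete), measured speed in km/h (continuous).
gears = [5, 6, 6, 5, 8, 6]
speeds_kmh = [87.63, 92.10, 120.48, 64.99, 131.27, 99.50]

# Discrete values repeat, so they can be tallied directly.
gear_counts = Counter(gears)

# Continuous values rarely repeat exactly, so we group them into
# intervals first; each key is the lower edge of a 25 km/h bin.
BIN_WIDTH = 25
speed_bins = Counter(int(s // BIN_WIDTH) * BIN_WIDTH for s in speeds_kmh)
```

Here `gear_counts` plays the role of a frequency table for a bar plot, while `speed_bins` corresponds to the bars of a histogram.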
6. Other classifications of variables
In 1946, Stanley Smith Stevens introduced four scales of measurement: nominal, ordinal, interval, and ratio. These four scales are still widely used today as a way to describe the characteristics of a variable.
Nominal and ordinal variables belong to qualitative data, and we’ve already talked about them.
Interval and ratio variables belong to quantitative data:
- An interval variable is one where there is order and the difference between two values is meaningful, but there is no true zero. The temperature of an engine in degrees Celsius at a specific speed is an example.
- A ratio variable has all the properties of an interval variable, but it also has a clear definition of a zero value. With ratio variables, unlike with interval variables, the ratio of two measurements has a meaningful interpretation. The speed of a car or its acceleration rate are good examples of ratio variables.
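A quick way to see the difference between the two scales is to look at temperature. In the sketch below, the two engine temperatures are made-up values and `celsius_to_kelvin` is our own helper. Degrees Celsius form an interval scale (zero is an arbitrary point), whereas kelvins form a ratio scale (zero is absolute zero).

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert degrees Celsius to kelvins."""
    return celsius + 273.15

# Hypothetical engine temperatures at two different speeds.
t1_c, t2_c = 40.0, 80.0

# Differences are meaningful on both scales and agree with each other.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# A ratio computed in Celsius depends on the arbitrary zero point,
# so "80 °C is twice as hot as 40 °C" is not a valid statement.
naive_ratio = t2_c / t1_c

# On the Kelvin scale the ratio has a physical interpretation.
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)
```

The naive ratio comes out as 2, while the ratio in kelvins is only about 1.13, which is why ratios should be reserved for variables with a true zero.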
7. Classification of variables according to the number of variables that are studied
In addition to the distinction between quantitative and qualitative data, statistical data is often classified according to the number of variables that are studied: univariate, bivariate, or multivariate. When we look at only one variable in a study, we are working with univariate data. When we examine the relationship between two variables, we are working with bivariate data, and with more than two variables, multivariate data. For example, if we conduct a study that examines the relationship between the speed of a car and its number of gears, we are working with bivariate data.
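As a rough illustration, the Python sketch below computes one univariate summary (the mean) and one bivariate summary (the Pearson correlation coefficient) for the speed and gears example. The data are invented, and `pearson` is our own small helper written out for clarity, not a library function.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical sample: one row per car.
gears = [4, 5, 5, 6, 6, 8]
top_speed_kmh = [160, 180, 175, 200, 210, 240]

# Univariate: we describe one variable on its own.
average_speed = mean(top_speed_kmh)

# Bivariate: we describe how two variables move together.
speed_gear_correlation = pearson(gears, top_speed_kmh)
```

A value of `speed_gear_correlation` close to 1 would suggest that, in this sample, cars with more gears tend to have a higher top speed.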
8. Why should you care about the type of variable you’re dealing with?
It is important to identify and understand the types of variables in a study because variables are the basic units of information. For this reason, scientists carefully analyze and interpret every variable and its values to make sense of how things relate to each other in a descriptive study or an experiment. Depending on the type of variable, you choose the corresponding processing technique and statistical analysis, design your study, select your tests, and interpret the results. Let’s take the visual presentation of data as an example. If we analyze a single variable (univariate analysis), we can use a bar plot or a histogram, but if we analyze several variables together (multivariate analysis), those plots are usually not appropriate. Instead, for multivariate analysis, we use scatter plots, contour plots, or multi-dimensional plots.
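The same idea applies to something as simple as choosing a “typical value”. The Python sketch below is one possible way to pick a summary based on the measurement scale; the function `summarize`, its parameters, and the sample data are assumptions we made for illustration.

```python
from collections import Counter
from statistics import mean

def summarize(values, scale, order=None):
    """Return a typical value that suits the given measurement scale."""
    if scale == "nominal":
        # Only counting makes sense: report the most frequent category.
        return Counter(values).most_common(1)[0][0]
    if scale == "ordinal":
        # Order matters but distances do not: report the (lower) median category.
        ranked = sorted(values, key=order.index)
        return ranked[(len(ranked) - 1) // 2]
    if scale in ("interval", "ratio"):
        # Differences are meaningful, so the arithmetic mean is defined.
        return mean(values)
    raise ValueError(f"unknown scale: {scale}")

# Hypothetical sample data.
housing = ["house", "apartment", "house", "other"]
ratings = ["good", "excellent", "bad", "good", "very good"]
rating_order = ["awful", "bad", "good", "very good", "excellent"]
speeds_kmh = [90.0, 110.0, 130.0]

typical_housing = summarize(housing, "nominal")
typical_rating = summarize(ratings, "ordinal", order=rating_order)
typical_speed = summarize(speeds_kmh, "ratio")
```

Asking for the mean of the housing types would make no sense, and that is exactly the kind of mistake that knowing the variable type helps you avoid.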