Data cleaning is an essential step in the data analysis process. It involves preparing and transforming raw data into a more useful and accurate format. SQL (Structured Query Language) is a powerful tool for data cleaning because it can handle large datasets efficiently and provides various functions and operations to manipulate data.
Basic Steps in Data Cleaning with SQL
Removing Duplicates
Duplicate records can distort your analysis. You can use the DISTINCT keyword or the ROW_NUMBER() window function to remove duplicates:
DISTINCT
SELECT DISTINCT column1, column2, ...
FROM table_name;
The DISTINCT keyword is used to remove duplicate rows from the result set. So, when you run this query, the database engine scans the specified columns (column1, column2, …) in table_name. It returns only unique combinations of these columns. Any duplicate rows (rows where the values in all specified columns are identical) are removed from the result set.
This method is straightforward and efficient when you simply want to retrieve a unique set of rows based on the specified columns. It does not modify the original table; it only affects the result of the query.
ROW_NUMBER
WITH CTE AS (
    SELECT column1, column2, ...,
           ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column2) AS row_num
    FROM table_name
)
DELETE FROM CTE WHERE row_num > 1;
This method uses a Common Table Expression (CTE) with the ROW_NUMBER() window function to identify and remove duplicate rows from a table. Within the CTE, ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column2) assigns a sequential number to each row within each partition (each unique value of column1): the numbering restarts at 1 for every partition, and rows that share the same column1 value are ordered by column2. After the CTE is defined, the DELETE statement targets rows where row_num is greater than 1, so for each set of duplicates (rows with the same column1) only the first row (row_num = 1) is kept and all subsequent rows are deleted. Note that deleting through a CTE in this way is supported in SQL Server; MySQL and PostgreSQL do not allow a CTE as the target of a DELETE, so they require a slightly different approach.
This method is useful when you want to remove duplicates directly from the table. It provides more control over which duplicates to keep (e.g., based on ordering by column2). By deleting duplicate rows we modify the original table.
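In MySQL, for instance, a common workaround is a multi-table DELETE with a self-join. This is only a sketch: it assumes the table has a unique id column and keeps the row with the lowest id in each group of duplicates:
-- Assumes a unique id column; keeps the lowest id per group of rows sharing column1
DELETE t1
FROM table_name t1
JOIN table_name t2
  ON t1.column1 = t2.column1
 AND t1.id > t2.id;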
The simplest way to remove duplicates in a query result is the DISTINCT keyword. This method works well when you want to select unique records based on specific columns. Grouping can also help identify unique records: GROUP BY groups the records based on the specified columns, effectively removing duplicates from the result set, and aggregate functions can then be applied to each group. And when you need more control over which duplicate to keep, ROW_NUMBER() is useful.
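As a sketch of the GROUP BY approach (using the same placeholder names as above), you can collapse duplicates into one row per group and, with HAVING, list the combinations that occur more than once:
-- One row per unique combination of column1 and column2
SELECT column1, column2, COUNT(*) AS occurrences
FROM table_name
GROUP BY column1, column2;
-- Only the combinations that appear more than once (the duplicates)
SELECT column1, column2, COUNT(*) AS occurrences
FROM table_name
GROUP BY column1, column2
HAVING COUNT(*) > 1;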
For those interested in more information about removing duplicates in MySQL we suggest reading the article in the link.
Handling Missing Values
Handling missing values is a crucial step in data cleaning to ensure the integrity and accuracy of your data analysis. Missing values can arise from various sources, such as data entry errors, data corruption, or incomplete data collection. SQL provides several methods to handle missing values effectively. Here are some common techniques for handling missing values in SQL:
- Identifying Missing Values
- Replacing Missing Values
- Removing Rows with Missing Values
- Using Default Values
- Imputing Missing Values
- Advanced Imputation Techniques
Before you handle missing values, you need to identify them. In SQL, missing values are represented by NULL, and you can use the IS NULL or IS NOT NULL operators to find missing or non-missing values. The next step is to replace missing values or, where appropriate, remove the rows that contain them. If you want to use default values, you can define them for columns when creating or altering a table, so missing values are handled automatically. Imputation involves replacing missing values with substituted values derived from other data points; basic techniques use the mean, median, or mode of the column. For more sophisticated methods, such as regression imputation, k-nearest neighbors imputation, or machine learning models, you will typically need additional tools or more complex queries, but they can provide better estimates for missing values.
Below you will find a simple example of how to replace and remove missing values:
-- Replace missing values
UPDATE table_name
SET column1 = 'default_value'
WHERE column1 IS NULL;
-- Remove rows with missing values
DELETE FROM table_name
WHERE column1 IS NULL;
In the first part of the code snippet above we are replacing missing values (NULL) in column1 with a specific default value ('default_value'). The UPDATE statement modifies existing records in table_name, the SET column1 = 'default_value' clause specifies that column1 should be set to 'default_value', and the WHERE column1 IS NULL condition ensures that only rows where column1 is currently NULL are updated.
This approach is useful when you want to ensure that there are no missing values in column1 by replacing them with a meaningful default value. This can be helpful if NULL values are problematic for analysis or processing.
In the second part of the code snippet we are removing all rows where column1 still contains NULL values. The DELETE statement removes records from table_name, and the WHERE column1 IS NULL condition ensures that only rows where column1 is NULL are deleted.
This approach is used when rows with missing values in column1 are considered invalid or unneeded, and you want to completely remove such rows from the dataset.
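For the basic imputation mentioned earlier, here is a minimal sketch that replaces NULLs with the column average; it assumes column1 is numeric, and the derived table works around MySQL's restriction on reading from the table that is being updated:
-- Replace missing numeric values with the column average
UPDATE table_name
SET column1 = (
    SELECT avg_value
    FROM (SELECT AVG(column1) AS avg_value FROM table_name) AS t
)
WHERE column1 IS NULL;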
If you would like to know more about handling missing values in MySQL, we suggest reading the article in the link.
Standardizing Data
Standardizing data is an important step in the data cleaning process. It ensures consistency and uniformity within the dataset, which is crucial for accurate analysis, reporting, and machine learning models. Standardizing data can involve various transformations and operations depending on the nature of the data and the requirements of the analysis. The key aspects of data standardization are:
- Consistent Data Formats
- Uniform Text Case
- Removing Unwanted Characters
- Correcting Data Types
- Standardizing Numerical Values
- Ensuring Consistent Categories
- Handling Date and Time Values
The main goal of standardizing data is to ensure that it follows a consistent format: uniform date formats, consistent capitalization, and no unwanted characters. Below you will find examples of converting text to uppercase, removing unwanted characters, and converting date formats:
-- Convert text to uppercase
UPDATE table_name
SET column1 = UPPER(column1);
The code above is used to convert all the text values in column1 of table_name to uppercase. This helps achieve consistency and standardization in the dataset. The UPPER() function is applied to each value in column1, replacing it with its uppercase equivalent.
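The list above also mentions removing unwanted characters. A small sketch using TRIM and REPLACE (same placeholder names as before) strips leading and trailing spaces and removes dashes:
-- Remove leading/trailing spaces and strip dashes from a text column
UPDATE table_name
SET column1 = REPLACE(TRIM(column1), '-', '');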
-- Convert date formats
UPDATE table_name
SET date_column = TO_DATE(date_column, 'MM/DD/YYYY');
The SQL statement above converts the values in a column (date_column) of a table (table_name) from text into actual dates; the format mask 'MM/DD/YYYY' tells TO_DATE how the existing text is laid out so it can be parsed into a date data type. The TO_DATE function is available in Oracle and PostgreSQL. MySQL does not have TO_DATE: there you can parse strings into dates with STR_TO_DATE(), and you can change how a date is displayed with the DATE_FORMAT() function, which formats a date value based on a specifier that defines the output format:
DATE_FORMAT(date, format_mask)
If you want to permanently change how dates are stored (which is generally not recommended, because it's better to store dates in their native format and only format them when displaying), you would need to convert each date to a string in the new format and store it in a new varchar-type column. You could do that by adding a new column for the formatted date string and then updating it with the formatted values. This way, the original date column remains unchanged and you can still perform date calculations, while the new column holds the dates in the desired format as strings. Keep in mind, however, that once you convert a date to a string, you lose the ability to easily perform the date calculations and comparisons that actual date or datetime types support.
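As a sketch of that approach (assuming MySQL, a DATE-typed date_column, and a hypothetical date_text column for the formatted string):
-- Add a text column and fill it with the formatted date string
ALTER TABLE table_name ADD COLUMN date_text VARCHAR(10);
UPDATE table_name
SET date_text = DATE_FORMAT(date_column, '%m/%d/%Y');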
Filtering Out Unwanted Data
Filtering out unwanted data is a crucial part of the data cleaning process. This step involves identifying and removing records that are incomplete, incorrect, irrelevant, or otherwise unsuitable for analysis. By filtering out unwanted data, you improve the quality of your dataset, making it more accurate and reliable for further analysis. Here are some common techniques and SQL commands used to filter out unwanted data:
- Removing Null or Missing Values
- Filtering Based on Conditions
- Removing Duplicates
- Filtering Outliers
- Dropping Irrelevant Columns
- Using Joins to Exclude Unwanted Data
In our example we will filter based on a specific condition, such as a range, specific values, or another logical criterion. Here we remove rows where the value in column1 is less than 10:
DELETE FROM table_name
WHERE column1 < 10;
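The list above also mentions using joins to exclude unwanted data. Here is a sketch with hypothetical sales and customers tables, removing sales rows that have no matching customer:
-- Delete sales rows whose customer_id has no match in customers
DELETE FROM sales
WHERE NOT EXISTS (
    SELECT 1
    FROM customers c
    WHERE c.id = sales.customer_id
);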
Correcting Data Types
Correcting data types is an important part of data cleaning and preparation. Ensuring that each column in your dataset has the correct data type helps maintain data integrity and facilitates accurate analysis. Data type correction involves converting data from one type to another to match the expected format or to improve the efficiency of data storage and processing. Common scenarios for correcting data types are:
- Converting Text to Numeric Types
- Converting Text to Dates
- Converting Between Numeric Types
- Converting Boolean Representations
- Correcting Data Types in Mixed-Type Columns
When you're ensuring that each column has the correct data type, there are a few important considerations. First, data compatibility: make sure the existing data in the column is compatible with the new data type. For example, converting text to an integer will fail if the text contains non-numeric characters. Also be aware that some conversions can result in data loss; converting a decimal to an integer, for instance, loses the fractional part.
Second, default values and constraints: the ALTER COLUMN statement may be affected by existing default values and constraints, which might need to be adjusted or dropped before the conversion.
Finally, altering large tables can be resource-intensive and time-consuming, so monitor database performance; it might require careful planning and execution during off-peak hours to minimize the impact.
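Before running the conversion, it can help to find values that would not survive the cast. Here is a sketch in MySQL syntax, assuming the price column currently holds text:
-- Find price values that are not valid numbers before changing the column type
SELECT price
FROM products
WHERE price NOT REGEXP '^[0-9]+([.][0-9]+)?$';
In PostgreSQL you would use the ~ and !~ regular-expression operators instead of REGEXP.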
Let's take a look at how we can convert the price column from text to a decimal type; here is an example for PostgreSQL:
ALTER TABLE products
ALTER COLUMN price TYPE DECIMAL USING price::DECIMAL;
In MySQL, the syntax for altering a column's type is different. Here’s how you can change the price column in the products table to the DECIMAL type in MySQL:
ALTER TABLE products
MODIFY COLUMN price DECIMAL;
If you need to specify the precision and scale for the DECIMAL type (e.g., DECIMAL(10,2)), include them in the command; this changes the price column to a DECIMAL type with 10 digits in total and 2 digits after the decimal point. For instance:
ALTER TABLE products
MODIFY COLUMN price DECIMAL(10,2);
If your column type change involves a conversion of data that requires specific handling, you might need additional steps, such as creating a temporary column to hold the converted data. However, for a straightforward type change to DECIMAL, the MODIFY COLUMN syntax should suffice.
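If such handling is needed, one possible sketch (MySQL 8.0 syntax, with a hypothetical price_new column and an assumed '$' prefix to strip) looks like this:
-- 1. Add a temporary column with the target type
ALTER TABLE products ADD COLUMN price_new DECIMAL(10,2);
-- 2. Copy the data, cleaning it during the conversion
UPDATE products
SET price_new = CAST(REPLACE(price, '$', '') AS DECIMAL(10,2));
-- 3. Drop the old column and rename the new one
ALTER TABLE products DROP COLUMN price;
ALTER TABLE products RENAME COLUMN price_new TO price;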
Joining Tables in SQL to Enrich Data
We can combine data from multiple tables to get a complete dataset. Joining tables is a fundamental operation in SQL used to combine data from two or more tables based on a related column between them. This technique is essential for enriching data, enabling more comprehensive analysis, and creating complex queries that draw on multiple data sources.
SELECT a.column1, b.column2
FROM table_a a
JOIN table_b b ON a.id = b.id;
In the example above the SQL query retrieves data from two tables, table_a and table_b, by performing an inner join based on a common column id. It selects column1 from table_a and column2 from table_b for the rows where the id values match in both tables.
Joining tables is a powerful technique in SQL that allows you to combine data from multiple sources to enrich your dataset. Understanding the different types of joins (inner, left, right, full, cross, and self joins) and their use cases enables you to perform complex queries and derive meaningful insights from your data. Properly leveraging joins helps in creating comprehensive and accurate datasets for analysis and reporting.
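For example, a LEFT JOIN keeps every row from the first table even when there is no match in the second, which is often what you want when enriching data (same placeholder tables as above); unmatched rows simply get NULL in column2:
-- Keep all rows from table_a; b.column2 is NULL where no match exists
SELECT a.column1, b.column2
FROM table_a a
LEFT JOIN table_b b ON a.id = b.id;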
Removing Outliers
Removing outliers is an important step in data cleaning and preparation: outliers are data points that differ significantly from other observations. They can skew and mislead the training process of machine learning models and affect the accuracy of predictions and analysis. Outliers can be the result of natural variability in the data or of errors in data collection. Some of the most common techniques and SQL methods to identify and remove outliers are:
- Statistical Methods, such as Z-score or Interquartile Range (IQR)
- Visualization Techniques, such as box plot or scatter plot
- Domain Knowledge, which can help identify outliers that are not apparent through statistical methods alone; understanding the context of the data allows you to set logical bounds for what constitutes an outlier.
In the example below you delete sales where the amount is greater than 5000, treating these values as outliers.
DELETE FROM sales
WHERE amount > 5000;
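If you prefer a statistical cutoff over a fixed threshold, a minimal sketch flags amounts more than three standard deviations above the mean (a simple z-score style rule; AVG and STDDEV are available in both MySQL and PostgreSQL):
-- Flag potential outliers: amounts more than 3 standard deviations above the mean
SELECT *
FROM sales
WHERE amount > (
    SELECT AVG(amount) + 3 * STDDEV(amount)
    FROM sales
);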
In conclusion
Data cleaning is a vital part of the data analysis process, transforming raw data into a more usable and accurate format. SQL provides robust tools to efficiently handle large datasets, remove duplicates, handle missing values, standardize data, filter out unwanted records, correct data types, and join tables to enrich datasets. By applying these techniques, you ensure that your data is consistent, reliable, and ready for analysis. Removing outliers further refines the dataset, preventing skewed results and enhancing the accuracy of machine learning models and statistical analyses. Proper data cleaning lays the foundation for insightful and accurate data-driven decision-making.