Deep Tech Point
first stop in your tech adventure

Removing Duplicates in MySQL

June 24, 2024 | AI

Removing duplicates in MySQL involves identifying and deleting rows with duplicate data to ensure the integrity and accuracy of your database. This article will serve as a step-by-step guide on how to remove duplicates in MySQL.

1. Identify Duplicates

First, determine which rows are duplicates based on specific criteria. For instance, if you have a table employees and you want to find duplicates based on the email column:

SELECT email, COUNT(*)
FROM employees
GROUP BY email
HAVING COUNT(*) > 1;
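
To see what this query returns, here is a minimal sketch that runs it against a handful of made-up rows. It assumes Python's built-in sqlite3 module as a stand-in for a MySQL server, so it can be run without any setup; the name column and the sample data are hypothetical, and the alias and ORDER BY were added only to make the output predictable.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

# Hypothetical sample data: ann@ appears three times, bob@ twice, cid@ once.
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Bob", "bob@example.com"),
        (3, "Ann B.", "ann@example.com"),
        (4, "Cid", "cid@example.com"),
        (5, "Bobby", "bob@example.com"),
        (6, "Annie", "ann@example.com"),
    ],
)

# The query from the article, with an alias and ORDER BY for stable output.
duplicate_counts = conn.execute(
    """
    SELECT email, COUNT(*) AS occurrences
    FROM employees
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
    """
).fetchall()

print(duplicate_counts)
# [('ann@example.com', 3), ('bob@example.com', 2)]
```

The email that occurs only once (cid@example.com) is filtered out by the HAVING clause, so the result lists only the values that need attention.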

2. View Duplicate Records

To view the full rows that are duplicates:

SELECT *
FROM employees e
WHERE (SELECT COUNT(*) FROM employees e2 WHERE e.email = e2.email) > 1;

The SELECT * FROM employees e part selects all columns from the employees table, which is aliased as e. The subquery in the WHERE clause, (SELECT COUNT(*) FROM employees e2 WHERE e.email = e2.email), counts the rows in the employees table (aliased as e2) that have the same email value as the current row of e.

e serves as the alias for the main query’s reference to the employees table, while e2 is the alias for the subquery’s reference to the employees table.

If the count is greater than 1, the current row's email appears more than once in the table. The WHERE clause keeps only the rows for which the subquery's count is greater than 1, so only rows with duplicate email values are selected.
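
The following sketch runs the same correlated subquery on a small made-up table, so we can see which full rows come back. As an assumption for the sake of a runnable example, Python's built-in sqlite3 module stands in for MySQL; the name column, the sample rows, and the ORDER BY are hypothetical additions.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Bob", "bob@example.com"),
        (3, "Ann B.", "ann@example.com"),
        (4, "Cid", "cid@example.com"),
    ],
)

# For each row of e, the subquery counts the rows sharing its email;
# only rows whose count exceeds 1 pass the WHERE filter.
duplicate_rows = conn.execute(
    """
    SELECT *
    FROM employees e
    WHERE (SELECT COUNT(*) FROM employees e2 WHERE e.email = e2.email) > 1
    ORDER BY e.id
    """
).fetchall()

for row in duplicate_rows:
    print(row)
# (1, 'Ann', 'ann@example.com')
# (3, 'Ann B.', 'ann@example.com')
```

Both copies of the duplicated email are returned, including the one we will probably want to keep, which makes this query handy for reviewing the data before deleting anything.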

3. Remove Duplicates

The process for removing duplicates in MySQL typically involves deleting rows while keeping one instance of each duplicate. This can be done with a self-join, a temporary table, or, in MySQL 8.0 and later, a common table expression.

Method 1: Using a Self-Join with DELETE

The query below deletes rows from the employees table that have duplicate email values, in this case keeping the row with the smallest id and removing the rest:

DELETE e1
FROM employees e1
INNER JOIN employees e2 
WHERE 
    e1.id > e2.id 
    AND e1.email = e2.email;

Let’s take a look at the code. DELETE e1 specifies that the matched rows of the employees table, aliased as e1, will be deleted. The next two lines, FROM employees e1 INNER JOIN employees e2, join the employees table to itself (a self-join) under the aliases e1 and e2. Because the join has no ON clause, MySQL treats it as a cross join and the WHERE clause does the matching.
In the WHERE clause, e1.email = e2.email pairs up rows that share an email, and e1.id > e2.id restricts each pair to the case where e1 has the larger id. Every row that has a same-email partner with a smaller id is therefore deleted, and only the row with the smallest id for each email survives.
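
Before running a DELETE like this, it can help to preview which rows the self-join matches. The sketch below does that with a SELECT that uses the same join conditions. It is only an illustration under a few assumptions: Python's built-in sqlite3 module stands in for MySQL (SQLite does not accept the multi-table DELETE e1 FROM ... syntax, so only the preview is shown), the conditions are written in an ON clause, and the sample rows are made up.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO employees (id, email) VALUES (?, ?)",
    [
        (1, "ann@example.com"),
        (2, "bob@example.com"),
        (3, "ann@example.com"),
        (4, "cid@example.com"),
        (5, "bob@example.com"),
        (6, "ann@example.com"),
    ],
)

# Same conditions as the DELETE: e1 is matched whenever another row (e2)
# has the same email and a smaller id. DISTINCT is needed because id 6
# matches twice (against id 1 and against id 3).
ids_to_delete = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT e1.id
        FROM employees e1
        INNER JOIN employees e2
            ON e1.email = e2.email
           AND e1.id > e2.id
        ORDER BY e1.id
        """
    )
]

all_ids = [row[0] for row in conn.execute("SELECT id FROM employees ORDER BY id")]
ids_to_keep = [i for i in all_ids if i not in ids_to_delete]

print(ids_to_delete)  # [3, 5, 6]
print(ids_to_keep)    # [1, 2, 4]
```

The surviving ids are the smallest id for each email, which matches what the DELETE is meant to leave behind.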

Method 2: Using Temporary Table

This method removes duplicates in three main steps:
1. Create a temporary table that holds one full row for each email.
2. Delete all records from the employees table.
3. Insert the unique rows back into the employees table, effectively removing duplicates.

Let’s have a look at them in more detail.

1. Create a Temporary Table

Let’s say we want to identify and keep only the unique rows based on the email column. We create a temporary table temp_table that holds the complete row with the minimum id for each email. The full rows have to be copied, not just their ids, because the employees table is emptied in the next step and the rows could no longer be read from it afterwards.

CREATE TEMPORARY TABLE temp_table AS
SELECT e.*
FROM employees e
JOIN (
    SELECT MIN(id) AS id
    FROM employees
    GROUP BY email
) first_ids ON e.id = first_ids.id;

2. Delete All Records from the Original Table

Clear the employees table to remove all existing records.

DELETE FROM employees;

3. Insert Unique Rows Back into the Original Table

In this step we repopulate the employees table with the unique rows saved in temp_table. Since temp_table was created from e.*, its columns match those of employees, so the rows can be copied back as they are.

INSERT INTO employees
SELECT *
FROM temp_table;
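
Here is a minimal end-to-end sketch of the temporary-table approach, in a variant that copies the full rows into the temporary table before the original table is cleared, so nothing is lost when employees is emptied. It assumes Python's built-in sqlite3 module as a stand-in for MySQL; the first_ids alias, the name column, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employees (id, name, email) VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Bob", "bob@example.com"),
        (3, "Ann B.", "ann@example.com"),
        (4, "Cid", "cid@example.com"),
        (5, "Bobby", "bob@example.com"),
    ],
)

# Step 1: save one complete row per email (the one with the smallest id).
conn.execute(
    """
    CREATE TEMPORARY TABLE temp_table AS
    SELECT e.*
    FROM employees e
    JOIN (
        SELECT MIN(id) AS id
        FROM employees
        GROUP BY email
    ) first_ids ON e.id = first_ids.id
    """
)

# Step 2: clear the original table.
conn.execute("DELETE FROM employees")

# Step 3: copy the saved rows back.
conn.execute("INSERT INTO employees SELECT * FROM temp_table")
conn.commit()

remaining = conn.execute("SELECT * FROM employees ORDER BY id").fetchall()
for row in remaining:
    print(row)
# (1, 'Ann', 'ann@example.com')
# (2, 'Bob', 'bob@example.com')
# (4, 'Cid', 'cid@example.com')
```

Because the original table is empty between steps 2 and 3, it is worth checking the contents of temp_table before running the DELETE on real data.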

4. Using Common Table Expressions (CTEs)

Common Table Expressions (CTEs) can be used to remove duplicates in MySQL 8.0 and later versions. This code uses a CTE to remove duplicate rows from the employees table based on the email column, keeping the row with the smallest id for each email.

WITH CTE AS (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) as row_num
    FROM employees
)
DELETE FROM employees
WHERE id IN (SELECT id FROM CTE WHERE row_num > 1);

The CTE assigns a row number to each row within its email group, ordered by id, so the row with the smallest id in each group gets row_num = 1. The DELETE statement then removes every row from employees whose row number is greater than 1, leaving only the row with row_num = 1 for each email.
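
To make the numbering visible, the sketch below first prints the row_num assigned to each row and then runs the CTE-based DELETE. It assumes Python's built-in sqlite3 module as a stand-in for MySQL 8.0 (window functions require SQLite 3.25 or later); the sample rows and the lowercase cte name are made up for the example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL 8.0 (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO employees (id, email) VALUES (?, ?)",
    [
        (1, "ann@example.com"),
        (2, "bob@example.com"),
        (3, "ann@example.com"),
        (4, "ann@example.com"),
    ],
)

# Preview: number the rows inside each email group, smallest id first.
numbered = conn.execute(
    """
    SELECT id, email,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS row_num
    FROM employees
    ORDER BY email, id
    """
).fetchall()
for row in numbered:
    print(row)
# (1, 'ann@example.com', 1)
# (3, 'ann@example.com', 2)
# (4, 'ann@example.com', 3)
# (2, 'bob@example.com', 1)

# Delete every row that is not the first one in its email group.
conn.execute(
    """
    WITH cte AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS row_num
        FROM employees
    )
    DELETE FROM employees
    WHERE id IN (SELECT id FROM cte WHERE row_num > 1)
    """
)
conn.commit()

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM employees ORDER BY id")]
print(remaining_ids)  # [1, 2]
```

Changing the ORDER BY inside the OVER clause changes which row is kept; for example, ORDER BY id DESC would keep the row with the largest id instead.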

5. Preventing Future Duplicates

To avoid duplicates in the future, consider adding a unique constraint or a unique index. This helps maintain data integrity and guarantees that, in our case, no two rows in the employees table can have the same email value. Existing duplicates have to be removed first, otherwise the statement fails with a duplicate-entry error. Keep in mind that a UNIQUE index in MySQL still allows multiple NULL values in the column.

ALTER TABLE employees
ADD CONSTRAINT unique_email UNIQUE (email);

ALTER TABLE employees indicates that the employees table is being modified. ADD CONSTRAINT unique_email adds a new constraint named unique_email, and UNIQUE (email) specifies that the email column must have unique values across all rows in the employees table.
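
The effect of such a constraint can be sketched as follows. This is an illustration under assumptions, not the exact MySQL behavior: Python's built-in sqlite3 module stands in for MySQL, and since SQLite does not support ALTER TABLE ... ADD CONSTRAINT, the example uses CREATE UNIQUE INDEX instead, a statement that MySQL accepts as well. The sample email is made up.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, email TEXT)")

# Swapped-in technique: a unique index instead of ALTER TABLE ... ADD CONSTRAINT.
conn.execute("CREATE UNIQUE INDEX unique_email ON employees (email)")

conn.execute("INSERT INTO employees (email) VALUES ('ann@example.com')")

# A second row with the same email violates the unique index.
try:
    conn.execute("INSERT INTO employees (email) VALUES ('ann@example.com')")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("Insert rejected:", exc)

row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(rejected, row_count)  # True 1
```

In MySQL the rejected insert would surface as a duplicate-entry error, which the application can catch and handle in a similar way.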

In conclusion

Removing duplicates in MySQL involves identifying the criteria for duplication, selecting the rows that meet these criteria, and deleting the redundant records while keeping exactly one instance of each duplicate. This can be achieved through various methods, including self-joins, temporary tables, and CTEs. Future duplicates can be prevented by adding unique constraints or indexes to the relevant columns.