How To Remove Duplicate Records in SQL
How To Remove Duplicate Records in SQL
com
Do you need to use SQL to remove duplicates in your tables? Learn how to write an SQL query
to remove duplicate data, and see the performance of each way you can do it, in this article.
The way you define duplicate data could be dependent on your data.
There are several ways to do it. I’ll explain each of these methods and show you how they
perform on a random sample of data.
The sample of data has 1,000,000 records in a single table, which looks like this:
Let’s say that a record is a duplicate if it contains the same first_name, last_name, and
created_date values.
In each of these examples I explain the code I am using, what it does, and show how long it takes
to run the DELETE statement.
However, the query times should only be taken as a guide, and may be different to the
performance you get:
Your table structure, unique data business rules, and number of records will be different
Your database version and specs will be different (I’m running Oracle 11g Express
Edition on my laptop)
Several other factors may impact performance such as query cache or background
processes on my computer.
So, I guess my point is that you might not get the same performance as I do. But, if you try one
method and it has poor performance for your data, try one of the other methods.
1. Using a subquery to find each ROWID (which is a unique number given to each row in an
Oracle table) and the ROW_NUMBER function to find a sequential number for that row,
grouped by the fields you specify as unique.
2. Find the ROWID values that are identified as duplicates.
3. Delete rows that match these ROWIDs.
The ROW_NUMBER function here is used as an analytic function. It uses the PARTITION BY
to create partitions or groups based on the fields I’ve mentioned in the PARTITION BY (which
are first_name, last_name, and created_date). I’ve then ordered them by the ROWID.
This means that the first occurrence of those fields will get the number of 1. The second
occurrence of all of those fields will get a number of 2, and so on.
The ROWIDs are then returned to the DELETE statement at the top, which only deletes records
where the ROW_NUMBER function (which has an alias of “dup” in this example) are greater
than one. (The AskTOM thread uses “WHERE dup <> 1” but it achieves the same thing).
It’s a good idea to check the records you’re deleting first, by running a SELECT statement using
this criteria. For the purposes of these examples, I’ll check the COUNT of the records about to be
deleted, by replacing the DELETE with a SELECT COUNT(*).
SELECT COUNT(*)
FROM all_customer_dupes a
WHERE a.ROWID IN
(SELECT ROWID FROM
(SELECT
ROWID,
ROW_NUMBER() OVER
(PARTITION BY first_name, last_name, created_date ORDER BY ROWID) dup
FROM all_customer_dupes)
WHERE dup > 1);
Result:
40,267
------------------------------------------------------------------------------
---------------------
| Id | Operation | Name | Rows | Bytes |
Cost (%CPU)| Time |
------------------------------------------------------------------------------
---------------------
| 0 | DELETE STATEMENT | | 1 | 150
| 1454 (1)| 00:00:18 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
| 2 | NESTED LOOPS | | 1 | 150
| 1454 (1)| 00:00:18 |
| 3 | VIEW | VW_NSO_1 | 1 | 12
| 1452 (1)| 00:00:18 |
| 4 | SORT UNIQUE | | 1 | 25
| | |
|* 5 | VIEW | | 1 | 25
| 1452 (1)| 00:00:18 |
| 6 | WINDOW SORT | | 1 | 125
| 1452 (1)| 00:00:18 |
| 7 | TABLE ACCESS FULL | ALL_CUSTOMER_DUPES | 1 | 125
| 1451 (1)| 00:00:18 |
| 8 | TABLE ACCESS BY USER ROWID| ALL_CUSTOMER_DUPES | 1 | 138
| 1 (0)| 00:00:01 |
------------------------------------------------------------------------------
---------------------
This is the method I would use if I needed to delete duplicate records from a table. It uses a
ROWID filter which is usually a fast way to access a table.
I’ll show you the query first, then explain how it works.
Both the subquery and the outer query should be selecting from the same table, but use different
table aliases. In this example I have used a and b.
The WHERE clause in the subquery is used to define how you identify a duplicate. You write
your WHERE clause that joins on every column that you consider when you identify duplicates.
This could be one field, two fields, or every field in the table.
The rowid pseudocolumn is used to identify rows in the subquery because this is guaranteed to be
unique.
The WHERE clause of the outer query uses a > ANY condition to check for duplicates. It will
delete any row that has a rowid greater than at least one other row. This ensures that all but one of
the rows that match your conditions is met, therefore removing all duplicates.
DELETE
FROM all_customer_dupes a
WHERE a.rowid > ANY (
SELECT b.rowid
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);
If we run this as a SELECT COUNT(*), we can see the number of records impacted:
SELECT COUNT(*)
FROM all_customer_dupes a
WHERE a.rowid > ANY (
SELECT b.rowid
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);
Result:
This query has found 40,267 duplicate values. It’s the same as Example 1.
DELETE
FROM all_customer_dupes a
WHERE a.rowid > ANY (
SELECT b.rowid
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);
Method 3 – DENSE_RANK
This method is similar to method 1 but uses the DENSE_RANK method.
It uses the same concept as ROW_NUMBER, but uses the DENSE_RANK function. There are
some differences between using these functions, but in this case, they give the same output.
40,267 rows.
Result:
This deletes all of the duplicate records we found. It runs in 13.809 seconds.
In this method, you can also use a RANK function instead of DENSE_RANK. It should show the
same results.
------------------------------------------------------------------------------
---------------------
| Id | Operation | Name | Rows | Bytes |
Cost (%CPU)| Time |
------------------------------------------------------------------------------
---------------------
| 0 | DELETE STATEMENT | | 1 | 150
| 1454 (1)| 00:00:18 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
| 2 | NESTED LOOPS | | 1 | 150
| 1454 (1)| 00:00:18 |
| 3 | VIEW | VW_NSO_1 | 1 | 12
| 1452 (1)| 00:00:18 |
| 4 | SORT UNIQUE | | 1 | 25
| | |
|* 5 | VIEW | | 1 | 25
| 1452 (1)| 00:00:18 |
| 6 | WINDOW SORT | | 1 | 125
| 1452 (1)| 00:00:18 |
| 7 | TABLE ACCESS FULL | ALL_CUSTOMER_DUPES | 1 | 125
| 1451 (1)| 00:00:18 |
| 8 | TABLE ACCESS BY USER ROWID| ALL_CUSTOMER_DUPES | 1 | 138
| 1 (0)| 00:00:01 |
------------------------------------------------------------------------------
---------------------
This example here uses the MIN function, but you can use the MAX function to get the same
result.
The columns in the GROUP BY statement are those that are checked for duplicates. In the
example I’ve been using, this is the first_name, last_name, and created_date.
If I run it as a SELECT COUNT(*) first, I can see the number of rows impacted.
SELECT COUNT(*)
FROM all_customer_dupes
WHERE rowid NOT IN (
SELECT MIN(rowid)
FROM all_customer_dupes
GROUP BY first_name, last_name, created_date
);
Result:
40,267 rows.
Result:
40,267 rows deleted.
This query ran in 22.985 seconds. It’s a little slower than the other versions but it could just be
the data that I am using.
------------------------------------------------------------------------------
--------------
| Id | Operation | Name | Rows | Bytes | Cost
(%CPU)| Time |
------------------------------------------------------------------------------
--------------
| 0 | DELETE STATEMENT | | 1 | 150
| 2903 (1)| 00:00:35 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
|* 2 | HASH JOIN ANTI | | 1 | 150
| 2903 (1)| 00:00:35 |
| 3 | TABLE ACCESS FULL | ALL_CUSTOMER_DUPES | 1 | 138
| 1451 (1)| 00:00:18 |
| 4 | VIEW | VW_NSO_1 | 1 | 12
| 1452 (1)| 00:00:18 |
| 5 | SORT GROUP BY | | 1 | 125
| 1452 (1)| 00:00:18 |
| 6 | TABLE ACCESS FULL| ALL_CUSTOMER_DUPES | 1 | 125
| 1451 (1)| 00:00:18 |
------------------------------------------------------------------------------
--------------
DELETE
FROM table a
WHERE ROWID NOT IN (
SELECT MAX(ROWID)
FROM table b
WHERE a.col1 = b.col1
AND a.col2 = b.col2
AND a.col3 = b.col3
);
It’s similar to the earlier query, but instead of using a GROUP BY clause, we use a WHERE
clause. This WHERE clause joins the table inside the subquery to the table outside the subquery.
The tables are joined on the columns that match.
SELECT COUNT(*)
FROM all_customer_dupes a
WHERE ROWID NOT IN (
SELECT MAX(ROWID)
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);
Result:
40,267 rows.
DELETE
FROM all_customer_dupes a
WHERE ROWID NOT IN (
SELECT MAX(ROWID)
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);
Result:
------------------------------------------------------------------------------
--------------
| Id | Operation | Name | Rows | Bytes | Cost
(%CPU)| Time |
------------------------------------------------------------------------------
--------------
| 0 | DELETE STATEMENT | | 1 | 263
| 1454 (1)| 00:00:18 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
| 2 | NESTED LOOPS | | 1 | 263
| 1454 (1)| 00:00:18 |
| 3 | VIEW | VW_SQ_1 | 1 | 125
| 1452 (1)| 00:00:18 |
| 4 | SORT GROUP BY | | 1 | 125
| 1452 (1)| 00:00:18 |
| 5 | TABLE ACCESS FULL| ALL_CUSTOMER_DUPES | 1 | 125
| 1451 (1)| 00:00:18 |
|* 6 | INDEX RANGE SCAN | IDX_CSTDUP_FNLNCD | 1 | 138
| 2 (0)| 00:00:01 |
------------------------------------------------------------------------------
--------------
I’ve added these so you can see if specific methods work, for example if someone at work
mentions a method they have tried. Or, if you’ve found a method that works on SQL Server and
want to try it on an Oracle database.
So, here are the other methods I found to delete duplicate data.
Just to be clear, these don’t work, and they are just here for reference.
This method uses a WHERE clause with specified columns which are checked against a
subquery.
It seems very similar to method 4, with the addition of the three columns instead of using
ROWID.
I ran this query for a few minutes but then stopped it without getting a result. I imagine it runs
slowly because of the WHERE clause on the three fields, rather than the ROWID mentioned
earlier.
This query may work for you if one of the fields you’re looking at is an ID or some kind of
number (to get the MAX value of), and if you have indexes on these fields.
This is a method that may work in SQL Server, but does not work in Oracle.
It uses a LEFT OUTER JOIN from the table that has the records to delete, to a subquery that
contains the duplicate records.
DELETE (
SELECT d.*, d.rowid
FROM all_customer_dupes d
LEFT OUTER JOIN (
SELECT MIN(RowId) AS MinRowId,
first_name, last_name, created_date
FROM all_customer_dupes
GROUP BY first_name, last_name, created_date
) KeepRows ON d.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL;
SELECT COUNT(*)
FROM all_customer_dupes
LEFT OUTER JOIN (
SELECT MIN(RowId) AS MinRowId,
first_name, last_name, created_date
FROM all_customer_dupes
GROUP BY first_name, last_name, created_date
) KeepRows ON all_customer_dupes.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL;
Now, let’s delete the records by running the DELETE command above.
DELETE (
SELECT d.*, d.rowid
FROM all_customer_dupes d
LEFT OUTER JOIN (
SELECT MIN(RowId) AS MinRowId,
first_name, last_name, created_date
FROM all_customer_dupes
GROUP BY first_name, last_name, created_date
) KeepRows ON d.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL
Result:
SQL Error: ORA-01752: cannot delete from view without exactly one key-
preserved table
01752. 00000 - "cannot delete from view without exactly one key-preserved
table"
*Cause: The deleted table had
- no key-preserved tables,
- more than one key-preserved table, or
- the key-preserved table was an unmerged view.
*Action: Redefine the view or delete it from the underlying base tables.
This query did not work for me and showed this error. I suspect this method works on SQL
Server but not on Oracle.
Method 8 – DELETE with WITH Clause
This method is also based on a concept that works in SQL Server called CTE or Common Table
Expressions. They work very much the same as Subquery Refactoring, or the WITH clause, in
Oracle.
WITH cte AS
(SELECT ROW_NUMBER()
OVER (PARTITION BY first_name, last_name, created_date
ORDER BY (customer_id)) AS RN
FROM all_customer_dupes)
DELETE FROM cte
SELECT COUNT(*) FROM cte
WHERE rn > 1;
I’ve used the same logic as earlier queries, and when I run this query I get this:
This is because, in Oracle, the WITH clause needs to have a SELECT query that follows it, and
not a DELETE query. It doesn’t work with DELETE statements.
This method uses two tables joined to each other on the matching columns, and a ROWID check
to only find those that are unique. I’ve specified the table to delete after the DELETE keyword so
Oracle will know which table I want to delete from.
DELETE (
SELECT t1.rowid
FROM all_customer_dupes t1
INNER JOIN all_customer_dupes t2
ON t1.first_name = t2.first_name
AND t1.last_name = t2.last_name
AND t1.created_date = t2.created_date
AND t1.rowid > t2.rowid
);
SELECT COUNT(t1.customer_id)
FROM all_customer_dupes t1
INNER JOIN all_customer_dupes t2
ON t1.first_name = t2.first_name
AND t1.last_name = t2.last_name
AND t1.created_date = t2.created_date
AND t1.rowid > t2.rowid;
Records:
40,267 rows.
It shows the right amount of rows. Now, let’s try to delete them.
DELETE (
SELECT t1.rowid
FROM all_customer_dupes t1
INNER JOIN all_customer_dupes t2
ON t1.first_name = t2.first_name
AND t1.last_name = t2.last_name
AND t1.created_date = t2.created_date
AND t1.rowid > t2.rowid
);
Result:
SQL Error: ORA-01752: cannot delete from view without exactly one key-
preserved table
01752. 00000 - "cannot delete from view without exactly one key-preserved
table"
*Cause: The deleted table had
- no key-preserved tables,
- more than one key-preserved table, or
- the key-preserved table was an unmerged view.
*Action: Redefine the view or delete it from the underlying base tables.
This happens because Oracle does not support two tables in a DELETE statement. Even though
we have specified which one we want to delete from, it’s not allowed in Oracle.
This method is similar to earlier methods that work, but uses an inline view instead of an actual
table.
DELETE
FROM (SELECT c.*,
ROW_NUMBER()
OVER (PARTITION BY first_name, last_name, created_date
ORDER BY ROWID DESC) AS RN
FROM all_customer_dupes c
) sub
WHERE rn > 1;
Instead of saying DELETE FROM all_customer_dupes, I am deleting from the subquery itself.
Let’s try to COUNT the number of records.
SELECT COUNT(*)
FROM (SELECT c.*,
ROW_NUMBER()
OVER (PARTITION BY first_name, last_name, created_date
ORDER BY ROWID DESC) AS RN
FROM all_customer_dupes c
) sub
WHERE rn > 1;
Result:
40,267 rows.
DELETE
FROM (SELECT c.*,
ROW_NUMBER()
OVER (PARTITION BY first_name, last_name, created_date
ORDER BY ROWID DESC) AS RN
FROM all_customer_dupes c
) sub
WHERE rn > 1;
Result:
Error report –
SQL Error: ORA-01732: data manipulation operation not legal on this view
01732. 00000 - "data manipulation operation not legal on this view"
*Cause:
*Action:
This error occurs because Oracle does not allow deletions on inline views like this. You must
delete from an actual table.
This method is another one I found online that does not work:
This query looks like it uses a RANK function as an analytic function with a primary key like
earlier examples, but it adds on a subquery in the WHERE clause. This subquery causes an error,
because the GROUP BY cannot be used with columns that are not in the SELECT clause.
So, this function does not even run as it has syntax errors.
Conclusion
In conclusion, there are a few ways to delete duplicate records from your table with SQL:
Method 1 was the best performing method for me, followed by method 3.