How To Remove Duplicate Records in SQL

databasestar.
com
How to Remove Duplicate Records in SQL
Do you need to use SQL to remove duplicates in your tables? Learn how to write an SQL query
to remove duplicate data, and see the performance of each way you can do it, in this article.
The Problem – Removing Duplicates in SQL

Let’s say you have a table with some data in it. You’ve found out that there is some duplicate
data in this table. And you want to get rid of the duplicates.
The way you define duplicate data could be dependent on your data.
 Is it a duplicate if all of the columns are the same?

 Is it a duplicate if all columns except for the primary key are the same?
 Is it a duplicate if only a few columns are the same?
In any case, identifying and removing duplicates is possible in Oracle SQL.
There are several ways to do it. I’ll explain each of these methods and show you how they
perform on a random sample of data.
The sample of data has 1,000,000 records in a single table, which looks like this:
CUSTOMER FIRST_ LAST_ MAX_ CREATED_ UPDATED_ UPDATED_

_ID NAME NAME CREDIT DATE DATE COUNT
1 Carolyn Hudson 6567 13/Jun/17 (null) 0
2 Judy Ward 2535 15/Dec/16 (null) 0
3 James Medina 2738 26/May/17 (null) 0
4 Peter Greene 270 03/Dec/16 (null) 0
5 Sean Ortiz 2452 17/May/17 (null) 0
Let’s say that a record is a duplicate if it contains the same first_name, last_name, and
created_date values.
Let’s take a look at the different ways to remove duplicates in SQL.
Get Your SQL Cheat Sheet

Three pages of SQL commands to save/print for easy reference
A Note on Query Times
In each of these examples I explain the code I am using, what it does, and show how long it takes
to run the DELETE statement.
However, the query times should only be taken as a guide, and may be different to the
performance you get:
 Your table structure, unique data business rules, and number of records will be different
 Your database version and specs will be different (I’m running Oracle 11g Express
Edition on my laptop)
 Several other factors may impact performance such as query cache or background
processes on my computer.
So, I guess my point is that you might not get the same performance as I do. But, if you try one
method and it has poor performance for your data, try one of the other methods.
Method 1 – ROW_NUMBER Analytic Function

The first method I’ll show you is using an analytic function called ROW_NUMBER. It’s been
recommended in several places such as StackOverflow questions and an AskTOM thread.
It involves several steps:
1. Using a subquery to find each ROWID (which is a unique number given to each row in an
Oracle table) and the ROW_NUMBER function to find a sequential number for that row,
grouped by the fields you specify as unique.
2. Find the ROWID values that are identified as duplicates.
3. Delete rows that match these ROWIDs.
The query looks like this:
DELETE FROM table a

WHERE a.ROWID IN
(SELECT ROWID FROM
(SELECT
ROWID,
ROW_NUMBER() OVER
(PARTITION BY unique_columns ORDER BY ROWID) dup
FROM table)
WHERE dup > 1);
The ROW_NUMBER function here is used as an analytic function. It uses the PARTITION BY
to create partitions or groups based on the fields I’ve mentioned in the PARTITION BY (which
are first_name, last_name, and created_date). I’ve then ordered them by the ROWID.
This means that the first occurrence of those fields will get the number of 1. The second
occurrence of all of those fields will get a number of 2, and so on.
The ROWIDs are then returned to the DELETE statement at the top, which only deletes records
where the ROW_NUMBER function (which has an alias of “dup” in this example) are greater
than one. (The AskTOM thread uses “WHERE dup <> 1” but it achieves the same thing).
It’s a good idea to check the records you’re deleting first, by running a SELECT statement using
this criteria. For the purposes of these examples, I’ll check the COUNT of the records about to be
deleted, by replacing the DELETE with a SELECT COUNT(*).
SELECT COUNT(*)
FROM all_customer_dupes a
WHERE a.ROWID IN
(SELECT ROWID FROM
(SELECT
ROWID,
ROW_NUMBER() OVER
(PARTITION BY first_name, last_name, created_date ORDER BY ROWID) dup
FROM all_customer_dupes)
WHERE dup > 1);
Result:
40,267
It shows there are 40,267 duplicate records.
If I run this command as a DELETE statement:
DELETE FROM all_customer_dupes a

WHERE a.ROWID IN
(SELECT ROWID FROM
(SELECT
ROWID,
ROW_NUMBER() OVER
(PARTITION BY first_name, last_name, created_date ORDER BY ROWID) dup
WHERE dup > 1);
40,267 records deleted.
This statement runs in 12.73 seconds.
The explain plan looks like this:
------------------------------------------------------------------------------
---------------------
| Id | Operation | Name | Rows | Bytes |
Cost (%CPU)| Time |
------------------------------------------------------------------------------
---------------------
| 0 | DELETE STATEMENT | | 1 | 150
| 1454 (1)| 00:00:18 |
| 1 | DELETE | ALL_CUSTOMER_DUPES
| | | | |
| 2 | NESTED LOOPS | | 1 | 150
| 1454 (1)| 00:00:18 |
| 3 | VIEW | VW_NSO_1 | 1 | 12
| 1452 (1)| 00:00:18 |
| 4 | SORT UNIQUE | | 1 | 25
| | |
|* 5 | VIEW | | 1 | 25
| 1452 (1)| 00:00:18 |
| 6 | WINDOW SORT | | 1 | 125
| 1452 (1)| 00:00:18 |
| 7 | TABLE ACCESS FULL | ALL_CUSTOMER_DUPES | 1 | 125
| 1451 (1)| 00:00:18 |
| 8 | TABLE ACCESS BY USER ROWID| ALL_CUSTOMER_DUPES | 1 | 138
| 1 (0)| 00:00:01 |
------------------------------------------------------------------------------
---------------------
This is the method I would use if I needed to delete duplicate records from a table. It uses a
ROWID filter which is usually a fast way to access a table.
Method 2: Use a Subquery with ANY

The next method we’ll look at is using a subquery to identify and delete duplicate data.
I’ll show you the query first, then explain how it works.
DELETE FROM tablename a

WHERE a.rowid > ANY (
SELECT b.rowid
FROM tablename b
WHERE a.column1 = b.column1
);
Now, there are a few things to note about this query.
Both the subquery and the outer query should be selecting from the same table, but use different
table aliases. In this example I have used a and b.
The WHERE clause in the subquery is used to define how you identify a duplicate. You write
your WHERE clause that joins on every column that you consider when you identify duplicates.
This could be one field, two fields, or every field in the table.
The rowid pseudocolumn is used to identify rows in the subquery because this is guaranteed to be
unique.
The WHERE clause of the outer query uses a > ANY condition to check for duplicates. It will
delete any row that has a rowid greater than at least one other row. This ensures that all but one of
the rows that match your conditions is met, therefore removing all duplicates.
So, how does this method perform?
Using this query on our sample data looks like this:
DELETE
SELECT b.rowid
FROM all_customer_dupes b
WHERE a.first_name = b.first_name
AND a.last_name = b.last_name
AND a.created_date = b.created_date
);
If we run this as a SELECT COUNT(*), we can see the number of records impacted:
SELECT COUNT(*)
SELECT b.rowid
);
Result:
This query has found 40,267 duplicate values. It’s the same as Example 1.
Now, let’s run it as a DELETE query.
DELETE
SELECT b.rowid
);
40,267 rows deleted.
This query ran in 16.76 seconds.
The explain plan for the query looks like this:

------------------------------------------------------------------------------
------------
| Id | Operation | Name | Rows | Bytes | Cost
(%CPU)| Time |
------------------------------------------------------------------------------
------------
| 1453 (1)| 00:00:18 |
| | | | |
| 2 | NESTED LOOPS SEMI | | 1 | 263
| 1453 (1)| 00:00:18 |
| 3 | TABLE ACCESS FULL| ALL_CUSTOMER_DUPES | 1 | 138
| 1451 (1)| 00:00:18 |
|* 4 | INDEX RANGE SCAN | IDX_CSTDUP_FNLNCD | 1 | 125
| 2 (0)| 00:00:01 |
------------------------------------------------------------------------------
------------
Method 3 – DENSE_RANK
This method is similar to method 1 but uses the DENSE_RANK method.
In fact, it’s almost exactly the same.
The syntax looks like this:
DELETE FROM table a

WHERE a.ROWID IN (
SELECT rid FROM (
SELECT ROWID rid,
DENSE_RANK() OVER(PARTITION BY unique_columns ORDER BY ROWID) dup
FROM table
)
WHERE dup > 1
);
It uses the same concept as ROW_NUMBER, but uses the DENSE_RANK function. There are
some differences between using these functions, but in this case, they give the same output.
Let’s find the COUNT(*) of records first.
SELECT COUNT(*) FROM all_customer_dupes

WHERE ROWID IN (
SELECT rid FROM (
SELECT ROWID rid,
DENSE_RANK() OVER(
PARTITION BY first_name, last_name, created_date ORDER BY ROWID) dup
FROM all_customer_dupes
)
WHERE dup > 1
);
Result:
40,267 rows.
This shows the same number of records as method 1 and 2.
Now, let’s try the DELETE statement.
DELETE FROM all_customer_dupes

WHERE ROWID IN (
SELECT rid FROM (
SELECT ROWID rid,
DENSE_RANK() OVER(
PARTITION BY first_name, last_name, created_date ORDER BY ROWID) dup
)
WHERE dup > 1
);
Result:
This deletes all of the duplicate records we found. It runs in 13.809 seconds.
In this method, you can also use a RANK function instead of DENSE_RANK. It should show the
same results.
------------------------------------------------------------------------------
---------------------
| Id | Operation | Name | Rows | Bytes |
Cost (%CPU)| Time |
------------------------------------------------------------------------------
---------------------
| 1454 (1)| 00:00:18 |
| | | | |
| 2 | NESTED LOOPS | | 1 | 150
| 1454 (1)| 00:00:18 |
| 3 | VIEW | VW_NSO_1 | 1 | 12
| 1452 (1)| 00:00:18 |
| 4 | SORT UNIQUE | | 1 | 25
| | |
|* 5 | VIEW | | 1 | 25
| 1452 (1)| 00:00:18 |
| 6 | WINDOW SORT | | 1 | 125
| 1452 (1)| 00:00:18 |
| 1451 (1)| 00:00:18 |
| 8 | TABLE ACCESS BY USER ROWID| ALL_CUSTOMER_DUPES | 1 | 138
| 1 (0)| 00:00:01 |
------------------------------------------------------------------------------
---------------------
Method 4 – MIN or MAX Function

This method uses either the MIN or MAX function to find duplicates inside a subquery.
It’s similar to earlier examples, but it uses less subqueries.
This example here uses the MIN function, but you can use the MAX function to get the same
result.
The syntax looks like this:
DELETE FROM your_table

WHERE rowid not in (
SELECT MIN(rowid)
FROM your_table
GROUP BY column1, column2, column3
);
The columns in the GROUP BY statement are those that are checked for duplicates. In the
example I’ve been using, this is the first_name, last_name, and created_date.
If I run it as a SELECT COUNT(*) first, I can see the number of rows impacted.
SELECT COUNT(*)
WHERE rowid NOT IN (
SELECT MIN(rowid)
GROUP BY first_name, last_name, created_date
);
Result:
40,267 rows.
Now, I can run the DELETE statement.
DELETE FROM all_customer_dupes

WHERE rowid NOT IN (
SELECT MIN(rowid)
);
Result:
This query ran in 22.985 seconds. It’s a little slower than the other versions but it could just be
the data that I am using.
------------------------------------------------------------------------------
--------------
(%CPU)| Time |
------------------------------------------------------------------------------
--------------
| 2903 (1)| 00:00:35 |
| | | | |
|* 2 | HASH JOIN ANTI | | 1 | 150
| 2903 (1)| 00:00:35 |
| 1451 (1)| 00:00:18 |
| 4 | VIEW | VW_NSO_1 | 1 | 12
| 1452 (1)| 00:00:18 |
| 5 | SORT GROUP BY | | 1 | 125
| 1452 (1)| 00:00:18 |
| 1451 (1)| 00:00:18 |
------------------------------------------------------------------------------
--------------
Method 5 – Correlated Subquery with MIN or MAX

The next method I’ll share is similar to method 4, but uses a correlated subquery to match on
columns.
DELETE
FROM table a
WHERE ROWID NOT IN (
SELECT MAX(ROWID)
FROM table b
WHERE a.col1 = b.col1
AND a.col2 = b.col2
AND a.col3 = b.col3
);
It’s similar to the earlier query, but instead of using a GROUP BY clause, we use a WHERE
clause. This WHERE clause joins the table inside the subquery to the table outside the subquery.
The tables are joined on the columns that match.
Let’s see how many records are found.
SELECT COUNT(*)
SELECT MAX(ROWID)
);
Result:
40,267 rows.
Now, let’s run this as a DELETE query:
DELETE
SELECT MAX(ROWID)
);
Result:
This query took 18.141 seconds to run.
------------------------------------------------------------------------------
--------------
(%CPU)| Time |
------------------------------------------------------------------------------
--------------
| 1454 (1)| 00:00:18 |
| | | | |
| 2 | NESTED LOOPS | | 1 | 263
| 1454 (1)| 00:00:18 |
| 3 | VIEW | VW_SQ_1 | 1 | 125
| 1452 (1)| 00:00:18 |
| 4 | SORT GROUP BY | | 1 | 125
| 1452 (1)| 00:00:18 |
| 1451 (1)| 00:00:18 |
|* 6 | INDEX RANGE SCAN | IDX_CSTDUP_FNLNCD | 1 | 138
| 2 (0)| 00:00:01 |
------------------------------------------------------------------------------
--------------
Other Methods You Might Come Across

I’ve only found five different ways for deleting duplicate data in SQL. I’ve listed a few more
below, which don’t actually work as well as the others, if at all.
I’ve added these so you can see if specific methods work, for example if someone at work
mentions a method they have tried. Or, if you’ve found a method that works on SQL Server and
want to try it on an Oracle database.
So, here are the other methods I found to delete duplicate data.
Just to be clear, these don’t work, and they are just here for reference.
Method 6 – DELETE WHERE IN
This method uses a WHERE clause with specified columns which are checked against a
subquery.
It seems very similar to method 4, with the addition of the three columns instead of using
ROWID.
The method looks like this:
DELETE FROM table WHERE

(col1, col2, col3) IN (
SELECT MAX(col1), col2, col3
FROM table
GROUP BY col2, col3
);
I ran this query for a few minutes but then stopped it without getting a result. I imagine it runs
slowly because of the WHERE clause on the three fields, rather than the ROWID mentioned
earlier.
This query may work for you if one of the fields you’re looking at is an ID or some kind of
number (to get the MAX value of), and if you have indexes on these fields.
But, for me, this query did not work.
Method 7 – DELETE with LEFT OUTER JOIN
This is a method that may work in SQL Server, but does not work in Oracle.
It uses a LEFT OUTER JOIN from the table that has the records to delete, to a subquery that
contains the duplicate records.
DELETE (
SELECT d.*, d.rowid
FROM all_customer_dupes d
LEFT OUTER JOIN (
SELECT MIN(RowId) AS MinRowId,
first_name, last_name, created_date
) KeepRows ON d.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL;
Let’s check the COUNT first:
SELECT COUNT(*)
LEFT OUTER JOIN (
) KeepRows ON all_customer_dupes.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL;
Now, let’s delete the records by running the DELETE command above.
DELETE (
SELECT d.*, d.rowid
FROM all_customer_dupes d
LEFT OUTER JOIN (
) KeepRows ON d.RowId = KeepRows.MinRowId
WHERE KeepRows.MinRowId IS NULL
Result:
SQL Error: ORA-01752: cannot delete from view without exactly one key-
preserved table
01752. 00000 - "cannot delete from view without exactly one key-preserved
table"
*Cause: The deleted table had
- no key-preserved tables,
- more than one key-preserved table, or
- the key-preserved table was an unmerged view.
*Action: Redefine the view or delete it from the underlying base tables.
This query did not work for me and showed this error. I suspect this method works on SQL
Server but not on Oracle.
Method 8 – DELETE with WITH Clause
This method is also based on a concept that works in SQL Server called CTE or Common Table
Expressions. They work very much the same as Subquery Refactoring, or the WITH clause, in
Oracle.
The query looks like this:
WITH cte AS
(SELECT ROW_NUMBER()
OVER (PARTITION BY first_name, last_name, created_date
ORDER BY (customer_id)) AS RN
DELETE FROM cte
SELECT COUNT(*) FROM cte
WHERE rn > 1;
I’ve used the same logic as earlier queries, and when I run this query I get this:
ORA-00928: missing SELECT keyword

00928. 00000 - "missing SELECT keyword"
*Cause:
*Action:
This is because, in Oracle, the WITH clause needs to have a SELECT query that follows it, and
not a DELETE query. It doesn’t work with DELETE statements.
Method 9 – Two Tables Joined
This method uses two tables joined to each other on the matching columns, and a ROWID check
to only find those that are unique. I’ve specified the table to delete after the DELETE keyword so
Oracle will know which table I want to delete from.
DELETE (
SELECT t1.rowid
FROM all_customer_dupes t1
INNER JOIN all_customer_dupes t2
ON t1.first_name = t2.first_name
AND t1.last_name = t2.last_name
AND t1.created_date = t2.created_date
AND t1.rowid > t2.rowid
);
Let’s try getting a count first:
SELECT COUNT(t1.customer_id)
AND t1.rowid > t2.rowid;
Records:
40,267 rows.
It shows the right amount of rows. Now, let’s try to delete them.
DELETE (
SELECT t1.rowid
AND t1.rowid > t2.rowid
);
Result:
SQL Error: ORA-01752: cannot delete from view without exactly one key-
preserved table
01752. 00000 - "cannot delete from view without exactly one key-preserved
table"
*Cause: The deleted table had
- no key-preserved tables,
- more than one key-preserved table, or
- the key-preserved table was an unmerged view.
*Action: Redefine the view or delete it from the underlying base tables.
It shows a similar error as the earlier method.
This happens because Oracle does not support two tables in a DELETE statement. Even though
we have specified which one we want to delete from, it’s not allowed in Oracle.
Method 10 – Inline View and Analytic Function
This method is similar to earlier methods that work, but uses an inline view instead of an actual
table.
DELETE
FROM (SELECT c.*,
ROW_NUMBER()
ORDER BY ROWID DESC) AS RN
FROM all_customer_dupes c
) sub
WHERE rn > 1;
Instead of saying DELETE FROM all_customer_dupes, I am deleting from the subquery itself.
Let’s try to COUNT the number of records.
SELECT COUNT(*)
FROM (SELECT c.*,
ROW_NUMBER()
) sub
WHERE rn > 1;
Result:
40,267 rows.
The count is correct. Now, let’s try a delete.
DELETE
FROM (SELECT c.*,
ROW_NUMBER()
) sub
WHERE rn > 1;
Result:
Error report –
SQL Error: ORA-01732: data manipulation operation not legal on this view
01732. 00000 - "data manipulation operation not legal on this view"
*Cause:
*Action:
This error occurs because Oracle does not allow deletions on inline views like this. You must
delete from an actual table.
Method 11 – Use RANK and Primary Key
This method is another one I found online that does not work:
DELETE FROM table

WHERE ROWID IN (
SELECT "rowid"
FROM (
SELECT "rowid", rank_n
FROM (
SELECT RANK() OVER (
PARTITION BY unique_columns ORDER BY rowid) rank_n,
rowid as "rowid"
FROM table
WHERE primary_key IN (
SELECT primary_key
FROM table
GROUP BY all_columns
HAVING COUNT(*) > 1
)
)
)
WHERE rank_n > 1
);
This query looks like it uses a RANK function as an analytic function with a primary key like
earlier examples, but it adds on a subquery in the WHERE clause. This subquery causes an error,
because the GROUP BY cannot be used with columns that are not in the SELECT clause.
So, this function does not even run as it has syntax errors.
Conclusion
In conclusion, there are a few ways to delete duplicate records from your table with SQL:
Method Name Run Time (s) Cost

1 ROW_NUMBER Analytic Function 12.730 1454
2 Subquery with Any 27.397 1453
3 DENSE_RANK 13.809 1454
4 MIN or MAX Function 22.985 2903
5 Correlated Subquery with MIN or MAX 18.141 1454
Method 1 was the best performing method for me, followed by method 3.

How To Remove Duplicate Records in SQL

Uploaded by

How To Remove Duplicate Records in SQL

Uploaded by

databasestar.

How to Remove Duplicate Records in SQL

The Problem – Removing Duplicates in SQL

 Is it a duplicate if all of the columns are the same?

In any case, identifying and removing duplicates is possible in Oracle SQL.

CUSTOMER FIRST_ LAST_ MAX_ CREATED_ UPDATED_ UPDATED_

Let’s take a look at the different ways to remove duplicates in SQL.

Get Your SQL Cheat Sheet

Method 1 – ROW_NUMBER Analytic Function

It involves several steps:

The query looks like this:

DELETE FROM table a

It shows there are 40,267 duplicate records.

If I run this command as a DELETE statement:

DELETE FROM all_customer_dupes a

40,267 records deleted.

This statement runs in 12.73 seconds.

The explain plan looks like this:

Method 2: Use a Subquery with ANY

DELETE FROM tablename a

Now, there are a few things to note about this query.

So, how does this method perform?

Using this query on our sample data looks like this:

Now, let’s run it as a DELETE query.

40,267 rows deleted.

This query ran in 16.76 seconds.

The explain plan for the query looks like this:

In fact, it’s almost exactly the same.

The syntax looks like this:

DELETE FROM table a

Let’s find the COUNT(*) of records first.

SELECT COUNT(*) FROM all_customer_dupes

This shows the same number of records as method 1 and 2.

Now, let’s try the DELETE statement.

DELETE FROM all_customer_dupes

40,267 rows deleted.

The explain plan looks like this:

Method 4 – MIN or MAX Function

It’s similar to earlier examples, but it uses less subqueries.

The syntax looks like this:

DELETE FROM your_table

Now, I can run the DELETE statement.

DELETE FROM all_customer_dupes

The explain plan looks like this:

Method 5 – Correlated Subquery with MIN or MAX

Let’s see how many records are found.

Now, let’s run this as a DELETE query:

40,267 rows deleted.

This query took 18.141 seconds to run.

The explain plan looks like this:

Other Methods You Might Come Across

Method 6 – DELETE WHERE IN

The method looks like this:

DELETE FROM table WHERE

But, for me, this query did not work.

Method 7 – DELETE with LEFT OUTER JOIN

Let’s check the COUNT first:

The query looks like this:

ORA-00928: missing SELECT keyword

Method 9 – Two Tables Joined

Let’s try getting a count first:

It shows a similar error as the earlier method.

Method 10 – Inline View and Analytic Function

The count is correct. Now, let’s try a delete.

Method 11 – Use RANK and Primary Key

DELETE FROM table

Method Name Run Time (s) Cost

You might also like