SQL Duplicate Rows

SELECT

SELECT mycol,

COUNT(mycol)

FROM mytable

GROUP BY mycol

HAVING COUNT(mycol) > 1;

SELECT mycol1, COUNT(mycol1),

mycol2, COUNT(mycol2),

mycol3, COUNT(mycol3)

FROM mytable

GROUP BY mycol1,

mycol2,

mycol3

HAVING COUNT(mycol1) > 1

AND COUNT(mycol2) > 1

AND COUNT(mycol3) > 1;

DELETE

DELETE

FROM mytable

GROUP BY mycolumn

HAVING COUNT(*) > 1;

DELETE

FROM mytable m1

WHERE m1.ROWID > (SELECT MIN(m2.ROWID)

FROM mytable m2

WHERE m2.mycolumn = m1.mycolumn);

There are two problems with the approaches above...

  1. You do not have opportunity to review the duplicates

  2. The transaction may be huge on very large tables (leading to blocking and/or ORA-01555 errors)

To address these problems use a procedure like the one below...

Create a non-unique index on the columns in question.

This allows the query to scan just the index rather than the data blocks and makes the column set comparison more efficient.

CREATE INDEX myindex

ON mytable (mycolumn)

...

Create a mirror table to hold the duplicates, which includes the base table rowid as a column.

CREATE TABLE mymirror

AS (SELECT t.rowid row_id,

t.*

FROM mytable t

WHERE 1=2);

Populate the duplicate mirror table with the duplicates from the base table.

All duplicates except the row with the lowest rowid will be inserted. Do an explain to verify the query will use the index. You could include the lowest rowid with a minor change, but be careful on the delete step.

INSERT INTO mymirror

SELECT t1.rowid,

t1.*

FROM mytable t1

WHERE t1.rowid > (SELECT MIN(t2.rowid)

FROM mytable t2

WHERE t1.mycolumn = t2.mycolumn);

Review the duplicates in the mirror table. Then when you are ready to delete the duplicates...

DELETE FROM mytable t

WHERE t.rowid IN (SELECT m.row_id

FROM mymirror m);

Bibliography