Tuesday, March 14, 2006

Deleting Duplicate Records in SQL Server

This article explains a solution which can be used for deleting non-unique duplicate records from a table. In Oracle, every row in a table has a unique identifier so deleting duplicate records can be done as each record is unique based on its rowid (rowid is the mechanism which Oracle uses to physically locate the records in a table). But as SQL Server 7.0/2000 does not have the concept of row identifier it’s difficult to delete duplicate records from a table.

I have created a Script which I have found very useful when I am faced with such a problem. In the example below I have tried to explain how the script works. First let’s create a non indexed table using the script given below:

/*********************************************************/
/* Script for Creation of Employee Table*/
CREATE TABLE [dbo].[Employee] (
[id] [int] NULL ,
[name] [Varchar] (50) COLLATE Latin1_General_CI_AS NULL ,
[salary] [Numeric](18, 2) NULL
) ON [PRIMARY]
GO
/*********************************************************/

Then add some records into the table using the script given below. Note that there are 5 similar records being inserted into the table

/**********************************************************/
/* Script for Insertion of duplicate records to the Table */
Insert into employee values (1,'Ram', 1000.00)
Insert into employee values (1,'Ram', 1000.00)
Insert into employee values (2,'Joe', 2000.00)
Insert into employee values (2,'Joe', 1000.00)
Insert into employee values (3,'Mary', 1000.00)
Insert into employee values (4,'Julie', 5000.00)
Insert into employee values (2,'Joe', 1000.00)
Insert into employee values (1,'Ram', 1000.00)

/**********************************************************/

The table below shows the data present in the ‘Employee’ table.

id Name Salary
1 Ram 1000

1 Ram 1000
2 Joe 2000
2 Joe 1000
3 Mary 1000
4 Julie 5000
2 Joe 1000
1 Ram 1000

The following script deletes the duplicates from ‘Employee’ table. If there are duplicate records then the script will get a count of the duplicate records and will then delete the records till the count is 1.

/*******************************************************************/
/* Script for deletion of Duplicate record from the Employee Table */
Declare @id int,
@name varchar (50),
@cnt int,
@salary numeric

Declare getallrecords cursor local static For
Select count (1), id, name, salary
from employee (nolock)
group by id, name,salary having count(1)>1

Open getallrecords

Fetch next from getallrecords into @cnt,@id,@name,@salary
--Cursor to check with all other records
While @@fetch_status=0
Begin
Set @cnt= @cnt-1
Set rowcount @cnt

-- Deleting the duplicate records. Observe that all fields are mentioned at the where condition
Delete from employee where id=@id and name=@name
and salary=@salary

Set rowcount 0

Fetch next from getallrecords into @cnt,@id,@name,@salary
End

Close getallrecords
Deallocate getallrecords

*******************************************************************

The logic of the script is pretty simple; the select query retrieves all the records that are duplicates i.e. having Count greater than one. The result set is retrieved by opening a local cursor which fetches one row at a time.

Note here that the Count column is a part of the select query; this is used to identify the no of duplicate rows in the result set.

The row count has been set to (Value obtained from the Count Column – 1). SQL Server uses rowcount to stop processing the query after the specified numbers of rows are returned. The delete statement is executed only to delete the rows set by the Set rowcount command. Once the records have been deleted, the rowcount of SQL server is reset to the default value of 0.



... Original Article on sqlserver central

Digg it !