No matter where the source data is stored, as a part of the ETL workflow, we will stage source data in a SQL Server 2005 table.
This is the staging table:
create table SourceWithDuplicates
(SurrogateKey int identity(1,1),
Code varchar(12),
Description varchar(20),
OtherAttribute1 int,
OtherAttribute2 Varchar(10),
TieBreakerColumn Varchar(12))
As a general rule I always add a surrogate key column to my staging tables (that makes some DBAs happy); an in this case it would actually help in the de-duplication process.
Some sample data:
Insert into SourceWithDuplicates values(‘Abc’, ‘Row Description 1′, 1, ‘Other1′,’1′)
Insert into SourceWithDuplicates values(‘Abc’, ‘Row Description 1′, 1, ‘Other1′, ‘1′)
Insert into SourceWithDuplicates values(‘Abc’, ‘Row Description 1′, 1, ‘Other2′,’2′)\
Insert into SourceWithDuplicates values(‘Def’, ‘Row Description 2′, 1, ‘Other4′,’a')
Insert into SourceWithDuplicates values(‘Def’, ‘Row Description 2′, 1, ‘Other5′,’a')
Insert into SourceWithDuplicates values(‘Ghi’, ‘Row Description 3′, 1, ‘Other5′,’a')
Insert into SourceWithDuplicates values(‘Jkl’, ‘Row Description 4′, 1, ‘Other5′,’a')
The requirements:
Code and description values need to be unique in the target table; in case they are duplicates, choose the row with the highest value in the tie breaker column. If the value in the tiebreaker column is also duplicated; then choose any of the rows.
So the data looks like:
SurrogateKey Code Description OtherAttribute1 OtherAttribute2 TieBreaker ———— ———— ——————– ————— ————— ———-
1 Abc Row Description 1 1 Other1 1
2 Abc Row Description 1 1 Other1 1
3 Abc Row Description 1 1 Other2 2
4 Def Row Description 2 1 Other4 a
5 Def Row Description 2 1 Other5 a
6 Ghi Row Description 3 1 Other5 a
7 Jkl Row Description 4 1 Other5 a
Following the requirements, only 4 rows kept should be chosen; those with surrogate keys:2 3, 4 or 5, 6 and 7
This is the query using the Rank() function that resolve our issue:
Select *,
rank() over (Partition by Code, description order by TieBreakerColumn desc, SurrogateKey) MyRank
from SourceWithDuplicates
and this is its output:
Surrogatekey Code Description MyRank
———— ———— ——————– ——————–
3 Abc Row Description 1 1
1 Abc Row Description 1 2
2 Abc Row Description 1 3
4 Def Row Description 2 1
5 Def Row Description 2 2
6 Ghi Row Description 3 1
7 Jkl Row Description 4 1
Notice that if we filter the result from the previous query using WHERE MyRank=1 we will get the rows to be kept.