SQL eliminates duplicate records using SELECT DISTINCT for unique results, or ROW_NUMBER with a CTE to delete duplicate rows permanently.
Duplicate records enter databases through import errors, missing unique constraints, and application bugs. Left alone, they inflate report counts, distort analytics, and slow down queries. Whether you need unique output for a dashboard or permanent table cleanup, the right approach depends on how you define “duplicate” and what you want to keep. This article covers the three main strategies — output-only dedup with SELECT DISTINCT, diagnostic grouping with GROUP BY and HAVING, and permanent deletion using ROW_NUMBER() in a CTE — plus the alternatives that handle edge cases.
What Does “Duplicate” Actually Mean In Your SQL Table?
A duplicate is defined by your business rule, not by every column being identical. Two rows with the same email address are duplicates in a user table even if their last_login timestamps differ. Before writing any dedup query, decide which columns form the unique key. That decision determines the partition columns in ROW_NUMBER() and the group columns in GROUP BY. Without a clear definition, every dedup attempt will either miss rows or delete the wrong ones.
Once the key columns are set, the rest of the process follows the same shape regardless of which columns you use.
Finding Duplicates With GROUP BY And HAVING
Before deleting anything, identify which rows are duplicated and how many copies exist. A GROUP BY query with a HAVING clause does that without touching the data.
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
This returns each duplicate group and its row count. Run it first to confirm your duplicate definition catches the right rows and to get a sense of how much data you are about to clean. Use GROUP BY with all key columns when the duplicate key is composite — for example, GROUP BY customer_id, order_date.
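For a composite key, the same audit might look like the sketch below. The orders table and its order_id column are assumptions for illustration; only customer_id and order_date come from the example in the text.

```sql
-- Audit sketch: one output row per duplicated (customer_id, order_date) pair.
-- Assumes a hypothetical orders table with an order_id primary key.
SELECT customer_id,
       order_date,
       COUNT(*)      AS copies,          -- how many rows share this key
       MIN(order_id) AS first_order_id   -- candidate keeper if the lowest id wins
FROM orders
GROUP BY customer_id, order_date
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```

Sorting by the copy count puts the worst offenders first, which helps when spot-checking whether the key definition is right.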
Removing Duplicates From Query Results With SELECT DISTINCT
When you only need unique output, such as a report, an export, or a dashboard feed, SELECT DISTINCT is the simplest tool. It returns one row per unique combination of the selected columns without modifying the underlying table.
SELECT DISTINCT email, first_name, last_name
FROM users;
The database engine compares every row in the result set and suppresses duplicates. DISTINCT applies to all columns in the SELECT list, so two rows that differ on any returned column are kept as separate rows. On large datasets this can be expensive, but for ad hoc analysis and one-time exports it is the simplest path to clean output.
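The effect of the SELECT list on DISTINCT can be checked with two quick queries. This is a sketch against the users table used throughout; the last_login column is taken from the earlier definition of a duplicate and is assumed to exist.

```sql
-- Total rows versus unique values of the key column.
-- Note: COUNT(DISTINCT ...) ignores NULL emails.
SELECT COUNT(*)              AS total_rows,
       COUNT(DISTINCT email) AS unique_emails
FROM users;

-- Adding last_login to the SELECT list widens the comparison:
-- two rows with the same email but different logins are both returned.
SELECT DISTINCT email, last_login
FROM users;
```

If the second query returns more rows than unique_emails, the extra column is what keeps those rows apart.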
Eliminating Duplicate Records Permanently With ROW_NUMBER() And A CTE
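To delete extra rows from the table itself while keeping one row per duplicate group, the standard modern approach uses ROW_NUMBER() inside a common table expression. This method requires window-function support, which Microsoft introduced in SQL Server 2005 and which is now standard in PostgreSQL, MySQL 8+, Oracle, and other major databases. The DELETE against the CTE name shown in this section is SQL Server syntax; PostgreSQL and MySQL do not accept a CTE as a delete target, so there the same ranking has to be applied through a subquery on the base table.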
The pattern assigns a sequential number to each row within a duplicate group, ordered by your keeper rule. Rows with a number greater than 1 are the extras and get deleted.
WITH DuplicateCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY email
               ORDER BY user_id
           ) AS RowNum
    FROM users
)
DELETE FROM DuplicateCTE WHERE RowNum > 1;
The PARTITION BY clause defines the duplicate key: one or more columns separated by commas. The ORDER BY inside the window decides which row survives. In the example above, the row with the smallest user_id stays and the rows with larger values are removed; if user_id is assigned in insertion order, that keeps the oldest row. Swap the order to user_id DESC to keep the most recent row instead.
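For engines that do not accept a CTE as a delete target, the same ranking can be applied through a subquery. This is a minimal sketch, assuming PostgreSQL or MySQL 8+ and that user_id is the primary key of users.

```sql
-- Sketch for engines that reject DELETE against a CTE name
-- (assumed here: PostgreSQL, MySQL 8+). user_id is assumed to be the primary key.
DELETE FROM users
WHERE user_id IN (
    SELECT user_id
    FROM (
        SELECT user_id,
               ROW_NUMBER() OVER (
                   PARTITION BY email
                   ORDER BY user_id
               ) AS row_num
        FROM users
    ) AS ranked   -- derived table: MySQL will not read the delete target directly in a subquery
    WHERE row_num > 1
);
```

The keeper rule is unchanged: the ORDER BY inside the window still decides which row gets row_num = 1 and survives.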
| Method | Best For | Key Limitation |
|---|---|---|
| SELECT DISTINCT | Unique query output without table changes | Does not modify the table; only affects the result set |
| ROW_NUMBER() + CTE | Permanent deletion with precise keeper control | Requires window-function support (SQL Server 2005+) |
| GROUP BY + HAVING | Auditing duplicates before any deletion | Only identifies duplicates; does not remove them |
| Self-join DELETE | Cleanup without window functions | Complex syntax, easy to delete the wrong rows |
| Temp-table rebuild | Large tables needing a clean verified copy | Destructive until the new table is validated |
| GROUP BY + MIN(id) subquery | Simple keeper rule with a single key column | Clunky with composite keys, harder to read |
| DELETE with NOT IN subquery | Quick one-off cleanup in smaller tables | Slow on large tables; NOT IN with NULLs can fail silently |
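The last two rows of the table are often combined into one statement. This is a sketch under the assumption that user_id is a NOT NULL primary key, so the NOT IN list can never contain a NULL.

```sql
-- Keep the smallest user_id per email and delete every other row.
-- Assumes user_id is a NOT NULL primary key.
DELETE FROM users
WHERE user_id NOT IN (
    SELECT MIN(user_id)   -- one keeper id per email group
    FROM users
    GROUP BY email
);
```

Two caveats apply. GROUP BY places all NULL emails in a single group, so only one of those rows would survive. MySQL also rejects a subquery that reads the table being deleted from, so there the inner SELECT would need to be wrapped in a derived table.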
Alternative Patterns — Self-Join And Temp-Table Workflows
When window functions are not available — older database versions, limited permissions on a hosted platform — two alternative patterns get the job done.
Self-join delete. Join the table to itself on the duplicate key columns and delete the row with the higher key value. The ORDER BY keeper logic is replaced by a join condition that compares IDs. The DELETE alias FROM ... JOIN form shown here is accepted by SQL Server and MySQL.
DELETE u1
FROM users u1
INNER JOIN users u2
ON u1.email = u2.email
AND u1.user_id > u2.user_id;
This deletes every row where a matching row with a smaller user_id exists — effectively keeping the oldest entry per email.
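PostgreSQL does not support the DELETE alias FROM ... JOIN form, but its USING clause expresses the same self-join. This is a sketch of that variant against the same users table.

```sql
-- PostgreSQL sketch of the same self-join delete.
-- u1 is the row being considered for deletion; u2 is any row sharing its email.
DELETE FROM users AS u1
USING users AS u2
WHERE u1.email = u2.email
  AND u1.user_id > u2.user_id;   -- a smaller id exists, so u1 is an extra
```

Because NULL never equals NULL, rows with a NULL email are left untouched by either form of the self-join.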
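Temp-table rebuild. Copy unique rows into a staging table, verify the row count, truncate the original table, and reinsert the clean data. Microsoft’s official guidance demonstrates this as a two-step approach: SELECT DISTINCT ... INTO a backup table, then DELETE matching rows from the original and reinsert. Note that SELECT DISTINCT * only collapses rows that match in every column, so it removes nothing when the duplicates differ in a column such as user_id. The full workflow, in SQL Server syntax, looks like this: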
SELECT DISTINCT * INTO users_backup FROM users;
TRUNCATE TABLE users;
INSERT INTO users SELECT * FROM users_backup;
DROP TABLE users_backup;
The temp-table method is safest when you verify the row count after each step and keep a full backup until the cleanup is confirmed.
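When duplicates are defined by a key column rather than by the whole row, the staging step can pick one keeper per key instead of relying on DISTINCT. This is a sketch in SQL Server syntax; the users_clean table name is hypothetical, and email is assumed to be NOT NULL.

```sql
-- Stage one keeper per email: the row with the smallest user_id.
SELECT u.*
INTO users_clean                 -- hypothetical staging table name
FROM users AS u
INNER JOIN (
    SELECT email, MIN(user_id) AS keep_id
    FROM users
    GROUP BY email
) AS k
    ON u.user_id = k.keep_id;

-- Verification: the two counts should match before the original is touched.
SELECT (SELECT COUNT(*) FROM users_clean)           AS staged_rows,
       (SELECT COUNT(DISTINCT email) FROM users)    AS distinct_emails;
```

If the counts differ, the staging query or the key definition needs another look before anything is truncated.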
Microsoft’s detailed walkthrough of both the duplicate-table and ROW_NUMBER() approaches is available in SQL Server’s official dedup documentation.
Choosing The Right Dedup Method For Your Situation
No single method fits every scenario. The table below maps real-world situations to the recommended approach and explains why.
| Situation | Recommended Method | Why |
|---|---|---|
| Ad hoc report or export | SELECT DISTINCT | Fastest path to unique output, no table changes |
| Permanent cleanup with a specific keeper row | ROW_NUMBER() + CTE | Precise control via the ORDER BY clause |
| Audit duplicates before any changes | GROUP BY + HAVING | Zero risk of accidental deletion |
| No window function support in your database | Self-join or temp-table rebuild | Works on older MySQL, SQL Server 2000, and restricted environments |
| Very large table with no backup window | ROW_NUMBER() with batch delete | Can be wrapped in a loop to avoid transaction log bloat |
| Need a verified clean copy before switching | Temp-table rebuild | Lets you validate row counts and indexes before swapping |
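The batch-delete row can be sketched as a loop in T-SQL. The batch size of 5000 is an arbitrary assumption, and user_id is assumed to be the primary key.

```sql
-- T-SQL sketch: delete extras in batches until none remain.
DECLARE @batch_size INT = 5000;   -- assumed value; tune for your log and lock budget
DECLARE @deleted INT = 1;

WHILE @deleted > 0
BEGIN
    DELETE TOP (@batch_size) FROM users
    WHERE user_id IN (
        SELECT user_id
        FROM (
            SELECT user_id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY user_id
                   ) AS row_num
            FROM users
        ) AS ranked
        WHERE row_num > 1          -- extras only; the keeper always ranks 1
    );

    SET @deleted = @@ROWCOUNT;     -- rows removed by the DELETE just above
END;
```

Each pass re-ranks the table, so the keeper per email never changes between batches. An index on (email, user_id) would likely reduce the cost of that repeated ranking.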
Common Mistakes That Derail Dedup Queries
Even experienced developers hit these traps. Run through this checklist before executing any dedup logic on production data.
- Defining duplicates by the wrong columns. A row is only a duplicate if its key columns match another row. Adding a timestamp column to the PARTITION BY clause by accident means every row looks unique and nothing gets deleted.
- Using SELECT DISTINCT when you meant to delete rows. DISTINCT never touches the table. If the goal is permanent cleanup, DISTINCT only hides the problem temporarily.
- Omitting the ORDER BY in ROW_NUMBER(). The ORDER BY determines which row survives. Without it, the query either fails or picks an arbitrary row, which can surprise you on the next run.
- Skipping the pre-deletion audit. A GROUP BY ... HAVING COUNT(*) > 1 query costs almost nothing and reveals exactly how many duplicates exist. Running a delete without this check can wipe more rows than expected.
- Deleting without a backup on large tables. The temp-table rebuild method is inherently destructive until the new table passes verification. Keep a full backup until the cleanup is confirmed.
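The first mistake in the checklist is easy to test for before deleting anything. This is a sketch that counts how many rows each partition definition would flag, assuming users has the last_login column mentioned earlier.

```sql
-- Compare how many extras each partition definition would mark for deletion.
SELECT SUM(CASE WHEN rn_key   > 1 THEN 1 ELSE 0 END) AS extras_by_email,
       SUM(CASE WHEN rn_wrong > 1 THEN 1 ELSE 0 END) AS extras_by_email_and_login
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY email
                              ORDER BY user_id) AS rn_key,
           ROW_NUMBER() OVER (PARTITION BY email, last_login
                              ORDER BY user_id) AS rn_wrong
    FROM users
) AS ranked;
```

A much smaller second number suggests the timestamp is splitting real duplicate groups apart.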
Dedup Logic You Can Apply Today
Start with the GROUP BY ... HAVING audit to confirm your duplicate definition. For one-time output, use SELECT DISTINCT. For permanent cleanup with a predictable keeper row, the ROW_NUMBER() CTE pattern is the most precise option. Test the query on a copy of the table or inside a transaction so you can roll back if the result isn’t what you expected. With the duplicate key defined and the keeper rule set, any of these methods produces clean, reliable data.
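A rehearsal inside a transaction might look like the sketch below. It uses SQL Server syntax and the same users table; nothing is kept until the final ROLLBACK is replaced with a commit.

```sql
-- T-SQL sketch: rehearse the cleanup, inspect the result, then undo it.
BEGIN TRANSACTION;

WITH DuplicateCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY email
               ORDER BY user_id
           ) AS RowNum
    FROM users
)
DELETE FROM DuplicateCTE WHERE RowNum > 1;

-- Any rows returned here mean duplicates survived the delete.
SELECT email, COUNT(*) AS copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

ROLLBACK TRANSACTION;   -- switch to COMMIT TRANSACTION once the result looks right
```

An empty result from the audit query, together with a row count you expected, is the signal that the same statements are safe to commit.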
References & Sources
- Microsoft Learn. “How to remove duplicate rows from a SQL Server table by using a script.” Official guidance with ROW_NUMBER() and duplicate-table examples.
