Distinct Values

Data often contains repetitive information. To find unique entries or count how many unique categories exist, we use the DISTINCT keyword.

Definition: The DISTINCT Keyword

DISTINCT is used in the SELECT clause to remove duplicate rows from the result set. It ensures that every row returned is unique across all selected columns.

Context: Usage Patterns

DISTINCT is used in two primary ways:

Row-level uniqueness: SELECT DISTINCT Category FROM Products;
Inside an aggregate: SELECT COUNT(DISTINCT Category) FROM Products;

Detailed Explanation: Syntax and Behavior

1. DISTINCT on Multiple Columns

When you use DISTINCT with multiple columns, the database looks at the combination of those columns. SELECT DISTINCT City, Country FROM Customers; This will return one row for “Paris, France” and one for “Paris, Texas”. It only removes a row if both City and Country are identical to another row.

2. COUNT DISTINCT

This is a powerful tool for finding the number of unique items in a set. SELECT COUNT(DISTINCT CustomerID) FROM Orders; This tells you how many unique customers placed orders, regardless of how many total orders exist in the table.

Example: Customer Activity

We have an Orders table:

OrderID	CustomerID	OrderMonth
1	501	January
2	501	January
3	502	January
4	501	February

Task: How many unique customers placed orders in January?

SELECT COUNT(DISTINCT CustomerID) AS UniqueJanuaryCustomers
FROM Orders
WHERE OrderMonth = 'January';

Output

UniqueJanuaryCustomers
2

(Note: Total rows for January was 3, but Customer 501 was a repeat customer, so they only count once.)

Under the Hood: The Uniqueness Operation

How does the database find duplicates?

Sort: It sorts the entire result set. Once sorted, duplicates will be adjacent to each other. The engine then simply skips any row that is identical to the one before it.
Hash: It builds a hash table. As it processes rows, it checks if the “hash” of that row already exists in the table. If yes, it discards the row.

The Performance Warning

Both sorting and hashing are expensive operations that require CPU and memory. Using DISTINCT on a result set of 10 million rows can be very slow.

Summary of Aggregation Logic

Scenario	Tool
”What is the total…”	`SUM()`
”How many unique…”	`COUNT(DISTINCT ...)`
”For each category…”	`GROUP BY`
”Only show groups where…”	`HAVING`
”Remove duplicates from results…”	`DISTINCT`