Distinct Values
Data often contains repetitive information. To find unique entries or count how many unique categories exist, we use the DISTINCT keyword.
Definition: The DISTINCT Keyword
Section titled “Definition: The DISTINCT Keyword”DISTINCT is used in the SELECT clause to remove duplicate rows from the result set. It ensures that every row returned is unique across all selected columns.
Context: Usage Patterns
Section titled “Context: Usage Patterns”DISTINCT is used in two primary ways:
- Row-level uniqueness:
SELECT DISTINCT Category FROM Products; - Inside an aggregate:
SELECT COUNT(DISTINCT Category) FROM Products;
Detailed Explanation: Syntax and Behavior
Section titled “Detailed Explanation: Syntax and Behavior”1. DISTINCT on Multiple Columns
Section titled “1. DISTINCT on Multiple Columns”When you use DISTINCT with multiple columns, the database looks at the combination of those columns.
SELECT DISTINCT City, Country FROM Customers;
This will return one row for “Paris, France” and one for “Paris, Texas”. It only removes a row if both City and Country are identical to another row.
2. COUNT DISTINCT
Section titled “2. COUNT DISTINCT”This is a powerful tool for finding the number of unique items in a set.
SELECT COUNT(DISTINCT CustomerID) FROM Orders;
This tells you how many unique customers placed orders, regardless of how many total orders exist in the table.
Example: Customer Activity
Section titled “Example: Customer Activity”We have an Orders table:
| OrderID | CustomerID | OrderMonth |
|---|---|---|
| 1 | 501 | January |
| 2 | 501 | January |
| 3 | 502 | January |
| 4 | 501 | February |
Task: How many unique customers placed orders in January?
Section titled “Task: How many unique customers placed orders in January?”SELECT COUNT(DISTINCT CustomerID) AS UniqueJanuaryCustomersFROM OrdersWHERE OrderMonth = 'January';Output
Section titled “Output”| UniqueJanuaryCustomers |
|---|
| 2 |
(Note: Total rows for January was 3, but Customer 501 was a repeat customer, so they only count once.)
Under the Hood: The Uniqueness Operation
Section titled “Under the Hood: The Uniqueness Operation”How does the database find duplicates?
- Sort: It sorts the entire result set. Once sorted, duplicates will be adjacent to each other. The engine then simply skips any row that is identical to the one before it.
- Hash: It builds a hash table. As it processes rows, it checks if the “hash” of that row already exists in the table. If yes, it discards the row.
The Performance Warning
Section titled “The Performance Warning”Both sorting and hashing are expensive operations that require CPU and memory. Using DISTINCT on a result set of 10 million rows can be very slow.
Summary of Aggregation Logic
Section titled “Summary of Aggregation Logic”| Scenario | Tool |
|---|---|
| ”What is the total…” | SUM() |
| ”How many unique…” | COUNT(DISTINCT ...) |
| ”For each category…” | GROUP BY |
| ”Only show groups where…” | HAVING |
| ”Remove duplicates from results…” | DISTINCT |