The GROUP BY Clause

Aggregating an entire table into one row is useful, but usually, we want to see statistics broken down by category—for example, total sales per region or average salary per department. This is where GROUP BY comes in.

Definition: The GROUP BY Clause

The GROUP BY clause groups rows that have the same values in specified columns into summary rows.

The Golden Rule of Aggregation

This is the most common source of errors for SQL beginners.

If you use an aggregate function (like SUM) in your SELECT list, every other column in that SELECT list MUST either be part of an aggregate function or appear in the GROUP BY clause.

Context: Logical Order Revisited

Remember the logical order of execution:

FROM
WHERE
GROUP BY
SELECT

Because GROUP BY happens before SELECT, the database first organizes the data into buckets (groups). Only then does it calculate the SUM or COUNT for each bucket.

Detailed Explanation: Syntax

SELECT column_name, AGGREGATE_FUNCTION(column_name)
FROM table_name
WHERE condition
GROUP BY column_name;

Grouping by Multiple Columns

You can group by multiple levels. For example, to see sales by Year AND Region: GROUP BY Year, Region

Example: Departmental Stats

We have an Employees table:

Name	Dept	Salary
Alice	IT	90000
Bob	Sales	70000
Charlie	IT	85000
Diana	Sales	65000

Task: Find the total payroll and headcount for each department.

SELECT
  Dept,
  COUNT(*) AS Headcount,
  SUM(Salary) AS TotalPayroll
FROM Employees
GROUP BY Dept;

Output

Dept	Headcount	TotalPayroll
IT	2	175000
Sales	2	135000

Under the Hood: The Internal Grouping Process

When the engine encounters GROUP BY Dept:

Sorting/Hashing: It either sorts the table by Dept or builds a hash map where each unique Dept name is a key.
Bucketing: As it scans the rows, it puts each row into its respective bucket.
Aggregating: It updates the running totals (COUNT, SUM) for that specific bucket.
Finalization: Once the table scan is complete, it outputs one row for every bucket created.

Performance Tip: Filtering Early

Always filter as much data as possible using WHERE before the GROUP BY.

WHERE reduces the number of rows that need to be sorted and bucketed.
The less data the GROUP BY has to process, the faster your query will run.