Advanced Problems
As you move into advanced SQL, you will encounter problems that cannot be solved with simple filters or joins. These often involve analyzing patterns across rows. The most famous of these is the Gaps and Islands problem.
Definition: Gaps and Islands
Section titled “Definition: Gaps and Islands”- Islands: Contiguous sequences of data (e.g., a user who logged in for 5 days in a row).
- Gaps: Missing ranges in a sequence (e.g., identifying which ID numbers were skipped in an invoice table).
Context: The Login Streak
Section titled “Context: The Login Streak”Imagine you are building a “Gamified” app and you want to reward users who have a “7-day login streak.” The data is stored as a simple list of dates. How do you find the start, end, and duration of every streak for every user?
Detailed Explanation: The “Difference of Row Numbers” Technique
Section titled “Detailed Explanation: The “Difference of Row Numbers” Technique”The most elegant way to solve Gaps and Islands is using two ROW_NUMBER() functions.
- Calculate a row number for the whole set (
seq). - Calculate a row number for each value group (
val_seq). - The Difference between these two numbers will be constant for any contiguous “island.”
Step-by-Step Logic
Section titled “Step-by-Step Logic”| Date | RowNum | DateNum (Days since base) | Difference (Group ID) |
|---|---|---|---|
| Jan 1 | 1 | 1 | 0 |
| Jan 2 | 2 | 2 | 0 |
| Jan 3 | 3 | 3 | 0 |
| Jan 5 | 4 | 5 | 1 |
| Jan 6 | 5 | 6 | 1 |
Notice how the “Difference” jumped from 0 to 1 when the date skipped Jan 4th? This difference identifies our islands.
Example: Finding Continuous Streaks
Section titled “Example: Finding Continuous Streaks”WITH Groups AS ( SELECT UserID, LoginDate, LoginDate - (ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY LoginDate) * INTERVAL '1 day') AS GroupID FROM Logins)SELECT UserID, MIN(LoginDate) AS StreakStart, MAX(LoginDate) AS StreakEnd, COUNT(*) AS DaysInRowFROM GroupsGROUP BY UserID, GroupIDHAVING COUNT(*) >= 7;Under the Hood: Set-Based vs. Procedural
Section titled “Under the Hood: Set-Based vs. Procedural”A beginner might try to solve this by writing a loop in Python or a Stored Procedure that iterates row-by-row.
- Procedural (Loop): $O(N)$ but requires moving all data from the database to the application.
- Set-Based (SQL): $O(N \log N)$ due to the sort, but it stays “Close to the data” and can be executed in parallel by the database engine.
Other Advanced Patterns
Section titled “Other Advanced Patterns”- Median Calculation: Since SQL doesn’t have a native
MEDIAN()function in all dialects, you must usePERCENTILE_CONT(0.5). - Cumulative Sum to a Target: Finding how many orders it takes to reach $1,000 in revenue (requires a window function with a filter).