Skip to content

Advanced Problems

As you move into advanced SQL, you will encounter problems that cannot be solved with simple filters or joins. These often involve analyzing patterns across rows. The most famous of these is the Gaps and Islands problem.

  • Islands: Contiguous sequences of data (e.g., a user who logged in for 5 days in a row).
  • Gaps: Missing ranges in a sequence (e.g., identifying which ID numbers were skipped in an invoice table).

Imagine you are building a “Gamified” app and you want to reward users who have a “7-day login streak.” The data is stored as a simple list of dates. How do you find the start, end, and duration of every streak for every user?

Detailed Explanation: The “Difference of Row Numbers” Technique

Section titled “Detailed Explanation: The “Difference of Row Numbers” Technique”

The most elegant way to solve Gaps and Islands is using two ROW_NUMBER() functions.

  1. Calculate a row number for the whole set (seq).
  2. Calculate a row number for each value group (val_seq).
  3. The Difference between these two numbers will be constant for any contiguous “island.”
DateRowNumDateNum (Days since base)Difference (Group ID)
Jan 1110
Jan 2220
Jan 3330
Jan 5451
Jan 6561

Notice how the “Difference” jumped from 0 to 1 when the date skipped Jan 4th? This difference identifies our islands.

Gaps and Islands Solution
WITH Groups AS (
SELECT
UserID,
LoginDate,
LoginDate - (ROW_NUMBER() OVER(PARTITION BY UserID ORDER BY LoginDate) * INTERVAL '1 day') AS GroupID
FROM Logins
)
SELECT
UserID,
MIN(LoginDate) AS StreakStart,
MAX(LoginDate) AS StreakEnd,
COUNT(*) AS DaysInRow
FROM Groups
GROUP BY UserID, GroupID
HAVING COUNT(*) >= 7;

A beginner might try to solve this by writing a loop in Python or a Stored Procedure that iterates row-by-row.

  • Procedural (Loop): $O(N)$ but requires moving all data from the database to the application.
  • Set-Based (SQL): $O(N \log N)$ due to the sort, but it stays “Close to the data” and can be executed in parallel by the database engine.
  • Median Calculation: Since SQL doesn’t have a native MEDIAN() function in all dialects, you must use PERCENTILE_CONT(0.5).
  • Cumulative Sum to a Target: Finding how many orders it takes to reach $1,000 in revenue (requires a window function with a filter).