Skip to content

Project 3: The Data Engineer

In this project, you will move from consuming data to architecting it. You have been given a messy, denormalized CSV file containing sales records for a global logistics company. Your job is to design and build a normalized Relational Database (3rd Normal Form) to store this data efficiently.

The raw file logistics_dump.csv has 50 columns. Every row contains redundant info like the Customer’s address, the Shipper’s phone number, and the Product’s weight. This is causing data errors and wasting space.

You need to split this one giant table into at least four related tables:

  1. Customers (ID, Name, Email, Address)
  2. Products (ID, Name, Weight, Price)
  3. Shippers (ID, CompanyName, Phone)
  4. Orders (ID, Date, CustomerID, ShipperID, Total)
  1. Draft the Schema
    Define the data types and Primary/Foreign Key relationships for your four tables.

  2. Create the Objects
    Write the CREATE TABLE statements, including all necessary NOT NULL and UNIQUE constraints.

  3. The ETL (Extract, Transform, Load) Process
    Write INSERT INTO ... SELECT DISTINCT ... statements to move data from the RawDump table into your new normalized tables.

  4. Enforce Integrity
    Add CHECK constraints to ensure prices are positive and FOREIGN KEY constraints with ON DELETE CASCADE rules.

  • Data Definition (DDL): Building the table structures.
  • Data Manipulation (DML): Using subqueries to populate your new tables without duplicates.
  • Integrity Management: Setting up the “guards” that protect your new architecture.

Create a Trigger or a View that automatically calculates the “Total Weight” for an order by summing the weight of all products associated with that order ID.

Engineering a View
CREATE VIEW OrderWeightSummary AS
SELECT
o.OrderID,
SUM(p.Weight * oi.Quantity) AS TotalWeight
FROM Orders o
JOIN OrderItems oi ON o.OrderID = oi.OrderID
JOIN Products p ON oi.ProductID = p.ID
GROUP BY 1;

A single SQL script that, when run on a fresh database:

  1. Creates the schema.
  2. Populates the data from a temporary table.
  3. Verifies the integrity of the data (e.g., checking for orphaned records).