Skip to main content

Dimensional Modeling 101 - Design Better Data Warehouses

Tom FynesTom Fynes
6 min read

If you're building a data warehouse, dimensional modeling is your best friend. It makes data easier to query, understand, and analyze. Let's break down this powerful technique!

What is Dimensional Modeling?

Dimensional modeling is a design technique optimized for data retrieval and analytics. Unlike transactional databases (OLTP) which are optimized for writes, dimensional models (OLAP) are built for blazing-fast reads and intuitive queries.

Think of it as organizing your data warehouse like a well-structured library, where finding what you need is quick and intuitive.

The Two Key Components

Fact Tables: Contain measurements, metrics, or facts about your business (sales, clicks, transactions)

Dimension Tables: Contain descriptive attributes about the facts (who, what, when, where, why)

Star Schema: The Foundation

The most common dimensional model is the star schema. It looks like this:

        Dim_Date
           |
           |
Dim_Product --- Fact_Sales --- Dim_Customer
           |
           |
        Dim_Store

One central fact table surrounded by dimension tables - like a star!

Let's Build an Example: E-Commerce Sales

Here's a practical example of dimensional modeling for an e-commerce business:

-- Fact Table: Contains the measurements
CREATE TABLE fact_sales (
    sale_id BIGSERIAL PRIMARY KEY,
    date_key INTEGER NOT NULL,
    product_key INTEGER NOT NULL,
    customer_key INTEGER NOT NULL,
    store_key INTEGER NOT NULL,
    
    -- Measurements (facts)
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2) DEFAULT 0,
    tax_amount DECIMAL(10,2) NOT NULL,
    total_amount DECIMAL(10,2) NOT NULL,
    cost_amount DECIMAL(10,2) NOT NULL,
    profit_amount DECIMAL(10,2) NOT NULL,
    
    -- Foreign keys to dimensions
    FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key),
    FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key),
    FOREIGN KEY (store_key) REFERENCES dim_store(store_key)
);

-- Dimension Table: Date
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    date DATE NOT NULL,
    day_of_week VARCHAR(10),
    day_of_month INTEGER,
    day_of_year INTEGER,
    week_of_year INTEGER,
    month INTEGER,
    month_name VARCHAR(10),
    quarter INTEGER,
    year INTEGER,
    is_weekend BOOLEAN,
    is_holiday BOOLEAN,
    holiday_name VARCHAR(50)
);

-- Dimension Table: Product
CREATE TABLE dim_product (
    product_key SERIAL PRIMARY KEY,
    product_id VARCHAR(50) NOT NULL, -- Business key
    product_name VARCHAR(200) NOT NULL,
    brand VARCHAR(100),
    category VARCHAR(100),
    subcategory VARCHAR(100),
    unit_cost DECIMAL(10,2),
    supplier_name VARCHAR(100),
    
    -- SCD Type 2 fields (we'll explain this later)
    effective_date DATE NOT NULL,
    expiration_date DATE,
    is_current BOOLEAN DEFAULT TRUE,
    
    UNIQUE(product_id, effective_date)
);

-- Dimension Table: Customer
CREATE TABLE dim_customer (
    customer_key SERIAL PRIMARY KEY,
    customer_id VARCHAR(50) NOT NULL,
    customer_name VARCHAR(200),
    email VARCHAR(100),
    customer_segment VARCHAR(50),
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(50),
    signup_date DATE,
    
    effective_date DATE NOT NULL,
    expiration_date DATE,
    is_current BOOLEAN DEFAULT TRUE
);

-- Dimension Table: Store
CREATE TABLE dim_store (
    store_key SERIAL PRIMARY KEY,
    store_id VARCHAR(50) NOT NULL,
    store_name VARCHAR(100),
    store_type VARCHAR(50),
    address VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    region VARCHAR(50),
    manager_name VARCHAR(100),
    opening_date DATE
);

Querying is Beautiful

With this model, complex analytics become simple:

-- Total sales by product category and month
SELECT 
    p.category,
    d.year,
    d.month_name,
    SUM(f.total_amount) as total_sales,
    SUM(f.profit_amount) as total_profit,
    COUNT(DISTINCT f.customer_key) as unique_customers
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2024
  AND p.is_current = TRUE
GROUP BY p.category, d.year, d.month_name
ORDER BY total_sales DESC;

-- Customer segmentation analysis
SELECT 
    c.customer_segment,
    c.country,
    COUNT(DISTINCT f.customer_key) as customer_count,
    AVG(f.total_amount) as avg_order_value,
    SUM(f.quantity) as total_items_sold
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.date >= CURRENT_DATE - INTERVAL '90 days'
  AND c.is_current = TRUE
GROUP BY c.customer_segment, c.country;

Notice how easy these queries are to read and write!

Slowly Changing Dimensions (SCD)

What happens when a customer moves to a new city? Or a product's price changes? We need to track history!

Type 1: Overwrite (No History)

Simply update the record - easiest but loses history:

UPDATE dim_customer
SET city = 'New York', state = 'NY'
WHERE customer_id = 'C12345' AND is_current = TRUE;

Type 2: Add New Row (Full History)

Most common approach - create a new record:

def update_customer_scd_type2(customer_id, new_city, new_state):
    """Handle SCD Type 2 update for customer dimension"""
    
    # Expire the old record
    cursor.execute("""
        UPDATE dim_customer
        SET expiration_date = CURRENT_DATE,
            is_current = FALSE
        WHERE customer_id = %s AND is_current = TRUE
    """, (customer_id,))
    
    # Insert new record
    cursor.execute("""
        INSERT INTO dim_customer 
        (customer_id, customer_name, city, state, 
         effective_date, is_current)
        SELECT 
            customer_id, customer_name, %s, %s,
            CURRENT_DATE, TRUE
        FROM dim_customer
        WHERE customer_id = %s AND expiration_date = CURRENT_DATE
    """, (new_city, new_state, customer_id))

Now you can analyze sales with historical customer locations!

Type 3: Add New Column (Limited History)

Track just current and previous values:

ALTER TABLE dim_customer
ADD COLUMN previous_city VARCHAR(100),
ADD COLUMN previous_state VARCHAR(50),
ADD COLUMN city_change_date DATE;

UPDATE dim_customer
SET previous_city = city,
    previous_state = state,
    city = 'New York',
    state = 'NY',
    city_change_date = CURRENT_DATE
WHERE customer_id = 'C12345';

Date Dimension: The Workhorse

The date dimension deserves special attention. Here's how to populate it:

import pandas as pd
from datetime import datetime, timedelta

def generate_date_dimension(start_date, end_date):
    """Generate a complete date dimension table"""
    
    dates = pd.date_range(start=start_date, end=end_date, freq='D')
    
    date_dim = pd.DataFrame({
        'date_key': dates.strftime('%Y%m%d').astype(int),
        'date': dates,
        'day_of_week': dates.day_name(),
        'day_of_month': dates.day,
        'day_of_year': dates.dayofyear,
        'week_of_year': dates.isocalendar().week,
        'month': dates.month,
        'month_name': dates.month_name(),
        'quarter': dates.quarter,
        'year': dates.year,
        'is_weekend': dates.dayofweek.isin([5, 6])
    })
    
    # Add holiday logic (simplified)
    holidays = ['2024-01-01', '2024-07-04', '2024-12-25']
    date_dim['is_holiday'] = date_dim['date'].astype(str).isin(holidays)
    
    return date_dim

# Generate 5 years of dates
date_dim = generate_date_dimension('2020-01-01', '2024-12-31')
date_dim.to_sql('dim_date', engine, if_exists='replace', index=False)

Fact Tables: Additive vs. Non-Additive

Additive Facts: Can be summed across all dimensions (quantity, amount)

-- This makes sense
SELECT SUM(total_amount) FROM fact_sales;

Semi-Additive: Can be summed across some dimensions (inventory levels)

-- Summing inventory across time doesn't make sense
-- But summing across products does
SELECT product_key, SUM(units_in_stock)
FROM fact_inventory
WHERE date_key = 20240115
GROUP BY product_key;

Non-Additive: Can't be summed (ratios, percentages)

-- Don't sum percentages or ratios!
-- Instead, recalculate them
SELECT 
    category,
    SUM(total_amount) / SUM(quantity) as avg_price
FROM fact_sales
GROUP BY category;

Best Practices

Use surrogate keys: Integer keys (product_key) instead of natural keys (product_id) for better performance

Denormalize dimensions: Include redundant data to avoid joins within dimensions

-- Good: Denormalized
dim_product: product_id, name, category, subcategory, brand

-- Avoid: Normalized (requires joins)
dim_product: product_id, name, category_id
dim_category: category_id, category_name, subcategory_id

Keep fact tables narrow: Only include true measurements

Pre-calculate when possible: Add calculated fields to facts

-- Add derived facts at load time
profit_amount = total_amount - cost_amount
profit_margin = profit_amount / total_amount

Snowflake Schema: When to Use It

Sometimes you'll hear about snowflake schemas - these are normalized dimensional models:

              Dim_Category
                   |
Dim_Date --- Fact_Sales --- Dim_Product --- Dim_Brand
                   |
              Dim_Customer --- Dim_Segment

Use snowflake schemas only when:

  • Storage is extremely expensive (rare nowadays)
  • Dimension tables are massive
  • You need to enforce referential integrity

For most use cases, stick with star schemas!

Tools to Help

Wrapping Up

Dimensional modeling makes your data warehouse intuitive and performant. Start with a simple star schema, use surrogate keys, and let your fact and dimension tables do the heavy lifting.

Your analysts will thank you for the simple, fast queries!

Enjoyed this post? Want more content like this? Check out my other posts.

Tom Fynes

Tom Fynes

Senior Data Engineer @ Fynes Forge

Writing about data engineering, pipelines, and the tools that make it all work.