Data Engineering Best Practices

Comprehensive guide to building robust, scalable data pipelines

1. Data Pipeline Design Patterns

Effective data pipeline design is crucial for building scalable, maintainable data infrastructure. Here are the core patterns every data engineer should know:

Batch Processing Pattern

Best for processing large volumes of data at scheduled intervals.

When to use: Daily aggregations, nightly ETL jobs, historical data processing, monthly reports
-- Example: Daily batch aggregation
WITH daily_metrics AS (
    SELECT
        DATE(timestamp) AS date,
        user_id,
        COUNT(*) AS event_count,
        SUM(revenue) AS total_revenue
    FROM events
    WHERE DATE(timestamp) = CURRENT_DATE - 1
    GROUP BY 1, 2
)
SELECT * FROM daily_metrics;

Stream Processing Pattern

Real-time processing of data as it arrives.

When to use: Real-time analytics, fraud detection, live dashboards, event-driven architectures
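
The pattern can be sketched with a minimal event-at-a-time processor. This is a hypothetical in-memory stand-in for a real streaming framework (Kafka consumers, Flink, Spark Structured Streaming); the class and field names are illustrative:

```python
from collections import defaultdict

# Stream-processing sketch: each event is handled as it arrives,
# updating running aggregates instead of waiting for a batch window.
class StreamProcessor:
    def __init__(self, fraud_threshold=10_000):
        self.revenue_by_user = defaultdict(float)
        self.alerts = []
        self.fraud_threshold = fraud_threshold

    def process(self, event):
        # Update a running aggregate in real time
        self.revenue_by_user[event['user_id']] += event['revenue']
        # Emit an alert the moment a rule is violated (fraud detection)
        if event['revenue'] > self.fraud_threshold:
            self.alerts.append(event)

processor = StreamProcessor()
for event in [{'user_id': 'u1', 'revenue': 50.0},
              {'user_id': 'u1', 'revenue': 20_000.0}]:
    processor.process(event)

print(processor.revenue_by_user['u1'])  # 20050.0
print(len(processor.alerts))            # 1
```

The key contrast with the batch SQL above: aggregates are continuously current, at the cost of managing state in the processor.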

Lambda Architecture

Combines batch and stream processing for both real-time and historical analysis.

  • Batch Layer: Processes complete historical dataset
  • Speed Layer: Handles real-time data with low latency
  • Serving Layer: Merges results from both layers
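
A sketch of the serving layer's merge step, assuming the batch view holds complete counts up to the last batch run and the speed view holds increments since then (all names are illustrative):

```python
# Serving-layer sketch: combine precomputed batch results with
# real-time increments from the speed layer at query time.
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    merged = dict(batch_view)
    for key, realtime_count in speed_view.items():
        merged[key] = merged.get(key, 0) + realtime_count
    return merged

batch_view = {'page_a': 1000, 'page_b': 500}   # complete history
speed_view = {'page_a': 12, 'page_c': 3}       # since last batch run
print(merge_views(batch_view, speed_view))
# {'page_a': 1012, 'page_b': 500, 'page_c': 3}
```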

Medallion Architecture (Bronze/Silver/Gold)

Progressive data refinement through layers:

  • Bronze (Raw): Ingested data as-is, minimal transformation
  • Silver (Cleaned): Validated, deduplicated, standardized
  • Gold (Business): Aggregated, business-ready metrics
-- Bronze Layer: Raw ingestion
CREATE TABLE bronze.events AS
SELECT * FROM source_system;

-- Silver Layer: Cleaned data
CREATE TABLE silver.events AS
SELECT DISTINCT
    *,
    CAST(timestamp AS TIMESTAMP) AS event_time,
    LOWER(TRIM(user_email)) AS normalized_email
FROM bronze.events
-- Standard SQL can't reference the event_time alias in WHERE,
-- so repeat the expression here
WHERE CAST(timestamp AS TIMESTAMP) >= '2026-01-01';

-- Gold Layer: Business metrics
CREATE TABLE gold.daily_user_metrics AS
SELECT
    DATE(event_time) AS date,
    COUNT(DISTINCT user_id) AS dau,
    SUM(revenue) AS daily_revenue
FROM silver.events
GROUP BY 1;

2. ETL vs ELT Strategies

ETL (Extract, Transform, Load)

Transform data before loading into the warehouse.

Pros:
  • Reduces warehouse storage costs
  • Data arrives clean and ready
  • Better for limited compute resources
Cons:
  • Less flexibility for ad-hoc analysis
  • Transformation logic scattered
  • Harder to reprocess historical data
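
The defining trait of ETL, transformation before the warehouse, can be sketched as a tiny in-process pipeline (the functions and the `warehouse` list are hypothetical stand-ins for real extract/load infrastructure):

```python
import csv
import io

# ETL sketch: rows are cleaned in the pipeline *before* loading,
# so only transformed data ever reaches the warehouse.
def extract(raw_csv: str) -> list[dict]:
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    # Drop rows missing a user_id and normalize emails prior to load
    return [
        {'user_id': r['user_id'], 'email': r['email'].strip().lower()}
        for r in rows if r['user_id']
    ]

def load(rows: list[dict], warehouse: list) -> None:
    warehouse.extend(rows)   # stand-in for an INSERT into the warehouse

warehouse: list = []
raw = "user_id,email\n1,  [email protected] \n,[email protected]\n"
load(transform(extract(raw)), warehouse)
print(warehouse)  # [{'user_id': '1', 'email': '[email protected]'}]
```

Note the trade-off the cons list describes: the dropped row is gone for good, because the raw extract was never persisted.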

ELT (Extract, Load, Transform)

Load raw data first, transform within the warehouse.

Pros:
  • Raw data always available for reprocessing
  • Leverage powerful warehouse compute
  • Version control transformations (e.g., dbt)
  • Faster time to insights
Cons:
  • Higher storage costs
  • Requires robust warehouse
  • More compute costs

Modern Recommendation: ELT with dbt

-- models/staging/stg_users.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'users') }}
),

cleaned AS (
    SELECT
        user_id,
        LOWER(TRIM(email)) AS email,
        created_at,
        updated_at
    FROM source
    WHERE user_id IS NOT NULL
)

SELECT * FROM cleaned

3. Data Quality Frameworks

Data quality is paramount. Implement these checks at every stage:

The Six Dimensions of Data Quality

  1. Completeness: No missing critical fields
  2. Validity: Data conforms to business rules
  3. Accuracy: Data correctly represents reality
  4. Consistency: Same data across systems matches
  5. Timeliness: Data is up-to-date
  6. Uniqueness: No unexpected duplicates
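
Three of these dimensions can be sketched as plain-Python checks before reaching for a framework (the helper names are illustrative; a production pipeline would use dbt tests or Great Expectations as shown below):

```python
# Minimal sketches of completeness, uniqueness, and validity checks.
def check_completeness(rows, field):
    # No missing critical fields
    return all(row.get(field) not in (None, '') for row in rows)

def check_uniqueness(rows, field):
    # No unexpected duplicates
    values = [row[field] for row in rows]
    return len(values) == len(set(values))

def check_validity(rows, field, predicate):
    # Values conform to a business rule
    return all(predicate(row[field]) for row in rows)

rows = [{'user_id': 1, 'age': 34}, {'user_id': 2, 'age': 28}]
print(check_completeness(rows, 'user_id'))                   # True
print(check_uniqueness(rows, 'user_id'))                     # True
print(check_validity(rows, 'age', lambda a: 0 <= a <= 120))  # True
```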

Implementing Data Quality Tests

# Example dbt tests in schema.yml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_total
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: order_date
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: day
              field: order_date
              interval: 7

Great Expectations Framework

# Python example with Great Expectations
import great_expectations as ge

df = ge.read_csv('data.csv')

# Expect column values to not be null
df.expect_column_values_to_not_be_null('user_id')

# Expect column values to be unique
df.expect_column_values_to_be_unique('email')

# Expect column values to match regex
df.expect_column_values_to_match_regex(
    'email',
    r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
)

# Expect column values to be between
df.expect_column_values_to_be_between('age', 0, 120)

4. Schema Design Principles

Star Schema (Recommended for Analytics)

Central fact table surrounded by dimension tables.

Benefits: Simple queries, fast aggregations, easy for BI tools
-- Fact Table
CREATE TABLE fct_sales (
    sale_id BIGINT PRIMARY KEY,
    date_key INT REFERENCES dim_date (date_key),
    customer_key INT REFERENCES dim_customer (customer_key),
    product_key INT REFERENCES dim_product (product_key),
    quantity INT,
    revenue DECIMAL(10,2),
    cost DECIMAL(10,2)
);

-- Dimension Tables
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR,
    customer_name VARCHAR,
    city VARCHAR,
    country VARCHAR
);

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id VARCHAR,
    product_name VARCHAR,
    category VARCHAR,
    subcategory VARCHAR
);

Slowly Changing Dimensions (SCD)

  • Type 1: Overwrite old values (no history)
  • Type 2: Add new row for each change (full history)
  • Type 3: Add columns for previous/current (limited history)
-- SCD Type 2 Example
CREATE TABLE dim_customer_scd2 (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR,
    customer_name VARCHAR,
    email VARCHAR,
    valid_from DATE,
    valid_to DATE,
    is_current BOOLEAN
);

Naming Conventions

  • Staging: stg_[source]_[entity] (e.g., stg_salesforce_accounts)
  • Intermediate: int_[description] (e.g., int_customer_orders)
  • Facts: fct_[business_process] (e.g., fct_orders)
  • Dimensions: dim_[entity] (e.g., dim_customers)
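
These conventions are easy to enforce mechanically. A hypothetical check, e.g. run as a CI hook over model file names (the patterns below are one possible reading of the conventions above):

```python
import re

# Hypothetical CI check: verify model names follow the layer
# prefixes described above (stg_, int_, fct_, dim_).
NAME_PATTERNS = {
    'staging':      re.compile(r'^stg_[a-z0-9]+_[a-z0-9_]+$'),
    'intermediate': re.compile(r'^int_[a-z0-9_]+$'),
    'fact':         re.compile(r'^fct_[a-z0-9_]+$'),
    'dimension':    re.compile(r'^dim_[a-z0-9_]+$'),
}

def valid_model_name(name: str) -> bool:
    return any(p.match(name) for p in NAME_PATTERNS.values())

print(valid_model_name('stg_salesforce_accounts'))  # True
print(valid_model_name('fct_orders'))               # True
print(valid_model_name('orders_final_v2'))          # False
```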

5. Orchestration Best Practices

Airflow Best Practices

  • Idempotency: Tasks should produce the same result when run multiple times
  • Incremental Loading: Process only new/changed data
  • Backfilling: Design for easy historical data reprocessing
  • Monitoring: Set up alerts for failures and SLA breaches
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'start_date': datetime(2024, 1, 1),  # required for scheduling
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['[email protected]'],
    'sla': timedelta(hours=2),
}

dag = DAG(
    'daily_etl',
    default_args=default_args,
    schedule_interval='0 2 * * *',
    catchup=False,      # Don't backfill automatically
    max_active_runs=1,  # Prevent parallel runs
)

Task Dependencies

# Clear dependency chains
extract >> transform >> load

# Fan-out / Fan-in pattern
extract >> [transform_a, transform_b, transform_c] >> merge >> load

# Conditional execution
extract >> branch_task
branch_task >> [path_a, path_b]

Monitoring & Alerting

  • Track task duration trends
  • Alert on data freshness
  • Monitor data quality metrics
  • Set SLAs for critical pipelines
  • Use Circuit Breakers for cascading failures
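
A circuit breaker for pipelines can be sketched as follows. This is a generic sketch of the pattern, not any specific library's API; the class name, thresholds, and error message are illustrative:

```python
import time

# Circuit-breaker sketch: after max_failures consecutive failures the
# breaker "opens" and downstream calls are skipped until reset_after
# seconds pass, so one failing source doesn't cascade through the DAG.
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=300):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError('circuit open: skipping call')
            # Half-open: enough time has passed, allow one trial call
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping calls to a flaky upstream API in `breaker.call(...)` turns repeated failures into fast, explicit skips that alerting can pick up, instead of a queue of slow retries.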