Data Engineering Best Practices

Comprehensive guide to building robust, scalable data pipelines

1. Data Pipeline Design Patterns

Effective data pipeline design is crucial for building scalable, maintainable data infrastructure. Here are the core patterns every data engineer should know:

Batch Processing Pattern

Best for processing large volumes of data at scheduled intervals.

When to use: Daily aggregations, nightly ETL jobs, historical data processing, monthly reports
-- Example: Daily batch aggregation
WITH daily_metrics AS (
    SELECT
        DATE(timestamp) AS date,
        user_id,
        COUNT(*) AS event_count,
        SUM(revenue) AS total_revenue
    FROM events
    WHERE DATE(timestamp) = CURRENT_DATE - 1
    GROUP BY 1, 2
)
SELECT * FROM daily_metrics;

Stream Processing Pattern

Real-time processing of data as it arrives.

When to use: Real-time analytics, fraud detection, live dashboards, event-driven architectures
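
The pattern can be sketched with a minimal event-at-a-time processor. This is a hypothetical in-memory stand-in for a real streaming framework (Kafka consumers, Flink, Spark Structured Streaming); the class and field names are illustrative:

```python
from collections import defaultdict

# Stream-processing sketch: each event is handled as it arrives,
# updating running aggregates instead of waiting for a batch window.
class StreamProcessor:
    def __init__(self, fraud_threshold=10_000):
        self.revenue_by_user = defaultdict(float)
        self.alerts = []
        self.fraud_threshold = fraud_threshold

    def process(self, event):
        # Update a running aggregate in real time
        self.revenue_by_user[event['user_id']] += event['revenue']
        # Emit an alert the moment a rule is violated (fraud detection)
        if event['revenue'] > self.fraud_threshold:
            self.alerts.append(event)

processor = StreamProcessor()
for event in [{'user_id': 'u1', 'revenue': 50.0},
              {'user_id': 'u1', 'revenue': 20_000.0}]:
    processor.process(event)

print(processor.revenue_by_user['u1'])  # 20050.0
print(len(processor.alerts))            # 1
```

The key contrast with the batch SQL above: aggregates are continuously current, at the cost of managing state in the processor.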

Lambda Architecture

Combines batch and stream processing for both real-time and historical analysis.

  • Batch Layer: Processes complete historical dataset
  • Speed Layer: Handles real-time data with low latency
  • Serving Layer: Merges results from both layers
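
A sketch of the serving layer's merge step, assuming the batch view holds complete counts up to the last batch run and the speed view holds increments since then (all names are illustrative):

```python
# Serving-layer sketch: combine precomputed batch results with
# real-time increments from the speed layer at query time.
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    merged = dict(batch_view)
    for key, realtime_count in speed_view.items():
        merged[key] = merged.get(key, 0) + realtime_count
    return merged

batch_view = {'page_a': 1000, 'page_b': 500}   # complete history
speed_view = {'page_a': 12, 'page_c': 3}       # since last batch run
print(merge_views(batch_view, speed_view))
# {'page_a': 1012, 'page_b': 500, 'page_c': 3}
```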

Medallion Architecture (Bronze/Silver/Gold)

Progressive data refinement through layers:

  • Bronze (Raw): Ingested data as-is, minimal transformation
  • Silver (Cleaned): Validated, deduplicated, standardized
  • Gold (Business): Aggregated, business-ready metrics
-- Bronze Layer: Raw ingestion
CREATE TABLE bronze.events AS
SELECT * FROM source_system;

-- Silver Layer: Cleaned data
CREATE TABLE silver.events AS
SELECT DISTINCT
    *,
    CAST(timestamp AS TIMESTAMP) AS event_time,
    LOWER(TRIM(user_email)) AS normalized_email
FROM bronze.events
-- Standard SQL can't reference the event_time alias in WHERE,
-- so repeat the expression here
WHERE CAST(timestamp AS TIMESTAMP) >= '2026-01-01';

-- Gold Layer: Business metrics
CREATE TABLE gold.daily_user_metrics AS
SELECT
    DATE(event_time) AS date,
    COUNT(DISTINCT user_id) AS dau,
    SUM(revenue) AS daily_revenue
FROM silver.events
GROUP BY 1;

2. ETL vs ELT Strategies

ETL (Extract, Transform, Load)

Transform data before loading into the warehouse.

Pros:
  • Reduces warehouse storage costs
  • Data arrives clean and ready
  • Better for limited compute resources
Cons:
  • Less flexibility for ad-hoc analysis
  • Transformation logic scattered
  • Harder to reprocess historical data
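
The defining trait of ETL, transformation before the warehouse, can be sketched as a tiny in-process pipeline (the functions and the `warehouse` list are hypothetical stand-ins for real extract/load infrastructure):

```python
import csv
import io

# ETL sketch: rows are cleaned in the pipeline *before* loading,
# so only transformed data ever reaches the warehouse.
def extract(raw_csv: str) -> list[dict]:
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    # Drop rows missing a user_id and normalize emails prior to load
    return [
        {'user_id': r['user_id'], 'email': r['email'].strip().lower()}
        for r in rows if r['user_id']
    ]

def load(rows: list[dict], warehouse: list) -> None:
    warehouse.extend(rows)   # stand-in for an INSERT into the warehouse

warehouse: list = []
raw = "user_id,email\n1,  [email protected] \n,[email protected]\n"
load(transform(extract(raw)), warehouse)
print(warehouse)  # [{'user_id': '1', 'email': '[email protected]'}]
```

Note the trade-off the cons list describes: the dropped row is gone for good, because the raw extract was never persisted.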

ELT (Extract, Load, Transform)

Load raw data first, transform within the warehouse.

Pros:
  • Raw data always available for reprocessing
  • Leverage powerful warehouse compute
  • Version control transformations (e.g., dbt)
  • Faster time to insights
Cons:
  • Higher storage costs
  • Requires robust warehouse
  • More compute costs

Modern Recommendation: ELT with dbt

-- models/staging/stg_users.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'users') }}
),

cleaned AS (
    SELECT
        user_id,
        LOWER(TRIM(email)) AS email,
        created_at,
        updated_at
    FROM source
    WHERE user_id IS NOT NULL
)

SELECT * FROM cleaned

3. Data Quality Frameworks

Data quality is paramount. Implement these checks at every stage:

The Six Dimensions of Data Quality

  1. Completeness: No missing critical fields
  2. Validity: Data conforms to business rules
  3. Accuracy: Data correctly represents reality
  4. Consistency: Same data across systems matches
  5. Timeliness: Data is up-to-date
  6. Uniqueness: No unexpected duplicates
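
Three of these dimensions can be sketched as plain-Python checks before reaching for a framework (the helper names are illustrative; a production pipeline would use dbt tests or Great Expectations as shown below):

```python
# Minimal sketches of completeness, uniqueness, and validity checks.
def check_completeness(rows, field):
    # No missing critical fields
    return all(row.get(field) not in (None, '') for row in rows)

def check_uniqueness(rows, field):
    # No unexpected duplicates
    values = [row[field] for row in rows]
    return len(values) == len(set(values))

def check_validity(rows, field, predicate):
    # Values conform to a business rule
    return all(predicate(row[field]) for row in rows)

rows = [{'user_id': 1, 'age': 34}, {'user_id': 2, 'age': 28}]
print(check_completeness(rows, 'user_id'))                   # True
print(check_uniqueness(rows, 'user_id'))                     # True
print(check_validity(rows, 'age', lambda a: 0 <= a <= 120))  # True
```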

Implementing Data Quality Tests

# Example dbt tests in schema.yml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_total
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: order_date
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: day
              field: order_date
              interval: 7

Great Expectations Framework

# Python example with Great Expectations
import great_expectations as ge

df = ge.read_csv('data.csv')

# Expect column values to not be null
df.expect_column_values_to_not_be_null('user_id')

# Expect column values to be unique
df.expect_column_values_to_be_unique('email')

# Expect column values to match regex
df.expect_column_values_to_match_regex(
    'email',
    r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
)

# Expect column values to be between
df.expect_column_values_to_be_between('age', 0, 120)

4. Schema Design Principles

Star Schema (Recommended for Analytics)

Central fact table surrounded by dimension tables.

Benefits: Simple queries, fast aggregations, easy for BI tools
-- Fact Table
CREATE TABLE fct_sales (
    sale_id BIGINT PRIMARY KEY,
    date_key INT REFERENCES dim_date (date_key),
    customer_key INT REFERENCES dim_customer (customer_key),
    product_key INT REFERENCES dim_product (product_key),
    quantity INT,
    revenue DECIMAL(10,2),
    cost DECIMAL(10,2)
);

-- Dimension Tables
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR,
    customer_name VARCHAR,
    city VARCHAR,
    country VARCHAR
);

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id VARCHAR,
    product_name VARCHAR,
    category VARCHAR,
    subcategory VARCHAR
);

Slowly Changing Dimensions (SCD)

  • Type 1: Overwrite old values (no history)
  • Type 2: Add new row for each change (full history)
  • Type 3: Add columns for previous/current (limited history)
-- SCD Type 2 Example
CREATE TABLE dim_customer_scd2 (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR,
    customer_name VARCHAR,
    email VARCHAR,
    valid_from DATE,
    valid_to DATE,
    is_current BOOLEAN
);

Naming Conventions

  • Staging: stg_[source]_[entity] (e.g., stg_salesforce_accounts)
  • Intermediate: int_[description] (e.g., int_customer_orders)
  • Facts: fct_[business_process] (e.g., fct_orders)
  • Dimensions: dim_[entity] (e.g., dim_customers)
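
These conventions are easy to enforce mechanically. A hypothetical check, e.g. run as a CI hook over model file names (the patterns below are one possible reading of the conventions above):

```python
import re

# Hypothetical CI check: verify model names follow the layer
# prefixes described above (stg_, int_, fct_, dim_).
NAME_PATTERNS = {
    'staging':      re.compile(r'^stg_[a-z0-9]+_[a-z0-9_]+$'),
    'intermediate': re.compile(r'^int_[a-z0-9_]+$'),
    'fact':         re.compile(r'^fct_[a-z0-9_]+$'),
    'dimension':    re.compile(r'^dim_[a-z0-9_]+$'),
}

def valid_model_name(name: str) -> bool:
    return any(p.match(name) for p in NAME_PATTERNS.values())

print(valid_model_name('stg_salesforce_accounts'))  # True
print(valid_model_name('fct_orders'))               # True
print(valid_model_name('orders_final_v2'))          # False
```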

5. Orchestration Best Practices

Airflow Best Practices

  • Idempotency: Tasks should produce the same result when run multiple times
  • Incremental Loading: Process only new/changed data
  • Backfilling: Design for easy historical data reprocessing
  • Monitoring: Set up alerts for failures and SLA breaches
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'start_date': datetime(2024, 1, 1),  # required for scheduling
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['[email protected]'],
    'sla': timedelta(hours=2),
}

dag = DAG(
    'daily_etl',
    default_args=default_args,
    schedule_interval='0 2 * * *',
    catchup=False,      # Don't backfill automatically
    max_active_runs=1,  # Prevent parallel runs
)

Task Dependencies

# Clear dependency chains
extract >> transform >> load

# Fan-out / Fan-in pattern
extract >> [transform_a, transform_b, transform_c] >> merge >> load

# Conditional execution
extract >> branch_task
branch_task >> [path_a, path_b]

Monitoring & Alerting

  • Track task duration trends
  • Alert on data freshness
  • Monitor data quality metrics
  • Set SLAs for critical pipelines
  • Use Circuit Breakers for cascading failures
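
A circuit breaker for pipelines can be sketched as follows. This is a generic sketch of the pattern, not any specific library's API; the class name, thresholds, and error message are illustrative:

```python
import time

# Circuit-breaker sketch: after max_failures consecutive failures the
# breaker "opens" and downstream calls are skipped until reset_after
# seconds pass, so one failing source doesn't cascade through the DAG.
class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=300):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError('circuit open: skipping call')
            # Half-open: enough time has passed, allow one trial call
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping calls to a flaky upstream API in `breaker.call(...)` turns repeated failures into fast, explicit skips that alerting can pick up, instead of a queue of slow retries.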