
When the Medallion Shines Too Bright

The medallion architecture has become the default pattern for data lakes, but it's not a silver bullet. Learn the real-world pitfalls teams encounter and how to avoid them.

Black Dog Labs Team
3/14/2024
14 min read
data-architecture, data-engineering, best-practices, medallion-architecture

When the Medallion Shines Too Bright: Ten Pitfalls of Bronze-Silver-Gold Architecture

The medallion architecture (Bronze → Silver → Gold) has become the de facto standard for organizing data lakes. It's elegant, intuitive, and backed by major players like Databricks. But like any architectural pattern, blindly applying it without understanding its trade-offs can lead to complexity, waste, and frustrated data teams.

Let's explore the real pitfalls teams encounter in production - and how to avoid them.

The medallion promise

First, let's acknowledge why the medallion architecture became popular:

| Layer | Purpose | Benefits |
|---|---|---|
| Bronze | Raw data ingestion | Complete audit trail, reprocessing capability |
| Silver | Cleaned, deduplicated data | Standardized schemas, quality enforcement |
| Gold | Business-ready aggregates | Optimized for consumption, denormalized |
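
In code, the happy path looks roughly like this - a minimal PySpark sketch with hypothetical paths, table names, and columns (each hop gets a closer look below):

from pyspark.sql.functions import col, current_timestamp, sum as sum_, to_date

# Bronze: land the raw data as-is, plus ingestion metadata
(
    spark.read.json("s3://source/orders/")
    .withColumn("_ingested_at", current_timestamp())
    .write.mode("append").saveAsTable("bronze.orders")
)

# Silver: cleanse, type, and deduplicate
(
    spark.read.table("bronze.orders")
    .filter(col("order_id").isNotNull())
    .withColumn("amount", col("amount").cast("decimal(10,2)"))
    .dropDuplicates(["order_id"])
    .write.mode("overwrite").saveAsTable("silver.orders")
)

# Gold: aggregate for consumption
(
    spark.read.table("silver.orders")
    .groupBy(to_date("order_timestamp").alias("order_date"))
    .agg(sum_("amount").alias("revenue"))
    .write.mode("overwrite").saveAsTable("gold.daily_revenue")
)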

This progression makes intuitive sense. Raw data flows through refinement stages until it's ready for business use. What could go wrong?

Bronze layer pitfalls

#1: Overloading the bronze layer

The problem: Teams dump everything into Bronze - raw data, semi-processed files, reprocessed corrections, backfills, and "emergency fixes."

# What Bronze becomes:
bronze/
    raw_events/           # Original ingestion
    raw_events_v2/        # "Fixed" version
    raw_events_backfill/  # Historical load
    raw_events_hotfix/    # Manual correction
# Which one is canonical?

The effect:

  • Hard to tell what's the source of truth
  • Duplicate ingestion and storage bloat
  • Consumers scan multiple directories, unsure which one to use
  • Pipeline logic becomes "which Bronze table do I read from?"

The fix: Enforce a clear policy:

# Bronze = raw + immutable + append-only
# One rule: If it has ANY transformation logic, it's not Bronze
from pyspark.sql.functions import current_timestamp, input_file_name

bronze_orders = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://source/orders/")
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://lake/checkpoints/bronze_orders")
    .trigger(availableNow=True)
    .toTable("bronze.orders")  # Append-only, immutable
)
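
Conventions erode, so it helps to back the policy with enforcement. If Bronze is stored as Delta tables, the delta.appendOnly table property rejects updates and deletes at write time - a small sketch against the table created above:

# Make "append-only" a table property, not just a team convention:
# UPDATE and DELETE operations against this table will be rejected.
spark.sql("""
    ALTER TABLE bronze.orders
    SET TBLPROPERTIES ('delta.appendOnly' = 'true')
""")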

#2: Schema drift and bad data

The problem: Source systems evolve - extra columns appear, fields get renamed, data types change. Bronze just stores whatever arrives, and Silver ETL jobs start breaking.

# Monday: Schema looks fine
bronze_users = spark.read.table("bronze.users")
# Columns: user_id, email, created_at

# Tuesday: Source system adds phone_number and renames created_at to created_ts

# Wednesday: Your Silver job fails
silver_users = bronze_users.select("user_id", "email", "created_at")
# AnalysisException: cannot resolve 'created_at' given input columns

The effect:

  • Pipelines fail mysteriously when source schemas change
  • Data quality issues propagate silently
  • Can't tell when schema evolved vs. when it broke

The fix: Use schema evolution and versioning:

# Option 1: Delta Lake schema evolution
bronze_users = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://schemas/users/")  # Track schema changes
    .option("cloudFiles.inferColumnTypes", "true")
    .load("s3://source/users/")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://lake/checkpoints/bronze_users")  # Required for streaming writes
    .option("mergeSchema", "true")  # Allow schema evolution
    .toTable("bronze.users")
)

# Option 2: Schema validation with fallback
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

expected_schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("email", StringType(), True),
    StructField("created_at", TimestampType(), True)
])

bronze_raw = spark.read.json("s3://source/users/")

# Check for schema drift
actual_columns = set(bronze_raw.columns)
expected_columns = {f.name for f in expected_schema.fields}
new_columns = actual_columns - expected_columns
if new_columns:
    log_schema_change(new_columns)  # Alert data team

# Store everything, extra columns preserved
bronze_raw.write.mode("append").saveAsTable("bronze.users")

#3: Storage cost explosion

The problem: Storing three copies of every byte adds up fast.

Let's do some math for a typical data platform that accumulates about 100 TB of new data per year:

| Approach | Layer | Retention | Size | Monthly Cost (S3) |
|---|---|---|---|---|
| Full Medallion | Bronze | 2 years | 200 TB | $4,600 |
| | Silver | 2 years | 200 TB | $4,600 |
| | Gold | 2 years | 200 TB | $4,600 |
| | Total | | 600 TB | $13,800/month |
| Optimized | Bronze | 90 days | 25 TB | $575 |
| | Silver | 1 year | 100 TB | $2,300 |
| | Gold | 2 years | 200 TB | $4,600 |
| | Total | | 325 TB | $7,475/month |

The "optimized" approach saves over $6,000/month (46% reduction) by keeping Gold data longer (where it matters for business) and Bronze shorter (where it's just insurance).

The reality: Bronze is often just insurance. Ask yourself:

  • How often do you actually reprocess from Bronze?
  • Could you re-ingest from the source system if needed?
  • Does keeping raw JSON for years serve any purpose?

Better approach: Implement tiered retention based on business value:

# Lifecycle policy example
lifecycle_rules = {
    "bronze": {
        "hot": "30 days",       # S3 Standard
        "warm": "60 days",      # S3 Infrequent Access
        "archive": "90 days",   # S3 Glacier
        "delete": "180 days"    # Gone
    },
    "silver": {
        "hot": "1 year",        # Query-optimized
        "archive": "3 years"    # Compliance
    },
    "gold": {
        "hot": "2 years",       # Business-critical, optimized for queries
        "no_deletion": True     # Keep indefinitely for historical analysis
    }
}
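
If Bronze lands as plain objects in S3 (rather than a managed table format), the same policy can be pushed down to the bucket itself. A minimal boto3 sketch, with a hypothetical bucket and prefix:

import boto3

s3 = boto3.client("s3")

# Transition Bronze objects through storage classes and expire them,
# mirroring the "bronze" rules above (hypothetical bucket/prefix).
s3.put_bucket_lifecycle_configuration(
    Bucket="lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "bronze-tiered-retention",
                "Filter": {"Prefix": "bronze/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 180},
            }
        ]
    },
)

For Delta or Iceberg tables, object-level expiration would corrupt the table, so retention there is managed at the table level instead (for example, dropping old partitions and vacuuming).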

Silver layer pitfalls

#4: Transformation bloat

The problem: Cramming all cleansing, enrichment, joins, aggregations, and validation into a single monster job.

# The 1,000-line Silver transformation job from hell
def create_silver_user_metrics(spark):
    # Step 1: Read from 5 Bronze sources
    # Step 2: Clean and standardize (200 lines)
    # Step 3: Join everything (150 lines)
    # Step 4: Enrich with external APIs (100 lines)
    # Step 5: Validate (50 lines)
    # Step 6: Aggregate (200 lines)
    # Step 7: Apply complex business logic (300 lines)
    # If ANY step fails, debug all 1,000 lines
    ...

The effect:

  • Complex DAGs with poor modularity
  • Long runtimes (can't optimize individual steps)
  • Difficult to test, debug, or parallelize
  • One small change requires full redeployment

The fix: Split transformations into logical stages:

# Bronze → Silver: Cleanse and standardize ONLY
from pyspark.sql.functions import col, lower, trim, sum

def bronze_to_silver_orders(spark):
    return (
        spark.read.table("bronze.orders")
        .filter(col("order_id").isNotNull())                         # Basic validation
        .withColumn("email", lower(trim(col("email"))))              # Standardize
        .withColumn("amount", col("amount").cast("decimal(10,2)"))   # Type safety
        .dropDuplicates(["order_id", "order_timestamp"])             # Dedupe
        .write.mode("overwrite").saveAsTable("silver.orders")
    )

# Silver → Silver: Enrichment (separate job)
def enrich_silver_orders(spark):
    orders = spark.read.table("silver.orders")
    customers = spark.read.table("silver.customers")
    return (
        orders.join(customers, "customer_id", "left")
        .write.mode("overwrite").saveAsTable("silver.orders_enriched")
    )

# Silver → Gold: Aggregate and serve (separate job)
def create_gold_daily_revenue(spark):
    return (
        spark.read.table("silver.orders_enriched")
        .groupBy("order_date", "region")
        .agg(sum("amount").alias("revenue"))
        .write.mode("overwrite").saveAsTable("gold.daily_revenue")
    )

#5: Inconsistent keys and deduplication

The problem: Different pipelines use different surrogate key strategies and deduplication logic.

# Pipeline A: Uses MD5 hash
silver_users_a = bronze.withColumn("user_key", md5(concat("email", "signup_date")))
# Pipeline B: Uses UUID
silver_users_b = bronze.withColumn("user_key", expr("uuid()"))
# Pipeline C: Uses source system ID
silver_users_c = bronze.withColumn("user_key", col("source_user_id"))
# Now joins between Silver tables are inconsistent
# Customer support: "Why does user X have 3 different IDs?"

The effect:

  • Joins behave inconsistently
  • Upserts fail silently
  • Duplicate records proliferate
  • Data quality reports are unreliable

The fix: Centralize key generation in Silver:

# Create a canonical key management module
from pyspark.sql.functions import col, concat_ws, lit, sha2

def generate_canonical_key(source_system, natural_keys):
    """
    Centralized key generation for all Silver tables.

    Args:
        source_system: Source system name (e.g., 'salesforce', 'mysql')
        natural_keys: Business keys from the source (list of Column objects)
    """
    key_columns = [lit(source_system)] + natural_keys
    return sha2(concat_ws("||", *key_columns), 256)

# Use consistently across all Silver tables
silver_customers = (
    spark.read.table("bronze.customers")
    .withColumn("customer_key",
        generate_canonical_key("salesforce", [col("account_id")]))
)

silver_orders = (
    spark.read.table("bronze.orders")
    .withColumn("order_key",
        generate_canonical_key("shopify", [col("order_id")]))
    .withColumn("customer_key",  # Same key strategy, so joins line up
        generate_canonical_key("salesforce", [col("customer_account_id")]))
)

# Document key strategy in data catalog
"""
Silver Key Standards:
- All Silver tables have a {entity}_key column
- Format: SHA256(source_system || natural_key_1 || natural_key_2 || ...)
- Deduplication: Always use row_number() over (partition by {entity}_key order by _updated_at desc)
"""

Gold layer pitfalls

#6: Business logic drift

The problem: Analysts encode business logic (KPIs, dimensions) differently across Gold tables and dashboards.

# Data Engineer's definition of "Active User"
gold_active_users = users.filter("last_login > current_date() - 30")
# Analyst A's definition in Tableau
# "Active User" = login in last 7 days
# Analyst B's definition in Python
# "Active User" = made a purchase in last 14 days
# CFO asks: "How many active users do we have?"
# Three different answers, three different people, same meeting

The effect:

  • "One metric, five definitions" chaos
  • Stakeholder confusion and mistrust
  • Time wasted reconciling different numbers
  • Compliance risks (especially for financial metrics)

The fix: Centralize metrics in a semantic layer:

# Option 1: [dbt metrics](https://docs.getdbt.com/docs/build/metrics)
# models/metrics.yml
metrics:
  - name: active_users
    label: Active Users (30-day)
    model: ref('gold_users')
    calculation_method: count_distinct
    expression: user_id
    timestamp: activity_date
    time_grains: [day, week, month]
    dimensions:
      - region
      - user_type
    filters:
      - field: last_activity_date
        operator: '>'
        value: "current_date - interval '30 days'"
    meta:
      owner: data-team@company.com
      definition: "Users with any activity in the last 30 days"

-- Option 2: Build a metrics layer in Gold
CREATE OR REPLACE VIEW gold.metrics_active_users AS
WITH user_activity AS (
    SELECT
        user_id,
        MAX(activity_date) AS last_activity_date
    FROM silver.user_events
    GROUP BY user_id
),
active_users AS (
    SELECT
        COUNT(DISTINCT user_id) AS active_user_count,
        CURRENT_DATE() AS metric_date
    FROM user_activity
    WHERE last_activity_date > CURRENT_DATE() - INTERVAL '30 days'
)
SELECT * FROM active_users;

-- Document the metric definition
COMMENT ON VIEW gold.metrics_active_users IS
'Official definition of Active Users: Users with any activity in last 30 days.
Owner: Data Team (data-team@company.com)
Updated: Daily at 6 AM UTC
Dependencies: silver.user_events
Do not create alternate definitions - use this view as source of truth.';

#7: Materialization explosion

The problem: Every analyst or team builds their own "Gold table" for their specific use case.

# After 6 months in production:
gold.sales_dashboard_v1
gold.sales_dashboard_v2
gold.sales_dashboard_final
gold.sales_dashboard_actually_final
gold.sales_team_metrics
gold.sales_executive_summary
gold.sales_daily # Updated daily
gold.sales_hourly # Updated hourly but has same data as sales_daily
gold.marketing_sales # Duplicate of sales_team_metrics with 2 extra columns
# ...100+ more Gold tables

The effect:

  • Massive storage bloat
  • Inconsistent refresh schedules
  • Redundant compute
  • Confusion about which table to use
  • Maintenance nightmare

The fix: Govern the Gold layer with data contracts:

# Require approval for new Gold tables via PR process
# gold_table_registry.yaml
gold_tables:
  - name: gold.daily_revenue
    owner: finance_team
    refresh: daily_6am
    consumers:
      - finance_dashboard
      - executive_report
    retention: 2_years
    approved_by: data_governance_team
    created_date: 2024-01-15
  - name: gold.user_segments
    owner: marketing_team
    refresh: weekly_monday
    consumers:
      - email_campaigns
      - marketing_dashboard
    retention: 1_year
    approved_by: data_governance_team
    created_date: 2024-02-01

# Automated check in CI/CD
def validate_gold_table_request(table_name, definition):
    registry = load_registry("gold_table_registry.yaml")

    # Check if a similar table already exists
    similar_tables = find_similar_tables(table_name, registry)
    if similar_tables:
        raise Exception(
            f"Similar Gold tables already exist: {similar_tables}\n"
            f"Consider reusing or extending an existing table instead."
        )

    # Require approval
    if not has_governance_approval(table_name):
        raise Exception(
            f"Gold table {table_name} requires data governance approval.\n"
            f"Submit request via: https://data-governance.company.com/request"
        )

Operational and cross-cutting pitfalls

#8: Lack of incremental processing

The problem: Full refreshes of Silver/Gold tables even when only 1% of data changes.

# Every night: Full refresh of a 10 TB table
silver_orders = (
    spark.read.table("bronze.orders")  # Read ALL Bronze data
    .transform(clean_and_dedupe)
    .write
    .mode("overwrite")                 # Rewrite the entire table
    .saveAsTable("silver.orders")
)
# Cost: $500/night in compute
# Time: 4 hours
# Actual new data: 50 GB (0.5%)

The effect:

  • High compute costs
  • Long runtimes
  • Job failures (can't finish in batch window)
  • Unnecessary load on source systems

The fix: Implement incremental processing:

# CDC pattern with merge
from delta.tables import DeltaTable
from pyspark.sql.functions import col

def incremental_bronze_to_silver(table_name, watermark_column="_ingested_at"):
    # Get the last processed watermark
    last_watermark = (
        spark.read.table(f"silver.{table_name}")
        .selectExpr(f"MAX({watermark_column}) AS max_watermark")
        .collect()[0]["max_watermark"]
    )
    if last_watermark is None:
        last_watermark = "1970-01-01"  # First run: Silver is empty, process everything

    # Read only new Bronze data
    new_bronze_data = (
        spark.read.table(f"bronze.{table_name}")
        .filter(col(watermark_column) > last_watermark)  # Incremental
        .transform(clean_and_dedupe)
    )

    # Merge into Silver
    silver_table = DeltaTable.forName(spark, f"silver.{table_name}")
    (
        silver_table.alias("target")
        .merge(new_bronze_data.alias("source"), "target.id = source.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# Cost: $10/night (50 GB incremental)
# Time: 15 minutes
# Same result as a full refresh

#9: No late-arriving data handling

The problem: Data arriving out-of-order or re-sent isn't handled properly.

# Monday: Process orders for March 15
gold_daily_revenue = orders.filter("order_date = '2024-03-15'").agg(sum("amount"))
# Result: $100,000
# Wednesday: Late orders for March 15 arrive
# But your job only processes "new" data from Wednesday forward
# March 15 revenue is still $100,000 (wrong!)
# Actual March 15 revenue: $105,000

The effect:

  • Downstream aggregates become stale or incorrect
  • Historical data is wrong
  • Reports show different numbers depending on when they ran
  • Audits fail

The fix: Implement watermarks and idempotent merges:

# Set a watermark for late-arriving data (e.g., allow up to 7 days late)
from datetime import timedelta
from delta.tables import DeltaTable
from pyspark.sql.functions import col, count, current_timestamp, sum

LATE_ARRIVAL_WINDOW_DAYS = 7

def process_with_late_arrival_handling(business_date):
    # Reprocess window: business date minus the late-arrival window
    reprocess_start = business_date - timedelta(days=LATE_ARRIVAL_WINDOW_DAYS)

    # Read Silver data for the reprocessing window
    orders_to_process = (
        spark.read.table("silver.orders")
        .filter(
            (col("order_date") >= reprocess_start) &
            (col("order_date") <= business_date)
        )
    )

    # Aggregate by business date
    daily_revenue = (
        orders_to_process
        .groupBy("order_date", "region")
        .agg(
            sum("amount").alias("revenue"),
            count("order_id").alias("order_count")
        )
        .withColumn("_calculated_at", current_timestamp())
    )

    # Merge into Gold (idempotent - safe to rerun)
    gold_table = DeltaTable.forName(spark, "gold.daily_revenue")
    (
        gold_table.alias("target")
        .merge(
            daily_revenue.alias("source"),
            "target.order_date = source.order_date AND target.region = source.region"
        )
        .whenMatchedUpdate(set={
            "revenue": "source.revenue",
            "order_count": "source.order_count",
            "_calculated_at": "source._calculated_at"
        })
        .whenNotMatchedInsert(values={
            "order_date": "source.order_date",
            "region": "source.region",
            "revenue": "source.revenue",
            "order_count": "source.order_count",
            "_calculated_at": "source._calculated_at"
        })
        .execute()
    )

# Now late-arriving data for March 15 updates the existing record
# Idempotent: running twice produces the same result
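
Because the merge is idempotent, the same function covers both the nightly run and catch-up after an outage - a short usage sketch (the number of missed days is arbitrary):

from datetime import date, timedelta

# Nightly run: recompute the trailing 7-day window ending today
process_with_late_arrival_handling(date.today())

# Catching up after an outage: replay each missed day. Overlapping windows
# converge to the same final numbers because the merge is idempotent.
for days_ago in range(5, 0, -1):
    process_with_late_arrival_handling(date.today() - timedelta(days=days_ago))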

#10: Weak data quality contracts

The problem: ETL assumes "Bronze is fine" - but it's not. Nulls, bad joins, and wrong types propagate silently.

# Silver job assumes Bronze is clean
silver_orders = (
    bronze_orders
    .join(bronze_customers, "customer_id")  # Assumes customer_id exists
    .selectExpr("order_id", "customer_name", "CAST(amount AS DOUBLE)")  # Assumes amount is numeric
)
# Reality:
# - 15% of bronze_orders have null customer_id → Join silently drops them
# - 2% of amounts are "N/A" → Cast returns null
# - Nobody notices until the CEO asks why revenue is down 17%

The effect:

  • Silent data loss
  • Incorrect aggregations
  • Broken dashboards discovered in stakeholder meetings
  • Loss of trust in data platform

The fix: Implement expectations at every layer boundary:

# Option 1: [Great Expectations](https://greatexpectations.io/)
# (Sketch using the legacy SparkDFDataset API; newer GX releases use
# checkpoints and validation definitions, but the idea is the same.)
from great_expectations.dataset import SparkDFDataset

def validate_bronze_orders(df):
    ge_df = SparkDFDataset(df)

    # Define expectations
    ge_df.expect_column_to_exist("order_id")
    ge_df.expect_column_values_to_not_be_null("order_id")
    ge_df.expect_column_values_to_be_unique("order_id")
    ge_df.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)
    ge_df.expect_column_values_to_not_be_null("customer_id", mostly=0.95)  # Allow 5% null

    # Validate
    results = ge_df.validate()
    if not results.success:
        alert_data_quality_failure(results)
        raise DataQualityException(f"Bronze validation failed: {results}")
    return df

# Option 2: dbt tests
# models/silver/orders.sql
{{
  config(
    materialized='table'
  )
}}

SELECT * FROM {{ ref('bronze_orders') }}

-- tests/silver/orders_tests.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_customers')
              field: customer_id

# Option 3: [Delta table constraints](https://docs.delta.io/latest/delta-constraints.html)
spark.sql("ALTER TABLE silver.orders ADD CONSTRAINT valid_amount CHECK (amount >= 0)")
spark.sql("ALTER TABLE silver.orders ADD CONSTRAINT valid_order_id CHECK (order_id IS NOT NULL)")
# CHECK and NOT NULL constraint violations are rejected automatically at write time.
# (Foreign key constraints exist only as informational metadata on Databricks
# Unity Catalog and are not enforced, so referential checks like
# customer_id REFERENCES silver.customers still belong in a test suite.)

When medallion architecture works best

Despite these pitfalls, medallion architecture is excellent for:

| Use Case | Why It Works |
|---|---|
| Complex, messy source data | Bronze provides raw backup; Silver does the heavy lifting |
| Multiple consumers with different needs | Gold layer can serve many use cases from one Silver layer |
| Regulated industries | Bronze provides an audit trail; clear transformation lineage |
| Machine learning pipelines | Silver provides clean features; Gold provides labeled datasets |
| Data with complex business rules | Silver enforces rules; Gold optimizes for access patterns |

Decision framework: Do you need all three layers?

| Consideration | Bronze → Silver → Gold | Bronze → Gold | Silver → Gold |
|---|---|---|---|
| Source data quality | Messy, evolving schema | Clean, structured | Already in lake |
| Transformation complexity | High (5+ steps) | Low (1-2 steps) | Medium |
| Reprocessing needs | Frequent | Rare | N/A |
| Compliance/audit | Required | Not required | Partial |
| Multiple consumers | Yes (5+) | Single use case | Multiple |
| Latency requirements | Batch (hours) | Real-time | Mixed |
| Storage budget | Generous | Constrained | Constrained |

The medallion architecture is a powerful pattern - but it's a starting point, not a mandate. The best data architectures are:

  1. Purpose-driven: Each layer solves a specific problem
  2. Cost-conscious: Don't store or compute "just in case"
  3. Quality-first: Test and validate at every boundary
  4. Incremental: Process only what changed
  5. Governed: Control Gold layer sprawl
  6. Observable: Track lineage and dependencies
  7. Evolvable: Handle schema changes gracefully

Remember: Bronze, Silver, and Gold are colors, not commandments. Your architecture should serve your business requirements, not the other way around.

The best data engineers know when to follow the pattern - and when to break it.

Struggling with data architecture decisions or dealing with medallion architecture pitfalls? Our team at Black Dog Labs helps companies design pragmatic, cost-effective data platforms that balance best practices with real-world constraints. Contact us to discuss your challenges.