When the Medallion Shines Too Bright: Ten Pitfalls of Bronze-Silver-Gold Architecture
The medallion architecture (Bronze → Silver → Gold) has become the de facto standard for organizing data lakes. It's elegant, intuitive, and backed by major players like Databricks. But like any architectural pattern, blindly applying it without understanding its trade-offs can lead to complexity, waste, and frustrated data teams.
Let's explore the real pitfalls teams encounter in production - and how to avoid them.
The medallion promise
First, let's acknowledge why the medallion architecture became popular:
Layer | Purpose | Benefits |
---|---|---|
Bronze | Raw data ingestion | Complete audit trail, reprocessing capability |
Silver | Cleaned, deduplicated data | Standardized schemas, quality enforcement |
Gold | Business-ready aggregates | Optimized for consumption, denormalized |
This progression makes intuitive sense. Raw data flows through refinement stages until it's ready for business use. What could go wrong?
Bronze layer pitfalls
#1: Overloading the Bronze layer
The problem: Teams dump everything into Bronze - raw data, semi-processed files, reprocessed corrections, backfills, and "emergency fixes."
```
# What Bronze becomes:
bronze/
  raw_events/           # Original ingestion
  raw_events_v2/        # "Fixed" version
  raw_events_backfill/  # Historical load
  raw_events_hotfix/    # Manual correction
# Which one is canonical?
```
The effect:
- No clear source of truth
- Duplicate ingestion and storage bloat
- Queries scan multiple directories with no clarity about which one to use
- Pipeline logic devolves into "which Bronze table do I read from?"
The fix: Enforce a clear policy:
```python
# Bronze = raw + immutable + append-only
# One rule: if it has ANY transformation logic, it's not Bronze
from pyspark.sql.functions import current_timestamp, input_file_name

bronze_orders = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://source/orders/")
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://lake/checkpoints/bronze_orders")
    .trigger(availableNow=True)
    .toTable("bronze.orders")  # Append-only, immutable
)
```
#2: Schema drift and bad data
The problem: Source systems evolve - extra columns appear, fields get renamed, data types change. Bronze just stores whatever arrives, and Silver ETL jobs start breaking.
```python
# Monday: Schema looks fine
bronze_users = spark.read.table("bronze.users")
# Columns: user_id, email, created_at

# Tuesday: Source system adds phone_number and ingestion picks it up
# Wednesday: Your Silver job breaks
silver_users = bronze_users.select("user_id", "email", "created_at")
silver_users.write.mode("append").saveAsTable("silver.users")
# Hard-coded selects silently drop phone_number; jobs that append the full
# Bronze schema instead fail with a Delta schema mismatch error
```
The effect:
- Pipelines fail mysteriously when source schemas change
- Data quality issues propagate silently
- Can't tell when schema evolved vs. when it broke
The fix: Use schema evolution and versioning:
```python
# Option 1: Delta Lake schema evolution with Auto Loader
bronze_users = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://schemas/users/")  # Track schema changes
    .option("cloudFiles.inferColumnTypes", "true")
    .load("s3://source/users/")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://lake/checkpoints/bronze_users")
    .option("mergeSchema", "true")  # Allow schema evolution
    .toTable("bronze.users")
)

# Option 2: Schema validation with fallback
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

expected_schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("email", StringType(), True),
    StructField("created_at", TimestampType(), True),
])

bronze_raw = spark.read.json("s3://source/users/")

# Check for schema drift
actual_columns = set(bronze_raw.columns)
expected_columns = {f.name for f in expected_schema.fields}
new_columns = actual_columns - expected_columns

if new_columns:
    log_schema_change(new_columns)  # Alert the data team (your own helper)

# Store everything; extra columns are preserved
bronze_raw.write.mode("append").option("mergeSchema", "true").saveAsTable("bronze.users")
```
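On Databricks, Auto Loader also offers a middle ground between failing and silently evolving the schema: rescue mode. The sketch below is a variant of Option 1 under the same hypothetical paths; it routes unexpected fields into the `_rescued_data` column instead of breaking the stream.

```python
bronze_users = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://schemas/users/")
    .option("cloudFiles.schemaEvolutionMode", "rescue")  # unexpected fields land in _rescued_data
    .load("s3://source/users/")
)
# Downstream jobs can inspect _rescued_data and decide whether a new field
# deserves promotion to a real column
```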
#3: Storage cost explosion
The problem: Three copies of every byte add up fast.
Let's do some math for a platform that accumulates roughly 100 TB of new data per year:
| Approach | Layer | Retention | Size | Monthly Cost (S3) |
|---|---|---|---|---|
| Full Medallion | Bronze | 2 years | 200 TB | $4,600 |
| | Silver | 2 years | 200 TB | $4,600 |
| | Gold | 2 years | 200 TB | $4,600 |
| | Total | | 600 TB | $13,800 |
| Optimized | Bronze | 90 days | 25 TB | $575 |
| | Silver | 1 year | 100 TB | $2,300 |
| | Gold | 2 years | 200 TB | $4,600 |
| | Total | | 325 TB | $7,475 |
The "optimized" approach saves over $6,000/month (46% reduction) by keeping Gold data longer (where it matters for business) and Bronze shorter (where it's just insurance).
The reality: Bronze is often just insurance. Ask yourself:
- How often do you actually reprocess from Bronze?
- Could you re-ingest from the source system if needed?
- Does keeping raw JSON for years serve any purpose?
Better approach: Implement tiered retention based on business value:
```python
# Lifecycle policy example
lifecycle_rules = {
    "bronze": {
        "hot": "30 days",      # S3 Standard
        "warm": "60 days",     # S3 Infrequent Access
        "archive": "90 days",  # S3 Glacier
        "delete": "180 days",  # Gone
    },
    "silver": {
        "hot": "1 year",       # Query-optimized
        "archive": "3 years",  # Compliance
    },
    "gold": {
        "hot": "2 years",      # Business-critical, optimized for queries
        "no_deletion": True,   # Keep indefinitely for historical analysis
    },
}
```
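On AWS, the Bronze schedule above maps directly onto an S3 lifecycle configuration. A minimal sketch using boto3; the bucket name and prefix are hypothetical, and the transition days follow the policy dictionary above.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",                 # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "bronze-tiered-retention",
                "Filter": {"Prefix": "bronze/"},  # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                    {"Days": 90, "StorageClass": "GLACIER"},      # archive
                ],
                "Expiration": {"Days": 180},                      # delete
            }
        ]
    },
)
```

Silver and Gold buckets would get their own, less aggressive rules.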
Silver layer pitfalls
#4: Transformation bloat
The problem: Cramming all cleansing, enrichment, joins, aggregations, and validation into a single monster job.
```python
# The 1000-line Silver transformation job from hell
def create_silver_user_metrics(spark):
    # Step 1: Read from 5 Bronze sources
    # Step 2: Clean and standardize (200 lines)
    # Step 3: Join everything (150 lines)
    # Step 4: Enrich with external APIs (100 lines)
    # Step 5: Validate (50 lines)
    # Step 6: Aggregate (200 lines)
    # Step 7: Apply complex business logic (300 lines)
    # If ANY step fails, debug all 1000 lines
    ...
```
The effect:
- Complex DAGs with poor modularity
- Long runtimes (can't optimize individual steps)
- Difficult to test, debug, or parallelize
- One small change requires full redeployment
The fix: Split transformations into logical stages:
```python
from pyspark.sql.functions import col, lower, trim, sum as sum_

# Bronze → Silver: Cleanse and standardize ONLY
def bronze_to_silver_orders(spark):
    return (
        spark.read.table("bronze.orders")
        .filter(col("order_id").isNotNull())                         # Basic validation
        .withColumn("email", lower(trim(col("email"))))              # Standardize
        .withColumn("amount", col("amount").cast("decimal(10,2)"))   # Type safety
        .dropDuplicates(["order_id", "order_timestamp"])             # Dedupe
        .write.mode("overwrite").saveAsTable("silver.orders")
    )

# Silver → Silver: Enrichment (separate job)
def enrich_silver_orders(spark):
    orders = spark.read.table("silver.orders")
    customers = spark.read.table("silver.customers")
    return (
        orders.join(customers, "customer_id", "left")
        .write.mode("overwrite").saveAsTable("silver.orders_enriched")
    )

# Silver → Gold: Aggregate and serve (separate job)
def create_gold_daily_revenue(spark):
    return (
        spark.read.table("silver.orders_enriched")
        .groupBy("order_date", "region")
        .agg(sum_("amount").alias("revenue"))
        .write.mode("overwrite").saveAsTable("gold.daily_revenue")
    )
```
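Each stage can now be tested and rerun on its own. A minimal sketch of chaining them; the wrapper function is hypothetical, and in practice each stage would typically be a separate task in your orchestrator.

```python
def run_orders_pipeline(spark):
    """Hypothetical wrapper - in production each call is its own scheduled task."""
    bronze_to_silver_orders(spark)    # cleanse + standardize
    enrich_silver_orders(spark)       # join reference data
    create_gold_daily_revenue(spark)  # aggregate for consumption
```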
#5: Inconsistent keys and deduplication
The problem: Different pipelines use different surrogate key strategies and deduplication logic.
```python
# Pipeline A: uses an MD5 hash
silver_users_a = bronze.withColumn("user_key", md5(concat("email", "signup_date")))

# Pipeline B: uses a UUID
silver_users_b = bronze.withColumn("user_key", expr("uuid()"))

# Pipeline C: uses the source system ID
silver_users_c = bronze.withColumn("user_key", col("source_user_id"))

# Now joins between Silver tables are inconsistent
# Customer support: "Why does user X have 3 different IDs?"
```
The effect:
- Joins behave inconsistently
- Upserts fail silently
- Duplicate records proliferate
- Data quality reports are unreliable
The fix: Centralize key generation in Silver:
```python
from pyspark.sql.functions import col, concat_ws, lit, sha2

# Create a canonical key management module
def generate_canonical_key(source_system, natural_keys):
    """
    Centralized key generation for all Silver tables.

    Args:
        source_system: Source system name (e.g., 'salesforce', 'mysql')
        natural_keys: Business keys from the source (list of Column expressions)
    """
    key_columns = [lit(source_system)] + natural_keys
    return sha2(concat_ws("||", *key_columns), 256)

# Use consistently across all Silver tables
silver_customers = (
    spark.read.table("bronze.customers")
    .withColumn("customer_key",
        generate_canonical_key("salesforce", [col("account_id")]))
)

silver_orders = (
    spark.read.table("bronze.orders")
    .withColumn("order_key",
        generate_canonical_key("shopify", [col("order_id")]))
    .withColumn("customer_key",  # Same key strategy, so joins line up
        generate_canonical_key("salesforce", [col("customer_account_id")]))
)

# Document the key strategy in your data catalog:
#   Silver Key Standards
#   - Every Silver table has an {entity}_key column
#   - Format: SHA256(source_system || natural_key_1 || natural_key_2 || ...)
#   - Deduplication: row_number() over (partition by {entity}_key order by _updated_at desc)
```
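The deduplication rule in that standard translates into a small reusable helper. A sketch, assuming every Silver table carries an `_updated_at` column:

```python
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

def deduplicate_on_key(df, key_column, order_column="_updated_at"):
    """Keep only the most recent row per canonical key."""
    window = Window.partitionBy(key_column).orderBy(col(order_column).desc())
    return (
        df.withColumn("_rn", row_number().over(window))
        .filter(col("_rn") == 1)
        .drop("_rn")
    )

# Usage:
# silver_customers = deduplicate_on_key(silver_customers, "customer_key")
```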
Gold layer pitfalls
#6: Business logic drift
The problem: Analysts encode business logic (KPIs, dimensions) differently across Gold tables and dashboards.
```python
# The Data Engineer's definition of "Active User"
gold_active_users = users.filter("last_login > current_date() - 30")

# Analyst A's definition in Tableau:
#   "Active User" = login in the last 7 days

# Analyst B's definition in Python:
#   "Active User" = made a purchase in the last 14 days

# CFO asks: "How many active users do we have?"
# Three different answers, three different people, same meeting
```
The effect:
- "One metric, five definitions" chaos
- Stakeholder confusion and mistrust
- Time wasted reconciling different numbers
- Compliance risks (especially for financial metrics)
The fix: Centralize metrics in a semantic layer:
```yaml
# Option 1: dbt metrics (https://docs.getdbt.com/docs/build/metrics)
# models/metrics.yml
metrics:
  - name: active_users
    label: Active Users (30-day)
    model: ref('gold_users')
    calculation_method: count_distinct
    expression: user_id
    timestamp: activity_date
    time_grains: [day, week, month]
    dimensions:
      - region
      - user_type
    filters:
      - field: last_activity_date
        operator: '>'
        value: "current_date - interval '30 days'"
    meta:
      owner: data-team@company.com
      definition: "Users with any activity in the last 30 days"
```

```sql
-- Option 2: Build a metrics layer in Gold
CREATE OR REPLACE VIEW gold.metrics_active_users AS
WITH user_activity AS (
    SELECT
        user_id,
        MAX(activity_date) AS last_activity_date
    FROM silver.user_events
    GROUP BY user_id
),
active_users AS (
    SELECT
        COUNT(DISTINCT user_id) AS active_user_count,
        CURRENT_DATE() AS metric_date
    FROM user_activity
    WHERE last_activity_date > CURRENT_DATE() - INTERVAL '30 days'
)
SELECT * FROM active_users;

-- Document the metric definition
COMMENT ON VIEW gold.metrics_active_users IS
'Official definition of Active Users: users with any activity in the last 30 days.
Owner: Data Team (data-team@company.com)
Updated: Daily at 6 AM UTC
Dependencies: silver.user_events
Do not create alternate definitions - use this view as the source of truth.';
```
#7: Materialization explosion
The problem: Every analyst or team builds their own "Gold table" for their specific use case.
```
# After 6 months in production:
gold.sales_dashboard_v1
gold.sales_dashboard_v2
gold.sales_dashboard_final
gold.sales_dashboard_actually_final
gold.sales_team_metrics
gold.sales_executive_summary
gold.sales_daily      # Updated daily
gold.sales_hourly     # Updated hourly but has the same data as sales_daily
gold.marketing_sales  # Duplicate of sales_team_metrics with 2 extra columns
# ...100+ more Gold tables
```
The effect:
- Massive storage bloat
- Inconsistent refresh schedules
- Redundant compute
- Confusion about which table to use
- Maintenance nightmare
The fix: Govern the Gold layer with data contracts:
```yaml
# Require approval for new Gold tables via a PR process
# gold_table_registry.yaml
gold_tables:
  - name: gold.daily_revenue
    owner: finance_team
    refresh: daily_6am
    consumers:
      - finance_dashboard
      - executive_report
    retention: 2_years
    approved_by: data_governance_team
    created_date: 2024-01-15

  - name: gold.user_segments
    owner: marketing_team
    refresh: weekly_monday
    consumers:
      - email_campaigns
      - marketing_dashboard
    retention: 1_year
    approved_by: data_governance_team
    created_date: 2024-02-01
```

```python
# Automated check in CI/CD (load_registry, find_similar_tables, and
# has_governance_approval are helpers you own)
def validate_gold_table_request(table_name, definition):
    registry = load_registry("gold_table_registry.yaml")

    # Check whether a similar table already exists
    similar_tables = find_similar_tables(table_name, registry)
    if similar_tables:
        raise Exception(
            f"Similar Gold tables already exist: {similar_tables}\n"
            f"Consider reusing or extending an existing table instead."
        )

    # Require approval
    if not has_governance_approval(table_name):
        raise Exception(
            f"Gold table {table_name} requires data governance approval.\n"
            f"Submit a request via: https://data-governance.company.com/request"
        )
```
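The `find_similar_tables` helper doesn't need to be sophisticated; a name-similarity check already catches most copy-paste tables. A minimal, purely illustrative sketch:

```python
import difflib

def find_similar_tables(table_name, registry, cutoff=0.8):
    """Return registered Gold tables whose names closely resemble the requested one."""
    existing_names = [t["name"] for t in registry["gold_tables"]]
    return difflib.get_close_matches(table_name, existing_names, n=5, cutoff=cutoff)
```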
Operational and cross-cutting pitfalls
#8: Lack of incremental processing
The problem: Full refreshes of Silver/Gold tables even when only 1% of data changes.
```python
# Every night: full refresh of a 10 TB table
silver_orders = (
    spark.read.table("bronze.orders")   # Read ALL Bronze data
    .transform(clean_and_dedupe)
    .write
    .mode("overwrite")                  # Rewrite the entire table
    .saveAsTable("silver.orders")
)

# Cost: $500/night in compute
# Time: 4 hours
# Actual new data: 50 GB (0.5%)
```
The effect:
- High compute costs
- Long runtimes
- Job failures (can't finish in batch window)
- Unnecessary load on source systems
The fix: Implement incremental processing:
```python
# CDC-style incremental load with MERGE
from delta.tables import DeltaTable
from pyspark.sql.functions import col

def incremental_bronze_to_silver(table_name, watermark_column="_ingested_at"):
    # Get the last processed watermark
    last_watermark = (
        spark.read.table(f"silver.{table_name}")
        .selectExpr(f"MAX({watermark_column}) AS max_watermark")
        .collect()[0]["max_watermark"]
    )

    # Read only new Bronze data
    new_bronze_data = (
        spark.read.table(f"bronze.{table_name}")
        .filter(col(watermark_column) > last_watermark)  # Incremental
        .transform(clean_and_dedupe)                     # Your own cleansing helper
    )

    # Merge into Silver
    silver_table = DeltaTable.forName(spark, f"silver.{table_name}")
    (
        silver_table.alias("target")
        .merge(new_bronze_data.alias("source"), "target.id = source.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# Cost: $10/night (50 GB incremental)
# Time: 15 minutes
# Same result as the full refresh
```
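If Bronze is a Delta table, another option is to let Delta track the changes for you via Change Data Feed instead of filtering on an ingestion timestamp. A sketch, assuming `delta.enableChangeDataFeed` has been set on `bronze.orders` and that your pipeline records the last processed table version (`last_processed_version` is hypothetical):

```python
# Requires: ALTER TABLE bronze.orders SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
changed_rows = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version)  # tracked by your pipeline
    .table("bronze.orders")
    .filter("_change_type IN ('insert', 'update_postimage')")
)
# changed_rows feeds the same MERGE as above, without scanning unchanged data
```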
#9: No late-arriving data handling
The problem: Data that arrives out of order, or gets re-sent by the source, isn't handled properly.
```python
# Monday: Process orders for March 15
gold_daily_revenue = orders.filter("order_date = '2024-03-15'").agg(sum("amount"))
# Result: $100,000

# Wednesday: Late orders for March 15 arrive
# But your job only processes "new" data from Wednesday forward
# March 15 revenue is still $100,000 (wrong!)
# Actual March 15 revenue: $105,000
```
The effect:
- Downstream aggregates become stale or incorrect
- Historical data is wrong
- Reports show different numbers depending on when they ran
- Audits fail
The fix: Implement watermarks and idempotent merges:
```python
from datetime import timedelta

from delta.tables import DeltaTable
from pyspark.sql.functions import col, count, current_timestamp, sum as sum_

# Set a watermark for late-arriving data (e.g., allow 7 days late)
LATE_ARRIVAL_WINDOW_DAYS = 7

def process_with_late_arrival_handling(business_date):
    # Reprocess window: business date minus the late-arrival window
    reprocess_start = business_date - timedelta(days=LATE_ARRIVAL_WINDOW_DAYS)

    # Read Silver data for the reprocessing window
    orders_to_process = (
        spark.read.table("silver.orders")
        .filter(
            (col("order_date") >= reprocess_start) &
            (col("order_date") <= business_date)
        )
    )

    # Aggregate by business date
    daily_revenue = (
        orders_to_process
        .groupBy("order_date", "region")
        .agg(
            sum_("amount").alias("revenue"),
            count("order_id").alias("order_count"),
        )
        .withColumn("_calculated_at", current_timestamp())
    )

    # Merge into Gold (idempotent - safe to rerun)
    gold_table = DeltaTable.forName(spark, "gold.daily_revenue")
    (
        gold_table.alias("target")
        .merge(
            daily_revenue.alias("source"),
            "target.order_date = source.order_date AND target.region = source.region",
        )
        .whenMatchedUpdate(set={
            "revenue": "source.revenue",
            "order_count": "source.order_count",
            "_calculated_at": "source._calculated_at",
        })
        .whenNotMatchedInsert(values={
            "order_date": "source.order_date",
            "region": "source.region",
            "revenue": "source.revenue",
            "order_count": "source.order_count",
            "_calculated_at": "source._calculated_at",
        })
        .execute()
    )

# Now late-arriving data for March 15 updates the existing record
# Idempotent: running twice produces the same result
```
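For streaming pipelines, Spark Structured Streaming has a built-in version of the same idea. A minimal sketch, assuming `silver.orders` is readable as a stream and carries an event-time column `order_timestamp`:

```python
from pyspark.sql.functions import col, sum as sum_, window

streaming_daily_revenue = (
    spark.readStream.table("silver.orders")
    .withWatermark("order_timestamp", "7 days")   # tolerate up to 7 days of lateness
    .groupBy(window(col("order_timestamp"), "1 day"), col("region"))
    .agg(sum_("amount").alias("revenue"))
)
# Rows later than the watermark are dropped from the aggregation, so the batch
# reprocessing pattern above is still needed for anything older
```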
#10: Weak data quality contracts
The problem: ETL assumes "Bronze is fine" - but it's not. Nulls, bad joins, and wrong types propagate silently.
```python
# The Silver job assumes Bronze is clean
silver_orders = (
    bronze_orders
    .join(bronze_customers, "customer_id")  # Assumes customer_id exists
    .selectExpr("order_id", "customer_name", "CAST(amount AS DOUBLE) AS amount")  # Assumes amount is numeric
)

# Reality:
# - 15% of bronze_orders have a null customer_id → the inner join silently drops them
# - 2% of amounts are "N/A" → the cast returns null
# - Nobody notices until the CEO asks why revenue is down 17%
```
The effect:
- Silent data loss
- Incorrect aggregations
- Broken dashboards discovered in stakeholder meetings
- Loss of trust in data platform
The fix: Implement expectations at every layer boundary:
```python
# Option 1: Great Expectations (https://greatexpectations.io/)
# (legacy SparkDFDataset API shown; newer GX versions use a DataContext/Validator)
from great_expectations.dataset import SparkDFDataset

def validate_bronze_orders(df):
    gdf = SparkDFDataset(df)

    # Define expectations
    gdf.expect_column_to_exist("order_id")
    gdf.expect_column_values_to_not_be_null("order_id")
    gdf.expect_column_values_to_be_unique("order_id")
    gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)
    gdf.expect_column_values_to_not_be_null("customer_id", mostly=0.95)  # Allow 5% null

    # Validate
    results = gdf.validate()
    if not results.success:
        alert_data_quality_failure(results)  # Your own alerting hook
        raise DataQualityException(f"Bronze validation failed: {results}")
    return df
```

```sql
-- Option 2: dbt tests
-- models/silver/orders.sql
{{ config(materialized='table') }}

SELECT * FROM {{ ref('bronze_orders') }}
```

```yaml
# models/silver/orders.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_customers')
              field: customer_id
```

```python
# Option 3: Delta table constraints (https://docs.delta.io/latest/delta-constraints.html)
spark.sql("ALTER TABLE silver.orders ADD CONSTRAINT valid_amount CHECK (amount >= 0)")
spark.sql("ALTER TABLE silver.orders ADD CONSTRAINT valid_order_id CHECK (order_id IS NOT NULL)")
# CHECK violations are rejected at write time. Foreign-key constraints (e.g. on
# customer_id) are informational only in Delta/Unity Catalog and are not enforced,
# so referential integrity still needs a test like the dbt relationships test above.
```
When medallion architecture works best
Despite these pitfalls, medallion architecture is excellent for:
Use Case | Why It Works |
---|---|
Complex, messy source data | Bronze provides raw backup; Silver does heavy lifting |
Multiple consumers with different needs | Gold layer can serve many use cases from one Silver layer |
Regulated industries | Bronze provides audit trail; clear transformation lineage |
Machine learning pipelines | Silver provides clean features; Gold provides labeled datasets |
Data with complex business rules | Silver enforces rules; Gold optimizes for access patterns |
Decision framework: Do you need all three layers?
Consideration | Bronze → Silver → Gold | Bronze → Gold | Silver → Gold |
---|---|---|---|
Source data quality | Messy, evolving schema | Clean, structured | Already in lake |
Transformation complexity | High (5+ steps) | Low (1-2 steps) | Medium |
Reprocessing needs | Frequent | Rare | N/A |
Compliance/audit | Required | Not required | Partial |
Multiple consumers | Yes (5+) | Single use case | Multiple |
Latency requirements | Batch (hours) | Real-time | Mixed |
Storage budget | Generous | Constrained | Constrained |
The medallion architecture is a powerful pattern - but it's a starting point, not a mandate. The best data architectures are:
- Purpose-driven: Each layer solves a specific problem
- Cost-conscious: Don't store or compute "just in case"
- Quality-first: Test and validate at every boundary
- Incremental: Process only what changed
- Governed: Control Gold layer sprawl
- Observable: Track lineage and dependencies
- Evolvable: Handle schema changes gracefully
Remember: Bronze, Silver, and Gold are colors, not commandments. Your architecture should serve your business requirements, not the other way around.
The best data engineers know when to follow the pattern - and when to break it.
Related reading:
- Data Platform Observability Best Practices - Essential monitoring for medallion architectures
- Hidden Costs of Data Platform Technical Debt - Avoiding architectural debt accumulation
Struggling with data architecture decisions or dealing with medallion architecture pitfalls? Our team at Black Dog Labs helps companies design pragmatic, cost-effective data platforms that balance best practices with real-world constraints. Contact us to discuss your challenges.