
When the Medallion Shines Too Bright

The medallion architecture has become the default pattern for data lakes, but it's not a silver bullet. Learn the real-world pitfalls teams encounter and how to avoid them.

Black Dog Labs Team
3/14/2024
14 min read
data-architecture, data-engineering, best-practices, medallion-architecture

When the Medallion Shines Too Bright: Ten Pitfalls of Bronze-Silver-Gold Architecture

The medallion architecture (Bronze → Silver → Gold) has become the de facto standard for organizing data lakes. It's elegant, intuitive, and backed by major players like Databricks. But like any architectural pattern, blindly applying it without understanding its trade-offs can lead to complexity, waste, and frustrated data teams.

Let's explore the real pitfalls teams encounter in production - and how to avoid them.

The medallion promise

First, let's acknowledge why the medallion architecture became popular:

| Layer | Purpose | Benefits |
|---|---|---|
| Bronze | Raw data ingestion | Complete audit trail, reprocessing capability |
| Silver | Cleaned, deduplicated data | Standardized schemas, quality enforcement |
| Gold | Business-ready aggregates | Optimized for consumption, denormalized |
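
In code, the happy path looks roughly like this - a minimal PySpark sketch with hypothetical paths, table names, and columns (each hop gets a closer look below):

from pyspark.sql.functions import col, current_timestamp, sum as sum_, to_date

# Bronze: land the raw data as-is, plus ingestion metadata
(
    spark.read.json("s3://source/orders/")
    .withColumn("_ingested_at", current_timestamp())
    .write.mode("append").saveAsTable("bronze.orders")
)

# Silver: cleanse, type, and deduplicate
(
    spark.read.table("bronze.orders")
    .filter(col("order_id").isNotNull())
    .withColumn("amount", col("amount").cast("decimal(10,2)"))
    .dropDuplicates(["order_id"])
    .write.mode("overwrite").saveAsTable("silver.orders")
)

# Gold: aggregate for consumption
(
    spark.read.table("silver.orders")
    .groupBy(to_date("order_timestamp").alias("order_date"))
    .agg(sum_("amount").alias("revenue"))
    .write.mode("overwrite").saveAsTable("gold.daily_revenue")
)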

This progression makes intuitive sense. Raw data flows through refinement stages until it's ready for business use. What could go wrong?

Bronze layer pitfalls

#1: Overloading the bronze layer

The problem: Teams dump everything into Bronze - raw data, semi-processed files, reprocessed corrections, backfills, and "emergency fixes."

# What Bronze becomes:
bronze/
    raw_events/           # Original ingestion
    raw_events_v2/        # "Fixed" version
    raw_events_backfill/  # Historical load
    raw_events_hotfix/    # Manual correction
# Which one is canonical?

The effect:

  • Hard to tell what's the source of truth
  • Duplicate ingestion and storage bloat
  • Consumers scan multiple directories, unsure which one to use
  • Pipeline logic becomes "which Bronze table do I read from?"

The fix: Enforce a clear policy:

# Bronze = raw + immutable + append-only
# One rule: If it has ANY transformation logic, it's not Bronze
from pyspark.sql.functions import current_timestamp, input_file_name

bronze_orders = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://source/orders/")
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://lake/checkpoints/bronze_orders")
    .trigger(availableNow=True)
    .toTable("bronze.orders")  # Append-only, immutable
)
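
Conventions erode, so it helps to back the policy with enforcement. If Bronze is stored as Delta tables, the delta.appendOnly table property rejects updates and deletes at write time - a small sketch against the table created above:

# Make "append-only" a table property, not just a team convention:
# UPDATE and DELETE operations against this table will be rejected.
spark.sql("""
    ALTER TABLE bronze.orders
    SET TBLPROPERTIES ('delta.appendOnly' = 'true')
""")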

#2: Schema drift and bad data

The problem: Source systems evolve - extra columns appear, fields get renamed, data types change. Bronze just stores whatever arrives, and Silver ETL jobs start breaking.

# Monday: Schema looks fine
bronze_users = spark.read.table("bronze.users")
# Columns: user_id, email, created_at

# Tuesday: Source system adds phone_number and renames created_at to created_ts

# Wednesday: Your Silver job fails
silver_users = bronze_users.select("user_id", "email", "created_at")
# AnalysisException: cannot resolve 'created_at' given input columns

The effect:

  • Pipelines fail mysteriously when source schemas change
  • Data quality issues propagate silently
  • Can't tell when schema evolved vs. when it broke

The fix: Use schema evolution and versioning:

# Option 1: Delta Lake schema evolution
bronze_users = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://schemas/users/")  # Track schema changes
    .option("cloudFiles.inferColumnTypes", "true")
    .load("s3://source/users/")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://lake/checkpoints/bronze_users")  # Required for streaming writes
    .option("mergeSchema", "true")  # Allow schema evolution
    .toTable("bronze.users")
)

# Option 2: Schema validation with fallback
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

expected_schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("email", StringType(), True),
    StructField("created_at", TimestampType(), True)
])

bronze_raw = spark.read.json("s3://source/users/")

# Check for schema drift
actual_columns = set(bronze_raw.columns)
expected_columns = {f.name for f in expected_schema.fields}
new_columns = actual_columns - expected_columns
if new_columns:
    log_schema_change(new_columns)  # Alert data team

# Store everything, extra columns preserved
bronze_raw.write.mode("append").saveAsTable("bronze.users")

#3: Storage cost explosion

The problem: Storing three copies of every byte adds up fast.

Let's do some math for a typical data platform that accumulates about 100 TB of new data per year:

| Approach | Layer | Retention | Size | Monthly Cost (S3) |
|---|---|---|---|---|
| Full Medallion | Bronze | 2 years | 200 TB | $4,600 |
| | Silver | 2 years | 200 TB | $4,600 |
| | Gold | 2 years | 200 TB | $4,600 |
| | Total | | 600 TB | $13,800/month |
| Optimized | Bronze | 90 days | 25 TB | $575 |
| | Silver | 1 year | 100 TB | $2,300 |
| | Gold | 2 years | 200 TB | $4,600 |
| | Total | | 325 TB | $7,475/month |

The "optimized" approach saves over $6,000/month (46% reduction) by keeping Gold data longer (where it matters for business) and Bronze shorter (where it's just insurance).

The reality: Bronze is often just insurance. Ask yourself:

  • How often do you actually reprocess from Bronze?
  • Could you re-ingest from the source system if needed?
  • Does keeping raw JSON for years serve any purpose?

Better approach: Implement tiered retention based on business value:

# Lifecycle policy example
lifecycle_rules = {
    "bronze": {
        "hot": "30 days",       # S3 Standard
        "warm": "60 days",      # S3 Infrequent Access
        "archive": "90 days",   # S3 Glacier
        "delete": "180 days"    # Gone
    },
    "silver": {
        "hot": "1 year",        # Query-optimized
        "archive": "3 years"    # Compliance
    },
    "gold": {
        "hot": "2 years",       # Business-critical, optimized for queries
        "no_deletion": True     # Keep indefinitely for historical analysis
    }
}
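
If Bronze lands as plain objects in S3 (rather than a managed table format), the same policy can be pushed down to the bucket itself. A minimal boto3 sketch, with a hypothetical bucket and prefix:

import boto3

s3 = boto3.client("s3")

# Transition Bronze objects through storage classes and expire them,
# mirroring the "bronze" rules above (hypothetical bucket/prefix).
s3.put_bucket_lifecycle_configuration(
    Bucket="lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "bronze-tiered-retention",
                "Filter": {"Prefix": "bronze/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 180},
            }
        ]
    },
)

For Delta or Iceberg tables, object-level expiration would corrupt the table, so retention there is managed at the table level instead (for example, dropping old partitions and vacuuming).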

Silver layer pitfalls

#4: Transformation bloat

The problem: Cramming all cleansing, enrichment, joins, aggregations, and validation into a single monster job.

# The 1,000-line Silver transformation job from hell
def create_silver_user_metrics(spark):
    # Step 1: Read from 5 Bronze sources
    # Step 2: Clean and standardize (200 lines)
    # Step 3: Join everything (150 lines)
    # Step 4: Enrich with external APIs (100 lines)
    # Step 5: Validate (50 lines)
    # Step 6: Aggregate (200 lines)
    # Step 7: Apply complex business logic (300 lines)
    # If ANY step fails, debug all 1,000 lines
    ...

The effect:

  • Complex DAGs with poor modularity
  • Long runtimes (can't optimize individual steps)
  • Difficult to test, debug, or parallelize
  • One small change requires full redeployment

The fix: Split transformations into logical stages:

# Bronze → Silver: Cleanse and standardize ONLY
from pyspark.sql.functions import col, lower, trim, sum

def bronze_to_silver_orders(spark):
    return (
        spark.read.table("bronze.orders")
        .filter(col("order_id").isNotNull())                         # Basic validation
        .withColumn("email", lower(trim(col("email"))))              # Standardize
        .withColumn("amount", col("amount").cast("decimal(10,2)"))   # Type safety
        .dropDuplicates(["order_id", "order_timestamp"])             # Dedupe
        .write.mode("overwrite").saveAsTable("silver.orders")
    )

# Silver → Silver: Enrichment (separate job)
def enrich_silver_orders(spark):
    orders = spark.read.table("silver.orders")
    customers = spark.read.table("silver.customers")
    return (
        orders.join(customers, "customer_id", "left")
        .write.mode("overwrite").saveAsTable("silver.orders_enriched")
    )

# Silver → Gold: Aggregate and serve (separate job)
def create_gold_daily_revenue(spark):
    return (
        spark.read.table("silver.orders_enriched")
        .groupBy("order_date", "region")
        .agg(sum("amount").alias("revenue"))
        .write.mode("overwrite").saveAsTable("gold.daily_revenue")
    )

#5: Inconsistent keys and deduplication

The problem: Different pipelines use different surrogate key strategies and deduplication logic.

# Pipeline A: Uses MD5 hash
silver_users_a = bronze.withColumn("user_key", md5(concat("email", "signup_date")))
# Pipeline B: Uses UUID
silver_users_b = bronze.withColumn("user_key", expr("uuid()"))
# Pipeline C: Uses source system ID
silver_users_c = bronze.withColumn("user_key", col("source_user_id"))
# Now joins between Silver tables are inconsistent
# Customer support: "Why does user X have 3 different IDs?"

The effect:

  • Joins behave inconsistently
  • Upserts fail silently
  • Duplicate records proliferate
  • Data quality reports are unreliable

The fix: Centralize key generation in Silver:

# Create a canonical key management module
from pyspark.sql.functions import col, concat_ws, lit, sha2

def generate_canonical_key(source_system, natural_keys):
    """
    Centralized key generation for all Silver tables.

    Args:
        source_system: Source system name (e.g., 'salesforce', 'mysql')
        natural_keys: Business keys from the source (list of Column objects)
    """
    key_columns = [lit(source_system)] + natural_keys
    return sha2(concat_ws("||", *key_columns), 256)

# Use consistently across all Silver tables
silver_customers = (
    spark.read.table("bronze.customers")
    .withColumn("customer_key",
        generate_canonical_key("salesforce", [col("account_id")]))
)

silver_orders = (
    spark.read.table("bronze.orders")
    .withColumn("order_key",
        generate_canonical_key("shopify", [col("order_id")]))
    .withColumn("customer_key",  # Same key strategy, so joins line up
        generate_canonical_key("salesforce", [col("customer_account_id")]))
)

# Document key strategy in data catalog
"""
Silver Key Standards:
- All Silver tables have a {entity}_key column
- Format: SHA256(source_system || natural_key_1 || natural_key_2 || ...)
- Deduplication: Always use row_number() over (partition by {entity}_key order by _updated_at desc)
"""

Gold layer pitfalls

#6: Business logic drift

The problem: Analysts encode business logic (KPIs, dimensions) differently across Gold tables and dashboards.

# Data Engineer's definition of "Active User"
gold_active_users = users.filter("last_login > current_date() - 30")
# Analyst A's definition in Tableau
# "Active User" = login in last 7 days
# Analyst B's definition in Python
# "Active User" = made a purchase in last 14 days
# CFO asks: "How many active users do we have?"
# Three different answers, three different people, same meeting

The effect:

  • "One metric, five definitions" chaos
  • Stakeholder confusion and mistrust
  • Time wasted reconciling different numbers
  • Compliance risks (especially for financial metrics)

The fix: Centralize metrics in a semantic layer:

# Option 1: [dbt metrics](https://docs.getdbt.com/docs/build/metrics)
# models/metrics.yml
metrics:
  - name: active_users
    label: Active Users (30-day)
    model: ref('gold_users')
    calculation_method: count_distinct
    expression: user_id
    timestamp: activity_date
    time_grains: [day, week, month]
    dimensions:
      - region
      - user_type
    filters:
      - field: last_activity_date
        operator: '>'
        value: "current_date - interval '30 days'"
    meta:
      owner: data-team@company.com
      definition: "Users with any activity in the last 30 days"

-- Option 2: Build a metrics layer in Gold
CREATE OR REPLACE VIEW gold.metrics_active_users AS
WITH user_activity AS (
    SELECT
        user_id,
        MAX(activity_date) AS last_activity_date
    FROM silver.user_events
    GROUP BY user_id
),
active_users AS (
    SELECT
        COUNT(DISTINCT user_id) AS active_user_count,
        CURRENT_DATE() AS metric_date
    FROM user_activity
    WHERE last_activity_date > CURRENT_DATE() - INTERVAL '30 days'
)
SELECT * FROM active_users;

-- Document the metric definition
COMMENT ON VIEW gold.metrics_active_users IS
'Official definition of Active Users: Users with any activity in last 30 days.
Owner: Data Team (data-team@company.com)
Updated: Daily at 6 AM UTC
Dependencies: silver.user_events
Do not create alternate definitions - use this view as source of truth.';

#7: Materialization explosion

The problem: Every analyst or team builds their own "Gold table" for their specific use case.

# After 6 months in production:
gold.sales_dashboard_v1
gold.sales_dashboard_v2
gold.sales_dashboard_final
gold.sales_dashboard_actually_final
gold.sales_team_metrics
gold.sales_executive_summary
gold.sales_daily # Updated daily
gold.sales_hourly # Updated hourly but has same data as sales_daily
gold.marketing_sales # Duplicate of sales_team_metrics with 2 extra columns
# ...100+ more Gold tables

The effect:

  • Massive storage bloat
  • Inconsistent refresh schedules
  • Redundant compute
  • Confusion about which table to use
  • Maintenance nightmare

The fix: Govern the Gold layer with data contracts:

# Require approval for new Gold tables via PR process
# gold_table_registry.yaml
gold_tables:
  - name: gold.daily_revenue
    owner: finance_team
    refresh: daily_6am
    consumers:
      - finance_dashboard
      - executive_report
    retention: 2_years
    approved_by: data_governance_team
    created_date: 2024-01-15
  - name: gold.user_segments
    owner: marketing_team
    refresh: weekly_monday
    consumers:
      - email_campaigns
      - marketing_dashboard
    retention: 1_year
    approved_by: data_governance_team
    created_date: 2024-02-01

# Automated check in CI/CD
def validate_gold_table_request(table_name, definition):
    registry = load_registry("gold_table_registry.yaml")

    # Check if a similar table already exists
    similar_tables = find_similar_tables(table_name, registry)
    if similar_tables:
        raise Exception(
            f"Similar Gold tables already exist: {similar_tables}\n"
            f"Consider reusing or extending an existing table instead."
        )

    # Require approval
    if not has_governance_approval(table_name):
        raise Exception(
            f"Gold table {table_name} requires data governance approval.\n"
            f"Submit request via: https://data-governance.company.com/request"
        )

Operational and cross-cutting pitfalls

#8: Lack of incremental processing

The problem: Full refreshes of Silver/Gold tables even when only 1% of data changes.

# Every night: Full refresh of a 10 TB table
silver_orders = (
    spark.read.table("bronze.orders")  # Read ALL Bronze data
    .transform(clean_and_dedupe)
    .write
    .mode("overwrite")                 # Rewrite the entire table
    .saveAsTable("silver.orders")
)
# Cost: $500/night in compute
# Time: 4 hours
# Actual new data: 50 GB (0.5%)

The effect:

  • High compute costs
  • Long runtimes
  • Job failures (can't finish in batch window)
  • Unnecessary load on source systems

The fix: Implement incremental processing:

# CDC pattern with merge
from delta.tables import DeltaTable
from pyspark.sql.functions import col

def incremental_bronze_to_silver(table_name, watermark_column="_ingested_at"):
    # Get the last processed watermark
    last_watermark = (
        spark.read.table(f"silver.{table_name}")
        .selectExpr(f"MAX({watermark_column}) AS max_watermark")
        .collect()[0]["max_watermark"]
    )
    if last_watermark is None:
        last_watermark = "1970-01-01"  # First run: Silver is empty, process everything

    # Read only new Bronze data
    new_bronze_data = (
        spark.read.table(f"bronze.{table_name}")
        .filter(col(watermark_column) > last_watermark)  # Incremental
        .transform(clean_and_dedupe)
    )

    # Merge into Silver
    silver_table = DeltaTable.forName(spark, f"silver.{table_name}")
    (
        silver_table.alias("target")
        .merge(new_bronze_data.alias("source"), "target.id = source.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# Cost: $10/night (50 GB incremental)
# Time: 15 minutes
# Same result as a full refresh

#9: No late-arriving data handling

The problem: Data arriving out-of-order or re-sent isn't handled properly.

# Monday: Process orders for March 15
gold_daily_revenue = orders.filter("order_date = '2024-03-15'").agg(sum("amount"))
# Result: $100,000
# Wednesday: Late orders for March 15 arrive
# But your job only processes "new" data from Wednesday forward
# March 15 revenue is still $100,000 (wrong!)
# Actual March 15 revenue: $105,000

The effect:

  • Downstream aggregates become stale or incorrect
  • Historical data is wrong
  • Reports show different numbers depending on when they ran
  • Audits fail

The fix: Implement watermarks and idempotent merges:

# Set a watermark for late-arriving data (e.g., allow up to 7 days late)
from datetime import timedelta
from delta.tables import DeltaTable
from pyspark.sql.functions import col, count, current_timestamp, sum

LATE_ARRIVAL_WINDOW_DAYS = 7

def process_with_late_arrival_handling(business_date):
    # Reprocess window: business date minus the late-arrival window
    reprocess_start = business_date - timedelta(days=LATE_ARRIVAL_WINDOW_DAYS)

    # Read Silver data for the reprocessing window
    orders_to_process = (
        spark.read.table("silver.orders")
        .filter(
            (col("order_date") >= reprocess_start) &
            (col("order_date") <= business_date)
        )
    )

    # Aggregate by business date
    daily_revenue = (
        orders_to_process
        .groupBy("order_date", "region")
        .agg(
            sum("amount").alias("revenue"),
            count("order_id").alias("order_count")
        )
        .withColumn("_calculated_at", current_timestamp())
    )

    # Merge into Gold (idempotent - safe to rerun)
    gold_table = DeltaTable.forName(spark, "gold.daily_revenue")
    (
        gold_table.alias("target")
        .merge(
            daily_revenue.alias("source"),
            "target.order_date = source.order_date AND target.region = source.region"
        )
        .whenMatchedUpdate(set={
            "revenue": "source.revenue",
            "order_count": "source.order_count",
            "_calculated_at": "source._calculated_at"
        })
        .whenNotMatchedInsert(values={
            "order_date": "source.order_date",
            "region": "source.region",
            "revenue": "source.revenue",
            "order_count": "source.order_count",
            "_calculated_at": "source._calculated_at"
        })
        .execute()
    )

# Now late-arriving data for March 15 updates the existing record
# Idempotent: running twice produces the same result
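
Because the merge is idempotent, the same function covers both the nightly run and catch-up after an outage - a short usage sketch (the number of missed days is arbitrary):

from datetime import date, timedelta

# Nightly run: recompute the trailing 7-day window ending today
process_with_late_arrival_handling(date.today())

# Catching up after an outage: replay each missed day. Overlapping windows
# converge to the same final numbers because the merge is idempotent.
for days_ago in range(5, 0, -1):
    process_with_late_arrival_handling(date.today() - timedelta(days=days_ago))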

#10: Weak data quality contracts

The problem: ETL assumes "Bronze is fine" - but it's not. Nulls, bad joins, and wrong types propagate silently.

# Silver job assumes Bronze is clean
silver_orders = (
    bronze_orders
    .join(bronze_customers, "customer_id")  # Assumes customer_id exists
    .selectExpr("order_id", "customer_name", "CAST(amount AS DOUBLE)")  # Assumes amount is numeric
)
# Reality:
# - 15% of bronze_orders have null customer_id → Join silently drops them
# - 2% of amounts are "N/A" → Cast returns null
# - Nobody notices until the CEO asks why revenue is down 17%

The effect:

  • Silent data loss
  • Incorrect aggregations
  • Broken dashboards discovered in stakeholder meetings
  • Loss of trust in data platform

The fix: Implement expectations at every layer boundary:

# Option 1: [Great Expectations](https://greatexpectations.io/)
# (Sketch using the legacy SparkDFDataset API; newer GX releases use
# checkpoints and validation definitions, but the idea is the same.)
from great_expectations.dataset import SparkDFDataset

def validate_bronze_orders(df):
    ge_df = SparkDFDataset(df)

    # Define expectations
    ge_df.expect_column_to_exist("order_id")
    ge_df.expect_column_values_to_not_be_null("order_id")
    ge_df.expect_column_values_to_be_unique("order_id")
    ge_df.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)
    ge_df.expect_column_values_to_not_be_null("customer_id", mostly=0.95)  # Allow 5% null

    # Validate
    results = ge_df.validate()
    if not results.success:
        alert_data_quality_failure(results)
        raise DataQualityException(f"Bronze validation failed: {results}")
    return df

# Option 2: dbt tests
# models/silver/orders.sql
{{
  config(
    materialized='table'
  )
}}

SELECT * FROM {{ ref('bronze_orders') }}

-- tests/silver/orders_tests.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_customers')
              field: customer_id

# Option 3: [Delta table constraints](https://docs.delta.io/latest/delta-constraints.html)
spark.sql("ALTER TABLE silver.orders ADD CONSTRAINT valid_amount CHECK (amount >= 0)")
spark.sql("ALTER TABLE silver.orders ADD CONSTRAINT valid_order_id CHECK (order_id IS NOT NULL)")
# CHECK and NOT NULL constraint violations are rejected automatically at write time.
# (Foreign key constraints exist only as informational metadata on Databricks
# Unity Catalog and are not enforced, so referential checks like
# customer_id REFERENCES silver.customers still belong in a test suite.)

When medallion architecture works best

Despite these pitfalls, medallion architecture is excellent for:

| Use Case | Why It Works |
|---|---|
| Complex, messy source data | Bronze provides raw backup; Silver does the heavy lifting |
| Multiple consumers with different needs | Gold layer can serve many use cases from one Silver layer |
| Regulated industries | Bronze provides an audit trail; clear transformation lineage |
| Machine learning pipelines | Silver provides clean features; Gold provides labeled datasets |
| Data with complex business rules | Silver enforces rules; Gold optimizes for access patterns |

Decision framework: Do you need all three layers?

| Consideration | Bronze → Silver → Gold | Bronze → Gold | Silver → Gold |
|---|---|---|---|
| Source data quality | Messy, evolving schema | Clean, structured | Already in lake |
| Transformation complexity | High (5+ steps) | Low (1-2 steps) | Medium |
| Reprocessing needs | Frequent | Rare | N/A |
| Compliance/audit | Required | Not required | Partial |
| Multiple consumers | Yes (5+) | Single use case | Multiple |
| Latency requirements | Batch (hours) | Real-time | Mixed |
| Storage budget | Generous | Constrained | Constrained |

The medallion architecture is a powerful pattern - but it's a starting point, not a mandate. The best data architectures are:

  1. Purpose-driven: Each layer solves a specific problem
  2. Cost-conscious: Don't store or compute "just in case"
  3. Quality-first: Test and validate at every boundary
  4. Incremental: Process only what changed
  5. Governed: Control Gold layer sprawl
  6. Observable: Track lineage and dependencies
  7. Evolvable: Handle schema changes gracefully

Remember: Bronze, Silver, and Gold are colors, not commandments. Your architecture should serve your business requirements, not the other way around.

The best data engineers know when to follow the pattern - and when to break it.

Struggling with data architecture decisions or dealing with medallion architecture pitfalls? Our team at Black Dog Labs helps companies design pragmatic, cost-effective data platforms that balance best practices with real-world constraints. Contact us to discuss your challenges.