Retail ETL Reliability
A Python ETL that ingests messy retail exports, applies deterministic cleaning and validation, and promotes curated dimension + fact tables to SQLite for lightweight analytics.
Python · Pandas/Polars · SQLite
Challenges
Dirty & inconsistent source data
Handled invalid or missing dates, mismatched line totals, inconsistent category labels, and duplicate rows. Validation rules and anomaly flags preserve lineage instead of discarding records.
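A minimal sketch of what rule-based cleaning with QA flags can look like in pandas; the column names (order_id, order_date, quantity, unit_price, line_total, category) are hypothetical stand-ins for the real export schema:

```python
# Flag-don't-drop cleaning sketch (pandas); column names are illustrative.
import pandas as pd

def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Parse dates leniently; invalid strings become NaT and are flagged,
    # not dropped, so lineage is preserved.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["flag_bad_date"] = df["order_date"].isna()

    # Flag rows where the reported line total disagrees with qty * price
    # beyond a small tolerance.
    expected = df["quantity"] * df["unit_price"]
    df["flag_total_mismatch"] = (df["line_total"] - expected).abs() > 0.01

    # Normalize category labels before dimension lookups.
    df["category"] = df["category"].str.strip().str.title()

    # Mark (but keep) duplicates on the business key.
    df["flag_duplicate"] = df.duplicated(
        subset=["order_id", "order_date"], keep="first"
    )
    return df
```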
Designing an analytics-ready schema
Modeled a star schema (fact_transactions, dim_date, dim_product, dim_store, dim_payment) with staging tables for idempotent loads and referential checks before promotion.
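A sketch of the staging-then-promote pattern against SQLite; the table and column names here are illustrative, not the project's exact DDL:

```python
# Load into staging, verify referential integrity against the dimensions,
# then promote to the fact table only if every staged row resolves.
import sqlite3

con = sqlite3.connect("retail.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS dim_product (product_key INTEGER PRIMARY KEY, sku TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS stg_transactions (order_id TEXT, sku TEXT, amount REAL);
CREATE TABLE IF NOT EXISTS fact_transactions (
    order_id TEXT, product_key INTEGER, amount REAL,
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key)
);
""")

# Referential check: staged rows whose SKU has no dimension entry.
orphans = con.execute("""
    SELECT COUNT(*) FROM stg_transactions s
    LEFT JOIN dim_product p ON p.sku = s.sku
    WHERE p.product_key IS NULL
""").fetchone()[0]

if orphans == 0:
    # Promote only when every staged row maps to a dimension key.
    con.execute("""
        INSERT INTO fact_transactions (order_id, product_key, amount)
        SELECT s.order_id, p.product_key, s.amount
        FROM stg_transactions s JOIN dim_product p ON p.sku = s.sku
    """)
    con.commit()
```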
Deterministic deduplication
Introduced composite business keys and load timestamps to power deterministic upserts, documenting merge rules for transparency.
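One way to express such a deterministic upsert is SQLite's ON CONFLICT clause; the composite key columns and table name below are assumptions for illustration:

```python
# Deterministic upsert keyed on a composite business key (illustrative names).
import sqlite3

con = sqlite3.connect("retail.db")
con.execute("""
CREATE TABLE IF NOT EXISTS fact_transactions (
    order_id  TEXT    NOT NULL,
    line_no   INTEGER NOT NULL,
    amount    REAL,
    loaded_at TEXT    NOT NULL,
    PRIMARY KEY (order_id, line_no)          -- composite business key
)
""")

rows = [("A1001", 1, 19.99, "2024-01-05T08:00:00")]
con.executemany("""
    INSERT INTO fact_transactions (order_id, line_no, amount, loaded_at)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (order_id, line_no) DO UPDATE SET
        amount    = excluded.amount,
        loaded_at = excluded.loaded_at
    -- merge rule: the newer load wins, so replaying a load is a no-op
    WHERE excluded.loaded_at >= fact_transactions.loaded_at
""", rows)
con.commit()
```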
What I learned
- Codifying data-quality rules and surfacing QA flags for stakeholders.
- Balancing dimensional modeling tradeoffs for small OLAP workloads.
- Designing idempotent, restartable loads with clear lineage.
NYC Taxi Medallion Pipeline (Local Lakehouse)
A local lakehouse pipeline processing NYC Taxi parquet files: Bronze unifies raw data, Silver standardizes and flags data quality issues, and Gold dbt models in DuckDB deliver a queryable star schema.
Polars · DuckDB · dbt · Python
Challenges
Large files on a single machine
Leverages Polars lazy scans, projection/predicate pushdown, and chunked writes to keep peak memory modest while still performing joins and aggregations.
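A sketch of the lazy-scan pattern in Polars, assuming standard Yellow Taxi column names and an illustrative file path; the project's actual query plan may differ:

```python
# Lazy scan lets Polars push the column selection and row filter down into
# the parquet reads, so only the needed data is materialized.
import polars as pl

lazy = (
    pl.scan_parquet("data/bronze/yellow_tripdata_*.parquet")
    .select(["tpep_pickup_datetime", "trip_distance", "total_amount"])  # projection pushdown
    .filter(pl.col("trip_distance") > 0)                                # predicate pushdown
    .group_by(pl.col("tpep_pickup_datetime").dt.date().alias("pickup_date"))
    .agg(
        pl.len().alias("trips"),
        pl.col("total_amount").sum().alias("revenue"),
    )
)

# Streaming execution processes the plan in chunks instead of holding all
# rows in memory (newer Polars spells this collect(engine="streaming")).
daily = lazy.collect(streaming=True)
```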
Medallion contracts & QA flags
Standardized canonical names, normalized datetimes, enforced enums, and added QA flags for outliers, negative fares, and improbable speeds.
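A sketch of Silver-layer QA flagging in Polars; the thresholds and column names are illustrative assumptions:

```python
# Flag contract violations instead of dropping rows, so Gold can decide
# what to promote.
import polars as pl

def add_qa_flags(trips: pl.LazyFrame) -> pl.LazyFrame:
    duration_h = (
        pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime")
    ).dt.total_seconds() / 3600.0

    return trips.with_columns(
        (pl.col("fare_amount") < 0).alias("qa_negative_fare"),
        (pl.col("trip_distance") > 100).alias("qa_distance_outlier"),
        # An implied speed above ~90 mph is treated as improbable for a taxi.
        ((pl.col("trip_distance") / duration_h) > 90).alias("qa_improbable_speed"),
    )
```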
Gold star schema with tests
Modeled FACT_TRIPS with supporting dimensions in dbt and layered tests for unique keys, not-null constraints, and custom fare component tolerances.
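The dbt tests themselves live in YAML and SQL; as an illustration, the assertions they encode can be spot-checked directly against DuckDB like this (table and column names assumed):

```python
# Ad-hoc DuckDB checks mirroring the dbt unique/not_null and custom
# fare-tolerance tests; names are illustrative.
import duckdb

con = duckdb.connect("lakehouse.duckdb")

# unique + not_null on the fact surrogate key
dupes = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT trip_key FROM fact_trips
        GROUP BY trip_key HAVING COUNT(*) > 1 OR trip_key IS NULL
    )
""").fetchone()[0]

# custom tolerance: fare components should sum to the total within one cent
bad_fares = con.execute("""
    SELECT COUNT(*) FROM fact_trips
    WHERE ABS(total_amount - (fare_amount + tip_amount + tolls_amount
                              + mta_tax + improvement_surcharge)) > 0.01
""").fetchone()[0]

assert dupes == 0 and bad_fares == 0, "Gold contract violated"
```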
What I learned
- Applying medallion architecture on commodity hardware.
- Promoting only contract-compliant rows forward in the pipeline.
- Authoring dbt models and tests in DuckDB for trusted marts.
Real-Time Stock Streaming to Snowflake
Simulates live market ticks, publishes them to Kafka, aggregates them into 1-second OHLCV bars, reconciles the bars against the raw stream, and micro-batches trusted bars into Snowflake with cost-aware warehouse usage.
Kafka · Python · Snowflake · Streamlit
Challenges
Coordinating multi-stage streaming
Split producer, consumer, and aggregator into separate roles with explicit JSON contracts, monitored end-to-end latency, and kept micro-batches backpressure-safe.
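A minimal sketch of the producer side of such a JSON contract using kafka-python; the topic name and message fields are assumptions, not the project's exact schema:

```python
# Every downstream stage relies on exactly these keys, so the contract is
# enforced at the point of publishing.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def publish_tick(symbol: str, price: float, size: int) -> None:
    tick = {
        "symbol": symbol,
        "price": price,
        "size": size,
        "ts": time.time_ns() // 1_000_000,  # epoch millis, used for latency tracking
    }
    producer.send("market.ticks", value=tick)

publish_tick("AAPL", 189.42, 100)
producer.flush()
```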
Cost & correctness in Snowflake
Used auto-suspend XS warehouses and wrote 5-row-per-minute aggregates instead of raw ticks, with reconciliation checks to verify bar accuracy.
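A sketch of the micro-batch write with the Snowflake Python connector, assuming placeholder connection settings, table, and column names:

```python
# Buffer a minute of 1-second OHLCV bars, then land them in a single
# executemany against an XS warehouse configured with AUTO_SUSPEND.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",          # placeholders; real values come from config/env
    user="etl_user",
    password="***",
    warehouse="XS_STREAMING_WH",   # XS size, auto-suspend enabled
    database="MARKET",
    schema="CURATED",
)

def write_bars(bars: list[tuple]) -> None:
    # bars: (symbol, bar_ts, open_px, high_px, low_px, close_px, volume)
    with conn.cursor() as cur:
        cur.executemany(
            """
            INSERT INTO ohlcv_1s
                (symbol, bar_ts, open_px, high_px, low_px, close_px, volume)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            """,
            bars,
        )
    conn.commit()
```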
Observability & validation
Added ingest latency and consumer lag metrics plus a replay validator comparing aggregated vs. raw streams for drift detection.
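A sketch of what such a replay validator can look like in pandas: recompute bars from the raw tick log and diff them against the live aggregates (frame shapes, column names, and tolerance are illustrative assumptions):

```python
# Recompute 1-second OHLC bars from raw ticks and report rows where the
# live aggregates drifted from the replay.
import pandas as pd

def validate_bars(
    raw_ticks: pd.DataFrame,   # columns: symbol, ts (datetime), price, size
    live_bars: pd.DataFrame,   # columns: symbol, ts, open, high, low, close
    tol: float = 1e-9,
) -> pd.DataFrame:
    recomputed = (
        raw_ticks
        .set_index("ts")
        .groupby("symbol")["price"]
        .resample("1s")
        .ohlc()
        .dropna()
        .reset_index()
    )
    merged = recomputed.merge(
        live_bars, on=["symbol", "ts"], suffixes=("_replay", "_live")
    )
    drift = (merged["close_replay"] - merged["close_live"]).abs() > tol
    return merged[drift]   # rows where live aggregates diverge from replay
```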
What I learned
- Designing event-driven pipelines with durable messaging and contracts.
- Balancing latency, cost, and fidelity when landing streaming data in a warehouse.
- Building lightweight validation to keep real-time aggregates trustworthy.