Retail ETL Reliability
A Python ETL that ingests messy retail exports, applies deterministic cleaning and validation, and promotes curated dimension + fact tables to SQLite for lightweight analytics.
Python · Pandas/Polars · SQLite
Challenges
Dirty & inconsistent source data
Handled invalid or missing dates, mismatched line totals, inconsistent category labels, and duplicate rows. Validation rules and anomaly flags preserve lineage instead of discarding records.
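A minimal sketch of what rule-based cleaning with QA flags can look like in pandas; the column names (order_id, order_date, quantity, unit_price, line_total, category) are hypothetical stand-ins for the real export schema:

```python
# Flag-don't-drop cleaning sketch (pandas); column names are illustrative.
import pandas as pd

def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Parse dates leniently; invalid strings become NaT and are flagged,
    # not dropped, so lineage is preserved.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["flag_bad_date"] = df["order_date"].isna()

    # Flag rows where the reported line total disagrees with qty * price
    # beyond a small tolerance.
    expected = df["quantity"] * df["unit_price"]
    df["flag_total_mismatch"] = (df["line_total"] - expected).abs() > 0.01

    # Normalize category labels before dimension lookups.
    df["category"] = df["category"].str.strip().str.title()

    # Mark (but keep) duplicates on the business key.
    df["flag_duplicate"] = df.duplicated(
        subset=["order_id", "order_date"], keep="first"
    )
    return df
```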
Designing an analytics-ready schema
Modeled a star schema (fact_transactions, dim_date, dim_product, dim_store, dim_payment) with staging tables for idempotent loads and referential checks before promotion.
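A sketch of the staging-then-promote pattern against SQLite; the table and column names here are illustrative, not the project's exact DDL:

```python
# Load into staging, verify referential integrity against the dimensions,
# then promote to the fact table only if every staged row resolves.
import sqlite3

con = sqlite3.connect("retail.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS dim_product (product_key INTEGER PRIMARY KEY, sku TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS stg_transactions (order_id TEXT, sku TEXT, amount REAL);
CREATE TABLE IF NOT EXISTS fact_transactions (
    order_id TEXT, product_key INTEGER, amount REAL,
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key)
);
""")

# Referential check: staged rows whose SKU has no dimension entry.
orphans = con.execute("""
    SELECT COUNT(*) FROM stg_transactions s
    LEFT JOIN dim_product p ON p.sku = s.sku
    WHERE p.product_key IS NULL
""").fetchone()[0]

if orphans == 0:
    # Promote only when every staged row maps to a dimension key.
    con.execute("""
        INSERT INTO fact_transactions (order_id, product_key, amount)
        SELECT s.order_id, p.product_key, s.amount
        FROM stg_transactions s JOIN dim_product p ON p.sku = s.sku
    """)
    con.commit()
```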
Deterministic deduplication
Introduced composite business keys and load timestamps to power deterministic upserts, documenting merge rules for transparency.
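One way to express such a deterministic upsert is SQLite's ON CONFLICT clause; the composite key columns and table name below are assumptions for illustration:

```python
# Deterministic upsert keyed on a composite business key (illustrative names).
import sqlite3

con = sqlite3.connect("retail.db")
con.execute("""
CREATE TABLE IF NOT EXISTS fact_transactions (
    order_id  TEXT    NOT NULL,
    line_no   INTEGER NOT NULL,
    amount    REAL,
    loaded_at TEXT    NOT NULL,
    PRIMARY KEY (order_id, line_no)          -- composite business key
)
""")

rows = [("A1001", 1, 19.99, "2024-01-05T08:00:00")]
con.executemany("""
    INSERT INTO fact_transactions (order_id, line_no, amount, loaded_at)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (order_id, line_no) DO UPDATE SET
        amount    = excluded.amount,
        loaded_at = excluded.loaded_at
    -- merge rule: the newer load wins, so replaying a load is a no-op
    WHERE excluded.loaded_at >= fact_transactions.loaded_at
""", rows)
con.commit()
```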
What I learned
- Codifying data-quality rules and surfacing QA flags for stakeholders.
- Balancing dimensional modeling tradeoffs for small OLAP workloads.
- Designing idempotent, restartable loads with clear lineage.
NYC Taxi Medallion Pipeline (Local Lakehouse)
A local lakehouse pipeline processing NYC Taxi parquet files: Bronze unifies raw data, Silver standardizes and flags data quality issues, and Gold dbt models in DuckDB deliver a queryable star schema.
Polars · DuckDB · dbt · Python
Challenges
Large files on a single machine
Leverages Polars lazy scans, projection/predicate pushdown, and chunked writes to keep peak memory modest while still performing joins and aggregations.
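A sketch of the lazy-scan pattern in Polars, assuming standard Yellow Taxi column names and an illustrative file path; the project's actual query plan may differ:

```python
# Lazy scan lets Polars push the column selection and row filter down into
# the parquet reads, so only the needed data is materialized.
import polars as pl

lazy = (
    pl.scan_parquet("data/bronze/yellow_tripdata_*.parquet")
    .select(["tpep_pickup_datetime", "trip_distance", "total_amount"])  # projection pushdown
    .filter(pl.col("trip_distance") > 0)                                # predicate pushdown
    .group_by(pl.col("tpep_pickup_datetime").dt.date().alias("pickup_date"))
    .agg(
        pl.len().alias("trips"),
        pl.col("total_amount").sum().alias("revenue"),
    )
)

# Streaming execution processes the plan in chunks instead of holding all
# rows in memory (newer Polars spells this collect(engine="streaming")).
daily = lazy.collect(streaming=True)
```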
Medallion contracts & QA flags
Standardized canonical names, normalized datetimes, enforced enums, and added QA flags for outliers, negative fares, and improbable speeds.
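A sketch of Silver-layer QA flagging in Polars; the thresholds and column names are illustrative assumptions:

```python
# Flag contract violations instead of dropping rows, so Gold can decide
# what to promote.
import polars as pl

def add_qa_flags(trips: pl.LazyFrame) -> pl.LazyFrame:
    duration_h = (
        pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime")
    ).dt.total_seconds() / 3600.0

    return trips.with_columns(
        (pl.col("fare_amount") < 0).alias("qa_negative_fare"),
        (pl.col("trip_distance") > 100).alias("qa_distance_outlier"),
        # An implied speed above ~90 mph is treated as improbable for a taxi.
        ((pl.col("trip_distance") / duration_h) > 90).alias("qa_improbable_speed"),
    )
```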
Gold star schema with tests
Modeled FACT_TRIPS with supporting dimensions in dbt and layered tests for unique keys, not-null constraints, and custom fare component tolerances.
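The dbt tests themselves live in YAML and SQL; as an illustration, the assertions they encode can be spot-checked directly against DuckDB like this (table and column names assumed):

```python
# Ad-hoc DuckDB checks mirroring the dbt unique/not_null and custom
# fare-tolerance tests; names are illustrative.
import duckdb

con = duckdb.connect("lakehouse.duckdb")

# unique + not_null on the fact surrogate key
dupes = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT trip_key FROM fact_trips
        GROUP BY trip_key HAVING COUNT(*) > 1 OR trip_key IS NULL
    )
""").fetchone()[0]

# custom tolerance: fare components should sum to the total within one cent
bad_fares = con.execute("""
    SELECT COUNT(*) FROM fact_trips
    WHERE ABS(total_amount - (fare_amount + tip_amount + tolls_amount
                              + mta_tax + improvement_surcharge)) > 0.01
""").fetchone()[0]

assert dupes == 0 and bad_fares == 0, "Gold contract violated"
```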
What I learned
- Applying medallion architecture on commodity hardware.
- Promoting only contract-compliant rows forward in the pipeline.
- Authoring dbt models and tests in DuckDB for trusted marts.
Real-Time Stock Streaming to Snowflake
Simulates live market ticks, publishes them to Kafka, aggregates them into 1-second OHLCV bars, reconciles the bars against the raw stream, and micro-batches trusted bars into Snowflake with cost-aware warehouse usage.
Kafka · Python · Snowflake · Streamlit
Challenges
Coordinating multi-stage streaming
Split producer, consumer, and aggregator into separate roles with explicit JSON contracts, monitored end-to-end latency, and kept micro-batches backpressure-safe.
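A minimal sketch of the producer side of such a JSON contract using kafka-python; the topic name and message fields are assumptions, not the project's exact schema:

```python
# Every downstream stage relies on exactly these keys, so the contract is
# enforced at the point of publishing.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def publish_tick(symbol: str, price: float, size: int) -> None:
    tick = {
        "symbol": symbol,
        "price": price,
        "size": size,
        "ts": time.time_ns() // 1_000_000,  # epoch millis, used for latency tracking
    }
    producer.send("market.ticks", value=tick)

publish_tick("AAPL", 189.42, 100)
producer.flush()
```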
Cost & correctness in Snowflake
Used auto-suspend XS warehouses and wrote 5-row-per-minute aggregates instead of raw ticks, with reconciliation checks to verify bar accuracy.
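A sketch of the micro-batch write with the Snowflake Python connector, assuming placeholder connection settings, table, and column names:

```python
# Buffer a minute of 1-second OHLCV bars, then land them in a single
# executemany against an XS warehouse configured with AUTO_SUSPEND.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",          # placeholders; real values come from config/env
    user="etl_user",
    password="***",
    warehouse="XS_STREAMING_WH",   # XS size, auto-suspend enabled
    database="MARKET",
    schema="CURATED",
)

def write_bars(bars: list[tuple]) -> None:
    # bars: (symbol, bar_ts, open_px, high_px, low_px, close_px, volume)
    with conn.cursor() as cur:
        cur.executemany(
            """
            INSERT INTO ohlcv_1s
                (symbol, bar_ts, open_px, high_px, low_px, close_px, volume)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            """,
            bars,
        )
    conn.commit()
```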
Observability & validation
Added ingest latency and consumer lag metrics plus a replay validator comparing aggregated vs. raw streams for drift detection.
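A sketch of what such a replay validator can look like in pandas: recompute bars from the raw tick log and diff them against the live aggregates (frame shapes, column names, and tolerance are illustrative assumptions):

```python
# Recompute 1-second OHLC bars from raw ticks and report rows where the
# live aggregates drifted from the replay.
import pandas as pd

def validate_bars(
    raw_ticks: pd.DataFrame,   # columns: symbol, ts (datetime), price, size
    live_bars: pd.DataFrame,   # columns: symbol, ts, open, high, low, close
    tol: float = 1e-9,
) -> pd.DataFrame:
    recomputed = (
        raw_ticks
        .set_index("ts")
        .groupby("symbol")["price"]
        .resample("1s")
        .ohlc()
        .dropna()
        .reset_index()
    )
    merged = recomputed.merge(
        live_bars, on=["symbol", "ts"], suffixes=("_replay", "_live")
    )
    drift = (merged["close_replay"] - merged["close_live"]).abs() > tol
    return merged[drift]   # rows where live aggregates diverge from replay
```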
What I learned
- Designing event-driven pipelines with durable messaging and contracts.
- Balancing latency, cost, and fidelity when landing streaming data in a warehouse.
- Building lightweight validation to keep real-time aggregates trustworthy.