Skip to content

gordonmurray/apache_flink_and_iceberg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

12 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Modern Data Lakehouse with Real time CDC & Data Quality Monitoring

A complete data engineering project demonstrating modern data stack practices with Apache Flink, Iceberg, Trino and Superset.

Project Overview

This project showcases a modern data lakehouse architecture that captures, processes, stores, and monitors data in real-time. Built with industry standard tools and best practices, it demonstrates:

  • Real-time Change Data Capture (CDC) from mySQL.
  • Stream processing with Apache Flink
  • Modern data lake storage with Apache Iceberg format
  • High-performance analytics via Trino SQL engine
  • Automated data quality monitoring with validation using Soda
  • Interactive dashboards through Apache Superset

πŸ—οΈ Architecture

graph TB
    subgraph "Source Systems"
        MySQL[MySQL Database<br/>πŸ“Š Products & Sales Data]
    end

    subgraph "Stream Processing"
        Flink[Apache Flink<br/>πŸ”„ CDC & Stream Processing]
    end

    subgraph "Data Lake"
        Iceberg[Apache Iceberg<br/>πŸ—‚οΈ ACID Transactions]
        MinIO[MinIO Object Storage<br/>πŸ’Ύ S3-Compatible Storage]
    end

    subgraph "Analytics & Quality"
        Trino[Trino SQL Engine<br/>⚑ Fast Distributed Queries]
        Soda[Data Quality Monitor<br/>πŸ” Automated Validation]
    end

    subgraph "Visualization"
        Superset[Apache Superset<br/>πŸ“ˆ Interactive Dashboards]
    end

    MySQL -->|CDC Stream| Flink
    Flink -->|Write| Iceberg
    Iceberg -->|Store| MinIO
    Iceberg --> Trino
    Trino --> Superset
    Trino --> Soda

    style MySQL fill:#e1f5fe
    style Flink fill:#f3e5f5
    style Iceberg fill:#e8f5e8
    style Trino fill:#fff3e0
    style Soda fill:#fce4ec
    style Superset fill:#f1f8e9
Loading

Key Features

πŸ”„ Real-time Data Pipeline

  • MySQL CDC Source: Captures all changes (INSERT, UPDATE, DELETE) in real-time
  • Apache Flink Streaming: Processes data streams with exactly-once guarantees
  • Apache Iceberg: Modern table format with ACID transactions and time travel
  • MinIO Storage: High-performance, S3-compatible object storage

πŸ” Data Quality Assurance

  • Automated Quality Checks: 12+ comprehensive validation rules
  • Real-time Monitoring: Continuous data quality assessment every 5 minutes
  • Business Rule Validation: Custom checks for data integrity and business logic
  • Failure Detection: Immediate alerts for data quality issues

πŸ“Š Analytics & Visualization

  • Trino SQL Engine: Fast, distributed queries across data lake
  • Apache Superset: Interactive dashboards and data exploration
  • Real-time Insights: Up-to-the-minute analytics on streaming data

πŸš€ Quick Start

Prerequisites

  • Docker & Docker Compose
  • 8GB+ RAM recommended
  • 10GB+ free disk space

1. Launch the Stack

# Clone and start the complete data platform
git clone [email protected]:gordonmurray/apache_flink_and_iceberg.git
cd apache_flink_and_iceberg

# Start all services (takes 2-3 minutes)
docker compose up -d

# Verify all services are running
docker ps

2. Submit Flink CDC Jobs

# Submit the streaming CDC jobs
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/job.sql
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/products_streaming.sql
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/sales_streaming.sql

3. Verify Data Quality

Soda can actively push alerts to channels such as Slack, for this project is just writes to a log.

# Check data quality monitoring results
docker logs soda --tail 30

# Expected output: "πŸŽ‰ All data quality checks PASSED!"

🌐 Access Points

Service URL Purpose
Flink Web UI http://localhost:8081 Monitor streaming jobs
MinIO Console http://localhost:9001 Browse data lake storage
Trino UI http://localhost:8080 Query execution monitoring
Superset http://localhost:8088 Data visualization dashboards

Default Credentials:

  • MinIO: minio / minio123
  • Superset: admin / admin

πŸ“ˆ Demo Scenarios

Scenario 1: Real-time Data Ingestion

# Insert new sales data
docker exec mysql mysql -u root -prootpw -e "
  INSERT INTO appdb.sales (product_id, qty, price, sale_ts)
  VALUES (1, 5, 29.99, NOW());
"

# Watch data flow through pipeline (30-60 seconds)
docker logs soda --follow

# Query the new data via Trino
docker exec trino trino --execute "
  SELECT COUNT(*) as total_sales
  FROM iceberg.demo.sales;
"

Scenario 2: Data Quality Validation

# Insert invalid data (negative price)
docker exec mysql mysql -u root -prootpw -e "
  INSERT INTO appdb.sales (product_id, qty, price, sale_ts)
  VALUES (2, 3, -15.00, NOW());
"

# Watch data quality alerts trigger
docker logs soda --tail 50

# Expected: "⚠️ X checks FAILED - Review data quality issues"

Scenario 3: Analytics & Insights

# Run analytics queries via Trino
docker exec trino trino --execute "
  SELECT
    p.name as product_name,
    COUNT(*) as sales_count,
    SUM(s.qty * s.price) as total_revenue
  FROM iceberg.demo.sales s
  JOIN iceberg.demo.products p ON s.product_id = p.id
  GROUP BY p.name
  ORDER BY total_revenue DESC;
"

πŸ“Š Data Quality Monitoring

The data quality system validates:

Products Table

  • βœ… Data Completeness: No null IDs, SKUs, or names
  • βœ… Uniqueness: Duplicate detection for primary/business keys
  • βœ… Format Validation: SKU pattern compliance (P-XXX format)
  • βœ… Referential Integrity: Valid relationships maintained

Sales Table

  • βœ… Business Rules: Positive quantities and prices only
  • βœ… Data Freshness: Recent transactions within 24 hours
  • βœ… Outlier Detection: Reasonable quantity limits (≀100)
  • βœ… Complete Records: No missing required fields

πŸ“‹ For detailed usage: See Soda_guide.md

πŸ“ˆ Dashboards & Analytics

Apache Superset provides:

  • Sales Performance Dashboards: Revenue trends, top products
  • Data Quality Metrics: Real-time quality score tracking
  • Operational Monitoring: Pipeline health and throughput
  • Interactive Exploration: Ad-hoc data analysis

πŸ“‹ For dashboard setup: See Superset_guide.md

πŸ“ Project Structure

β”œβ”€β”€ docker-compose.yml          # Complete stack orchestration
β”œβ”€β”€ Dockerfile                  # Custom Flink image with CDC connectors
β”œβ”€β”€ README.md                   # This comprehensive guide
β”œβ”€β”€ Soda_guide.md              # Data quality monitoring guide
β”œβ”€β”€ Superset_guide.md          # Dashboard and visualization guide
β”œβ”€β”€ jobs/                      # Flink SQL job definitions
β”‚   β”œβ”€β”€ job.sql                # Main CDC pipeline setup
β”‚   β”œβ”€β”€ products_streaming.sql # Products CDC streaming job
β”‚   └── sales_streaming.sql    # Sales CDC streaming job
β”œβ”€β”€ sql/                       # Database initialization
β”‚   β”œβ”€β”€ init.sql              # Sample data and schema
β”‚   └── mariadb.cnf           # MySQL configuration
β”œβ”€β”€ soda/                      # Data quality monitoring
β”‚   β”œβ”€β”€ data_quality_monitor.py # Custom monitoring solution
β”‚   β”œβ”€β”€ configuration.yml     # Trino connection settings
β”‚   β”œβ”€β”€ checks.yml            # Quality check definitions
β”‚   └── Dockerfile           # Monitoring container setup
β”œβ”€β”€ trino/                     # Trino configuration
β”‚   β”œβ”€β”€ etc/catalog/          # Iceberg catalog configuration
β”‚   └── lib/                  # MySQL connector JAR
β”œβ”€β”€ superset/                  # Superset configuration
β”‚   β”œβ”€β”€ docker-entrypoint.sh # Initialization script
β”‚   └── setup_dashboard.py   # Automated dashboard creation
└── minio/                     # MinIO setup
    └── create-bucket.sh      # Storage bucket initialization

πŸ§ͺ Testing & Validation

End-to-End Pipeline Test

# 1. Insert test data
docker exec mysql mysql -u root -prootpw -e "
  INSERT INTO appdb.products (sku, name) VALUES ('P-999', 'Test Product');
  INSERT INTO appdb.sales (product_id, qty, price, sale_ts) VALUES (1, 10, 99.99, NOW());
"

# 2. Verify data pipeline (wait 60 seconds)
sleep 60

# 3. Check data quality
docker exec soda python data_quality_monitor.py

# 4. Query final results
docker exec trino trino --execute "
  SELECT COUNT(*) FROM iceberg.demo.products;
  SELECT COUNT(*) FROM iceberg.demo.sales;
"

Performance Benchmarks

  • Ingestion Rate: 1000+ records/second
  • Query Latency: Sub-second for most analytics
  • Data Freshness: 30-60 second end-to-end latency
  • Quality Validation: 12 checks in <5 seconds

πŸ› οΈ Customization Guide

Adding New Data Sources

  1. Create CDC connector configuration in jobs/
  2. Define Iceberg target tables
  3. Add quality checks in soda/data_quality_monitor.py
  4. Update Superset dashboards

About

A complete data engineering project demonstrating modern data stack practices with Apache Flink, Iceberg, Trino and Superset

Topics

Resources

License

Stars

Watchers

Forks