A complete data engineering project demonstrating modern data stack practices with Apache Flink, Iceberg, Trino and Superset.
This project showcases a modern data lakehouse architecture that captures, processes, stores, and monitors data in real-time. Built with industry standard tools and best practices, it demonstrates:
- Real-time Change Data Capture (CDC) from mySQL.
- Stream processing with Apache Flink
- Modern data lake storage with Apache Iceberg format
- High-performance analytics via Trino SQL engine
- Automated data quality monitoring with validation using Soda
- Interactive dashboards through Apache Superset
graph TB
subgraph "Source Systems"
MySQL[MySQL Database<br/>π Products & Sales Data]
end
subgraph "Stream Processing"
Flink[Apache Flink<br/>π CDC & Stream Processing]
end
subgraph "Data Lake"
Iceberg[Apache Iceberg<br/>ποΈ ACID Transactions]
MinIO[MinIO Object Storage<br/>πΎ S3-Compatible Storage]
end
subgraph "Analytics & Quality"
Trino[Trino SQL Engine<br/>β‘ Fast Distributed Queries]
Soda[Data Quality Monitor<br/>π Automated Validation]
end
subgraph "Visualization"
Superset[Apache Superset<br/>π Interactive Dashboards]
end
MySQL -->|CDC Stream| Flink
Flink -->|Write| Iceberg
Iceberg -->|Store| MinIO
Iceberg --> Trino
Trino --> Superset
Trino --> Soda
style MySQL fill:#e1f5fe
style Flink fill:#f3e5f5
style Iceberg fill:#e8f5e8
style Trino fill:#fff3e0
style Soda fill:#fce4ec
style Superset fill:#f1f8e9
- MySQL CDC Source: Captures all changes (INSERT, UPDATE, DELETE) in real-time
- Apache Flink Streaming: Processes data streams with exactly-once guarantees
- Apache Iceberg: Modern table format with ACID transactions and time travel
- MinIO Storage: High-performance, S3-compatible object storage
- Automated Quality Checks: 12+ comprehensive validation rules
- Real-time Monitoring: Continuous data quality assessment every 5 minutes
- Business Rule Validation: Custom checks for data integrity and business logic
- Failure Detection: Immediate alerts for data quality issues
- Trino SQL Engine: Fast, distributed queries across data lake
- Apache Superset: Interactive dashboards and data exploration
- Real-time Insights: Up-to-the-minute analytics on streaming data
- Docker & Docker Compose
- 8GB+ RAM recommended
- 10GB+ free disk space
# Clone and start the complete data platform
git clone [email protected]:gordonmurray/apache_flink_and_iceberg.git
cd apache_flink_and_iceberg
# Start all services (takes 2-3 minutes)
docker compose up -d
# Verify all services are running
docker ps# Submit the streaming CDC jobs
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/job.sql
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/products_streaming.sql
docker exec jobmanager /opt/flink/bin/sql-client.sh -f /opt/flink/sales_streaming.sqlSoda can actively push alerts to channels such as Slack, for this project is just writes to a log.
# Check data quality monitoring results
docker logs soda --tail 30
# Expected output: "π All data quality checks PASSED!"| Service | URL | Purpose |
|---|---|---|
| Flink Web UI | http://localhost:8081 | Monitor streaming jobs |
| MinIO Console | http://localhost:9001 | Browse data lake storage |
| Trino UI | http://localhost:8080 | Query execution monitoring |
| Superset | http://localhost:8088 | Data visualization dashboards |
Default Credentials:
- MinIO:
minio/minio123 - Superset:
admin/admin
# Insert new sales data
docker exec mysql mysql -u root -prootpw -e "
INSERT INTO appdb.sales (product_id, qty, price, sale_ts)
VALUES (1, 5, 29.99, NOW());
"
# Watch data flow through pipeline (30-60 seconds)
docker logs soda --follow
# Query the new data via Trino
docker exec trino trino --execute "
SELECT COUNT(*) as total_sales
FROM iceberg.demo.sales;
"# Insert invalid data (negative price)
docker exec mysql mysql -u root -prootpw -e "
INSERT INTO appdb.sales (product_id, qty, price, sale_ts)
VALUES (2, 3, -15.00, NOW());
"
# Watch data quality alerts trigger
docker logs soda --tail 50
# Expected: "β οΈ X checks FAILED - Review data quality issues"# Run analytics queries via Trino
docker exec trino trino --execute "
SELECT
p.name as product_name,
COUNT(*) as sales_count,
SUM(s.qty * s.price) as total_revenue
FROM iceberg.demo.sales s
JOIN iceberg.demo.products p ON s.product_id = p.id
GROUP BY p.name
ORDER BY total_revenue DESC;
"The data quality system validates:
- β Data Completeness: No null IDs, SKUs, or names
- β Uniqueness: Duplicate detection for primary/business keys
- β Format Validation: SKU pattern compliance (P-XXX format)
- β Referential Integrity: Valid relationships maintained
- β Business Rules: Positive quantities and prices only
- β Data Freshness: Recent transactions within 24 hours
- β Outlier Detection: Reasonable quantity limits (β€100)
- β Complete Records: No missing required fields
π For detailed usage: See Soda_guide.md
Apache Superset provides:
- Sales Performance Dashboards: Revenue trends, top products
- Data Quality Metrics: Real-time quality score tracking
- Operational Monitoring: Pipeline health and throughput
- Interactive Exploration: Ad-hoc data analysis
π For dashboard setup: See Superset_guide.md
βββ docker-compose.yml # Complete stack orchestration
βββ Dockerfile # Custom Flink image with CDC connectors
βββ README.md # This comprehensive guide
βββ Soda_guide.md # Data quality monitoring guide
βββ Superset_guide.md # Dashboard and visualization guide
βββ jobs/ # Flink SQL job definitions
β βββ job.sql # Main CDC pipeline setup
β βββ products_streaming.sql # Products CDC streaming job
β βββ sales_streaming.sql # Sales CDC streaming job
βββ sql/ # Database initialization
β βββ init.sql # Sample data and schema
β βββ mariadb.cnf # MySQL configuration
βββ soda/ # Data quality monitoring
β βββ data_quality_monitor.py # Custom monitoring solution
β βββ configuration.yml # Trino connection settings
β βββ checks.yml # Quality check definitions
β βββ Dockerfile # Monitoring container setup
βββ trino/ # Trino configuration
β βββ etc/catalog/ # Iceberg catalog configuration
β βββ lib/ # MySQL connector JAR
βββ superset/ # Superset configuration
β βββ docker-entrypoint.sh # Initialization script
β βββ setup_dashboard.py # Automated dashboard creation
βββ minio/ # MinIO setup
βββ create-bucket.sh # Storage bucket initialization
# 1. Insert test data
docker exec mysql mysql -u root -prootpw -e "
INSERT INTO appdb.products (sku, name) VALUES ('P-999', 'Test Product');
INSERT INTO appdb.sales (product_id, qty, price, sale_ts) VALUES (1, 10, 99.99, NOW());
"
# 2. Verify data pipeline (wait 60 seconds)
sleep 60
# 3. Check data quality
docker exec soda python data_quality_monitor.py
# 4. Query final results
docker exec trino trino --execute "
SELECT COUNT(*) FROM iceberg.demo.products;
SELECT COUNT(*) FROM iceberg.demo.sales;
"- Ingestion Rate: 1000+ records/second
- Query Latency: Sub-second for most analytics
- Data Freshness: 30-60 second end-to-end latency
- Quality Validation: 12 checks in <5 seconds
- Create CDC connector configuration in
jobs/ - Define Iceberg target tables
- Add quality checks in
soda/data_quality_monitor.py - Update Superset dashboards