- Clone the repository
- Set up AWS and Snowflake credentials (see
config/for templates) - Prepare data in Snowflake using provided SQL script
- Train models with provided Jupyter notebooks/scripts
- (Optional) Deploy model to AWS SageMaker endpoint
| Metric | Value |
|---|---|
| Accuracy | 92.1% |
| AUC-ROC | 0.89 |
| Data Volume | 2 million rows processed |
| Training Time | ~12 minutes on ml.m5.large |
All metrics are from the provided sample dataset (replace with real values if available).
An end-to-end, production-ready pipeline that ingests raw transaction data into Snowflake, trains an XGBoost fraud-detection model on SageMaker, and serves real-time predictions via a Flask API.
- Overview
- Architecture
- Installation & Quick Start
- Usage
- Quantifiable Metrics
- Project Structure
- Running Tests
- Contributing
- License
A modular, production-style machine learning pipeline integrating AWS SageMaker for scalable model training and Snowflake as a cloud data warehouse, designed for real-world enterprise data science workflows. Financial fraud detection requires both accurate models and reliable, scalable infrastructure. This project demonstrates:
- Data Ingestion: Copy raw transaction CSVs into a Snowflake staging table.
- Feature Engineering & Processing: Run a SageMaker Processing job that cleans and transforms data.
- Model Training: Train an XGBoost classifier on SageMaker (ml.m5.2xlarge), achieving high AUC.
- Real-time Inference: Host the trained model on a SageMaker endpoint and expose a Flask API on EC2 for predictions.
- Dashboard Integration (Optional): Serve inference results to a downstream dashboard or service.
- Raw Data (S3) → Snowflake
- Snowflake X-Small warehouse loads and stores raw transactions.
- SageMaker Processing
- Pulls from Snowflake, applies feature engineering, and writes processed data back.
- SageMaker Training
- Reads processed data, trains an XGBoost model, and saves the artifact to S3.
- SageMaker Endpoint
- Deploys the model for real-time inference (ml.t2.medium).
- Flask API (EC2 t3.medium)
- Forwards JSON transaction payloads to the SageMaker endpoint and returns fraud probability.
- Dashboard / Client
- Consumes API responses to visualize fraud scores.
- Docker (≥ 19.x)
- AWS credentials with permissions for SageMaker, S3, and IAM
- Snowflake account +
SNOWFLAKE_ACCOUNT,SNOWFLAKE_USER,SNOWFLAKE_PASSWORD, and role with privileges
git clone https://github.com/Trojan3877/AWS-SageMaker-Snowflake-ML-Pipeline.git
cd AWS-SageMaker-Snowflake-ML-Pipeline