AWS SageMaker + Snowflake ML Pipeline

The **AWS SageMaker + Snowflake ML Pipeline** is a production-grade, end-to-end machine learning workflow that ingests large-scale data from Snowflake, performs feature engineering with Apache Spark, and trains, tunes, and deploys models on AWS SageMaker, all orchestrated and versioned with CI/CD, Terraform, and Ansible.


Quick Start

  1. Clone the repository
  2. Set up AWS and Snowflake credentials (see config/ for templates)
  3. Prepare data in Snowflake using the provided SQL script (see the sketch after this list)
  4. Train models with the provided Jupyter notebooks/scripts
  5. (Optional) Deploy the model to an AWS SageMaker endpoint
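
As one illustration of step 3, a minimal sketch that runs a data-prep script against the warehouse with snowflake-connector-python; the script path sql/prepare_data.sql is a hypothetical name, not the repo's actual layout:

```python
import os
import snowflake.connector

# Connect using the credentials described under Prerequisites
# (the SNOWFLAKE_* env var names come from this README).
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)

# execute_string runs each ';'-separated statement in the script.
with open("sql/prepare_data.sql") as f:  # hypothetical path
    conn.execute_string(f.read())

conn.close()
```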

Results

| Metric        | Value                      |
| ------------- | -------------------------- |
| Accuracy      | 92.1%                      |
| AUC-ROC       | 0.89                       |
| Data Volume   | 2 million rows processed   |
| Training Time | ~12 minutes on ml.m5.large |

All metrics are from the provided sample dataset (replace with real values if available).

An end-to-end, production-ready pipeline that ingests raw transaction data into Snowflake, trains an XGBoost fraud-detection model on SageMaker, and serves real-time predictions via a Flask API.


Table of Contents

  1. Overview
  2. Architecture
  3. Installation & Quick Start
  4. Usage
  5. Quantifiable Metrics
  6. Project Structure
  7. Running Tests
  8. Contributing
  9. License

Overview

A modular, production-style machine learning pipeline integrating AWS SageMaker for scalable model training and Snowflake as a cloud data warehouse, designed for real-world enterprise data science workflows. Financial fraud detection requires both accurate models and reliable, scalable infrastructure. This project demonstrates:

  • Data Ingestion: Copy raw transaction CSVs into a Snowflake staging table.
  • Feature Engineering & Processing: Run a SageMaker Processing job that cleans and transforms data.
  • Model Training: Train an XGBoost classifier on SageMaker (ml.m5.2xlarge), reaching 0.89 AUC-ROC on the sample dataset (a launch sketch follows this list).
  • Real-time Inference: Host the trained model on a SageMaker endpoint and expose a Flask API on EC2 for predictions.
  • Dashboard Integration (Optional): Serve inference results to a downstream dashboard or service.
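
A minimal sketch of launching the training step with the SageMaker Python SDK; the bucket paths, IAM role ARN, and hyperparameter values are illustrative assumptions, not the repo's actual configuration:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

# Built-in XGBoost container for this region.
image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",        # instance type named in the Overview
    output_path="s3://my-bucket/model/",  # placeholder output location
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    objective="binary:logistic",  # fraud / not-fraud
    eval_metric="auc",
    num_round=200,
)

# Processed CSVs written by the Processing step (placeholder S3 prefixes).
estimator.fit({
    "train": TrainingInput("s3://my-bucket/processed/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/processed/val/", content_type="text/csv"),
})
```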

Architecture

Pipeline Architecture

  1. Raw Data (S3) → Snowflake
    • Snowflake X-Small warehouse loads and stores raw transactions.
  2. SageMaker Processing
    • Pulls from Snowflake, applies feature engineering, and writes processed data back.
  3. SageMaker Training
    • Reads processed data, trains an XGBoost model, and saves the artifact to S3.
  4. SageMaker Endpoint
    • Deploys the model for real-time inference (ml.t2.medium).
  5. Flask API (EC2 t3.medium)
    • Forwards JSON transaction payloads to the SageMaker endpoint and returns a fraud probability (see the API sketch after this list).
  6. Dashboard / Client
    • Consumes API responses to visualize fraud scores.
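
A minimal sketch of the Flask layer in step 5, assuming an endpoint named fraud-xgb-endpoint and a comma-separated feature payload; both the endpoint name and the payload shape are assumptions:

```python
import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = "fraud-xgb-endpoint"  # placeholder endpoint name

@app.route("/predict", methods=["POST"])
def predict():
    # Expect {"features": [v1, v2, ...]} and forward it as CSV,
    # the format the built-in XGBoost container accepts.
    features = request.get_json()["features"]
    body = ",".join(str(v) for v in features)
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=body,
    )
    score = float(response["Body"].read())
    return jsonify({"fraud_probability": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```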

Installation & Quick Start

Prerequisites

  • Docker (≥ 19.x)
  • AWS credentials with permissions for SageMaker, S3, and IAM
  • A Snowflake account, with SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, and SNOWFLAKE_PASSWORD exported in the environment, plus a role with the required privileges
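
For illustration, a quick sanity check that both credential sets are in place before running anything; a minimal sketch, with nothing repo-specific assumed:

```python
import os
import boto3

# The Snowflake env var names come from this README's Prerequisites.
for var in ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD"):
    assert os.environ.get(var), f"{var} is not set"

# Confirms the AWS credentials in scope resolve to an identity.
print(boto3.client("sts").get_caller_identity()["Arn"])
```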

Clone & Build

```bash
git clone https://github.com/Trojan3877/AWS-SageMaker-Snowflake-ML-Pipeline.git
cd AWS-SageMaker-Snowflake-ML-Pipeline
```
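
From there, the feature-engineering step (Architecture step 2) can be launched with the SageMaker Python SDK; a minimal sketch, where the preprocess.py script name, bucket prefixes, and role ARN are assumptions:

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Clean and transform the raw extract, writing processed CSVs back to S3.
processor.run(
    code="preprocess.py",  # hypothetical script name
    inputs=[ProcessingInput(
        source="s3://my-bucket/raw/transactions/",  # placeholder raw extract
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-bucket/processed/",    # placeholder output prefix
    )],
)
```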
