BoardGame RAG - Retrieval-Augmented Chatbot for BoardGameGeek-scale Data

Overview

BoardGame RAG is a serverless, AWS-native Retrieval-Augmented Generation system that ingests 150k+ board game entries from CSV, enriches them with XML metadata from the BoardGameGeek API, and attaches PDF rulebooks gathered by a crawler. A web app (Amplify) provides an authenticated chat interface where questions are answered with cited sources, using AWS Kendra for retrieval and Bedrock foundation models (Claude, Jurassic-2, Titan) for generation.

This README turns the assignment write-up into actionable documentation for contributors and reviewers.

Features

  • 🔐 Cognito auth (username/password) for the web app
  • 🧠 RAG with Kendra retrieval + Bedrock generation (Claude default)
  • 📚 Multi-modal corpora: CSV (ranks), XML (descriptions, awards, players), PDFs (rulebooks)
  • 🧩 Vector support via OpenSearch Serverless (chunk size 300 tokens) for experimentation
  • 🧰 LangChain orchestration inside Lambda (RetrievalQA, PromptTemplate per model)
  • 🗄️ RDS PostgreSQL for interaction data and crawler outputs
  • 🪄 AWS Glue transforms and Glue Data Catalog for schema/lineage
  • 📦 IaC using CloudFormation/SAM; Lambda packaged as container images (ECR)
  • 🚀 CI/CD with CodeBuild + CodePipeline (source: GitHub)
  • 📈 Observability via CloudWatch; budgets via AWS Budgets

Architecture

RAG Flow

  1. The user submits a prompt from the Amplify app (Cognito-authenticated).
  2. A Lambda orchestrator converts the prompt to embeddings and queries Kendra (semantic search over PDFs, XML, and tables).
  3. The relevant context is fed into a Bedrock model (Claude by default).
  4. A response is generated with citations and returned to the UI.
  5. Interactions and telemetry are optionally stored in RDS for analysis.
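The flow above can be sketched as a minimal Lambda handler. This is an illustration, not the repo's actual code: the handler name, the `<kendra-index-id>` placeholder, and the model id are assumptions; Kendra's Retrieve API supplies passages and Bedrock generates the cited answer.

```python
import json

def build_prompt(question, passages):
    """Number each retrieved passage so the model can cite it as [n]."""
    context = "\n\n".join(
        f"[{i + 1}] ({p['source']}) {p['text']}" for i, p in enumerate(passages)
    )
    return (
        "Answer using only the numbered context below and cite sources like [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def handler(event, _context):
    import boto3  # lazy import keeps build_prompt testable without AWS
    question = json.loads(event["body"])["prompt"]

    # Step 2: semantic retrieval over the Kendra index (PDFs, XML, tables).
    kendra = boto3.client("kendra")
    hits = kendra.retrieve(IndexId="<kendra-index-id>", QueryText=question)
    passages = [{"source": r["DocumentURI"], "text": r["Content"]}
                for r in hits["ResultItems"][:5]]

    # Steps 3-4: feed context to a Bedrock model (Claude shown) and return citations.
    bedrock = boto3.client("bedrock-runtime")
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({
            "prompt": f"\n\nHuman: {build_prompt(question, passages)}\n\nAssistant:",
            "max_tokens_to_sample": 400,
        }),
    )
    answer = json.loads(resp["body"].read())["completion"]
    return {"statusCode": 200,
            "body": json.dumps({"answer": answer,
                                "sources": [p["source"] for p in passages]})}
```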

Data Pipeline

  • Weekly (EventBridge → Step Functions):

    1. Download bg_ranks CSV → S3 (<your-s3-raw-bucket>)
    2. Transform via Glue (Spark 3.3; clean decimals, rename bayesaverage/average) → S3 processed
    3. Load into RDS (copy/upsert strategy)
    4. Crawl rulebook URLs (Lambda) → store URLs in boardgame_url (RDS)
    5. Download PDFs to S3 public
    6. (Re)index Kendra and optionally OpenSearch vectors
  • Monthly: refresh the full BoardGameGeek CSV dump.
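The transform in step 2 can be mimicked in plain Python for local testing. The real job runs on Glue (Spark 3.3); the renamed column targets and the decimal-comma normalization below are assumptions.

```python
# Hypothetical stand-in for the Glue transform: clean decimal fields and
# rename bayesaverage/average (target names are assumed).
import csv
import io

RENAMES = {"bayesaverage": "bayes_average", "average": "avg_rating"}

def clean_rows(raw_csv: str):
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        out = {RENAMES.get(k, k): v for k, v in row.items()}
        for col in ("bayes_average", "avg_rating"):
            if col in out:
                # Normalize decimal commas and parse to float.
                out[col] = round(float(out[col].replace(",", ".")), 3)
        rows.append(out)
    return rows
```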

Tech Stack

  • AWS: Amplify, Cognito, API Gateway, Lambda (Python 3.9), Bedrock, Kendra, OpenSearch Serverless, RDS Postgres, S3, Glue, Step Functions, EventBridge, ECR, CloudWatch, Budgets
  • Python: LangChain, requests, urllib3, psycopg2
  • Frontend: React (Amplify UI), minimal chat client with model/temperature/token controls
  • IaC: CloudFormation/SAM; Docker images for Lambda (ECR)

Infrastructure as Code

  • Reproducible environments via CloudFormation templates.
  • Lambda functions are packaged as container images to pin layers/dependencies and avoid runtime drift.
  • Example resources defined: Cognito User Pool/Client, Kendra Index & data sources (S3, RDS), OpenSearch collection, RDS instance/SGs, S3 buckets (raw/processed/public), Step Functions, EventBridge rules, IAM roles/policies (least privilege).

Local Development

  1. Prereqs: Docker, Python 3.9, AWS CLI v2, SAM CLI, Node.js (for app)
  2. Clone repo and create virtualenv.
  3. Copy .env.example to .env and fill in values (or use AWS SSM/Secrets Manager in dev).
  4. Run unit tests; run SAM local for Lambda where feasible.
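For step 3, a small config helper can read from the environment first and fall back to SSM Parameter Store. This is a hypothetical sketch; the parameter path prefix is an assumption.

```python
import os

def get_setting(name, ssm_client=None):
    """Return a setting from the environment, else from SSM (assumed path prefix)."""
    value = os.environ.get(name)
    if value is not None:
        return value
    if ssm_client is None:
        import boto3  # imported lazily so env-only lookups need no AWS
        ssm_client = boto3.client("ssm")
    param = ssm_client.get_parameter(Name=f"/boardgame-rag/{name}", WithDecryption=True)
    return param["Parameter"]["Value"]
```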

Backend

```bash
make venv && source .venv/bin/activate
pip install -r requirements.txt
```

Lint & test

```bash
pytest -q
```

SAM local (example)

```bash
sam build && sam local invoke OrchestratorFunction \
  --env-vars sam.local.env.json \
  --event tests/fixtures/prompt_event.json
```

Frontend (Amplify app)

```bash
cd app && npm i && npm run dev
```

Deployment

Backend (Lambda/SAM)

```bash
sam build
sam deploy \
  --stack-name boardgame-rag \
  --resolve-s3 \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
  --parameter-overrides Env=prod
```

Frontend (Amplify)

  • Connect the GitHub repo in Amplify Console.
  • Configure build settings via amplify.yml or Amplify defaults.
  • Set environment variables (Cognito pool id, API base URL).
  • On push to main, Amplify builds and deploys.

Kendra / OpenSearch / RDS / S3

  • Create Kendra Index and Data Sources (S3 buckets for XML/PDF; RDS connector for tables).
  • Ensure KMS encryption keys and IAM roles are wired.
  • For OpenSearch Serverless, define collection, vector index (dimensionality per embedding model), and IAM access.
  • RDS Postgres: apply schema migrations for tables: boardgames, boardgame_url, interactions, etc.
  • S3 buckets: raw, processed, public-pdf with distinct access policies.
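The 300-token chunking mentioned under Features can be sketched as below. Whitespace-split words stand in for real tokenizer tokens, and the overlap value is an assumption.

```python
def chunk_tokens(text, size=300, overlap=30):
    """Split text into overlapping windows of `size` tokens for vector indexing."""
    words = text.split()
    step = max(size - overlap, 1)
    return [" ".join(words[i:i + size])
            for i in range(0, len(words), step)
            if words[i:i + size]]
```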

IAM follows least privilege. Deny-all boundary where possible; grant task-specific actions only.

Usage

  • Navigate to the Amplify URL, sign in via Cognito.
  • Choose model, set temperature, token limit, and ask a question (e.g., “Explain the setup for Terraforming Mars; cite the rulebook.”).
  • The UI displays an answer and a list of source documents (PDF/XML/rows).

API (example)

```http
POST /chat
Content-Type: application/json
Authorization: <Cognito ID token>

{
  "prompt": "How many players can play Azul?",
  "model": "anthropic.claude-3-sonnet",
  "temperature": 0.2,
  "max_tokens": 400
}
```
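A client-side sketch for calling this endpoint; the helper names are hypothetical, and the API base URL and token come from your Amplify/Cognito setup.

```python
def build_chat_request(prompt, model="anthropic.claude-3-sonnet",
                       temperature=0.2, max_tokens=400):
    """Build the JSON body shown in the example above."""
    return {"prompt": prompt, "model": model,
            "temperature": temperature, "max_tokens": max_tokens}

def ask(api_base, id_token, prompt):
    import requests  # imported lazily; only needed for the actual call
    resp = requests.post(f"{api_base}/chat",
                         headers={"Authorization": id_token},
                         json=build_chat_request(prompt), timeout=30)
    resp.raise_for_status()
    return resp.json()
```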

Evaluation

Evaluation is performed with a judge LLM (e.g., Llama 2) inside Bedrock where applicable, across:

  • Accuracy (reference agreement)
  • Toxicity (safety)
  • Robustness (input paraphrase stability)

Datasets: BoolQ, Natural Questions, TriviaQA (subset samplers committed to /eval).

Illustrative results (Claude 2)

  • BoolQ: 0.00022
  • Natural Questions: 0.0741
  • TriviaQA: 0.133

Action items: expand domain-specific eval set (boardgame Q&A), tune retrieval (chunking, hybrid keyword+semantic), calibrate prompts per model via PromptTemplate.
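The robustness criterion (paraphrase stability) can be illustrated with a toy scorer: the fraction of paraphrase answers that agree with the canonical answer. The project scores this with a judge LLM, so exact string matching here is only a stand-in.

```python
def normalize(ans):
    """Case-fold and collapse whitespace before comparing answers."""
    return " ".join(ans.lower().split())

def robustness(canonical, paraphrase_answers):
    """Fraction of paraphrase answers matching the canonical answer."""
    if not paraphrase_answers:
        return 0.0
    ref = normalize(canonical)
    return sum(normalize(a) == ref for a in paraphrase_answers) / len(paraphrase_answers)
```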

Security

  • Network: Private subnets for Lambdas; NAT for egress; VPC endpoints to S3/RDS.
  • CORS: restricted origins for the web app domains.
  • Encryption: KMS for Kendra, S3, RDS (at rest) and TLS in transit.
  • IAM: fine-grained roles per Lambda; no wildcard * in production.
  • PII/Secrets: keep in SSM/Secrets Manager; never log secrets.

CI/CD

  • GitHub → CodeBuild (builds zips/images → S3/ECR) → CodePipeline/CodeDeploy → Lambda.
  • Amplify auto-deploys the frontend on branch push.
  • CloudWatch captures build logs; alarms on failures.

Monitoring & Cost Control

  • CloudWatch metrics/alarms for Lambda durations, errors, throttles; Kendra query/ingest metrics.
  • AWS Budgets with email alerts at 50%/80%/100% thresholds.
  • Consider Bedrock concurrency and max tokens guardrails.

Data Lineage

  • Glue Crawlers populate Glue Data Catalog across S3 paths and RDS tables.
  • Track provenance from raw CSV → processed S3 → RDS → Kendra/OpenSearch indexes.

Roadmap

  • Graph augmentation via Amazon Neptune for relationships (designers, publishers, expansions)
  • Domain eval set (gold Q&A) + RAGAS/ELO style metrics
  • EC2-based long-running crawlers to bypass Lambda 15-min limit
  • UI polish: citations preview, doc viewer, feedback thumbs with reason
  • Hybrid retrieval (BM25 + dense) and re-ranking
  • Multi-region data residency (EU) with per-region Kendra indexes

Troubleshooting

  • Kendra returns no results: verify data sources sync status, IAM for Kendra to S3/RDS, file formats (PDF OCR).
  • Lambda in VPC can’t reach the internet: ensure route tables and NAT Gateway configured; security groups allow egress.
  • CORS errors: align API Gateway/Amplify domains in allowlist.
  • OpenSearch 403: check collection policies and IAM identity mapping.

Acknowledgements

  • Data: BoardGameGeek dumps and API.
