Forbes AI 50 - PE Dashboard (Project ORBIT)

Automated data pipeline for scraping and analyzing Forbes AI 50 companies for private equity intelligence.

Submission Deliverables

GitHub repo : https://github.com/Team-01-DAMG-7245/pe-dashboard-ai50
EVAL.md : https://github.com/Team-01-DAMG-7245/pe-dashboard-ai50/blob/swara/EVAL.md
Demo video : https://drive.google.com/file/d/1abFljwrx1lSxF5LOVPup7tTnjKDvDKF8/view?usp=sharing
Youtube video : https://youtu.be/BzeN0LC2-8Q
Reflection.md for Evaluation : https://github.com/Team-01-DAMG-7245/pe-dashboard-ai50/blob/main/REFLECTION.md

Quick Start

Run Airflow (Docker)

docker compose up
# Access UI: http://localhost:8080 (admin/admin)

Run App Locally (Dev)

python -m venv airflow_env
source airflow_env/bin/activate
pip install -r requirements.txt
uvicorn src.api:app --reload        # http://localhost:8000
streamlit run src/streamlit_app.py  # http://localhost:8501

Project Structure

├── dags/                  # Airflow DAGs (Labs 2-3)
├── data/
│   ├── forbes_ai50_seed.json   # Company list (Lab 0)
│   ├── raw/                     # Scraped HTML/text (Lab 1)
│   ├── structured/              # Pydantic models (Lab 5)
│   └── payloads/                # Dashboard payloads (Lab 6)
├── src/
│   ├── api.py                   # FastAPI endpoints (Lab 7-8)
│   ├── models.py                # Pydantic schemas (Lab 5)
│   ├── s3_utils.py              # Cloud storage (Lab 1)
│   └── streamlit_app.py         # Dashboard UI (Lab 10)
├── requirements.txt
├── docker-compose.yml
└── README.md

Setup

1. AWS S3 Configuration

aws configure  # Enter your credentials
export AWS_BUCKET_NAME=quanta-ai50-data

2. Seed File

Populate data/forbes_ai50_seed.json with Forbes AI 50 companies from https://www.forbes.com/lists/ai50/

3. Environment Variables

Create .env:

AWS_BUCKET_NAME=quanta-ai50-data
OPENAI_API_KEY=your-api-key-here

Running Labs 4-9

Prerequisites

Create .env file in project root with:
```
OPENAI_API_KEY=your-api-key-here
```
Install dependencies:
```
pip install -r requirements.txt
```

Note: All scripts automatically load API keys from .env file. No need to set environment variables manually.

Step 1: Lab 4 - Vector DB & RAG Index

Build Vector DB Index:

python src\lab4\index_for_rag_all.py

Expected Output: Vector DB should have 294+ chunks indexed.

Verify:

Get-ChildItem data\vector_db -Recurse | Measure-Object

Step 2: Lab 5 - Structured Extraction

Extract Structured Data:

python src\lab5\structured_extraction.py

Expected Output: At least 5 companies with structured data in data/structured/.

Verify:

Get-ChildItem data\structured -Filter *.json | Measure-Object

Step 3: Lab 6 - Payload Assembly

Assemble Payloads:

python src\lab6\assemble_payloads.py

Expected Output: Payloads assembled for all companies with structured data in data/payloads/.

Verify:

Get-ChildItem data\payloads -Filter *.json | Measure-Object

Step 4: Lab 7 & 8 - Dashboard API

Start API Server:

.\start_api_server.ps1

Note: Make sure you have a .env file in the project root with:

OPENAI_API_KEY=your-api-key-here

The server will start on http://localhost:8002

Access API:

Swagger UI: http://localhost:8002/docs (Interactive API testing)
Health Check: http://localhost:8002/health
RAG Dashboard: POST http://localhost:8002/dashboard/rag
Structured Dashboard: POST http://localhost:8002/dashboard/structured

Test API Endpoints in Swagger UI:

Lab 7 - RAG Dashboard (POST /dashboard/rag):
- Open http://localhost:8002/docs in your browser
- Click on POST /dashboard/rag endpoint
- Click "Try it out"
- Enter JSON:
```
{
  "company_id": "anthropic",
  "top_k": 10
}
```
- Click "Execute"
Lab 8 - Structured Dashboard (POST /dashboard/structured):
- Click on POST /dashboard/structured endpoint
- Click "Try it out"
- Enter JSON:
```
{
  "company_id": "anthropic"
}
```
- Click "Execute"

Expected Output:

Lab 7: Generates 8-section dashboard using RAG pipeline with "Not disclosed." for missing data
Lab 8: Generates 8-section dashboard using structured payload (more precise, no "Missing Key Information" in Disclosure Gaps)

Step 5: Lab 9 - Evaluation & Comparison

Generate Evaluation Dashboards:

Open a NEW PowerShell terminal (keep the API server running) and run:

.\run_lab9.ps1

Note: Make sure you have a .env file in the project root with your OPENAI_API_KEY.

Output:

Dashboards saved in src/lab9/evaluation_output/ directory
Generates both RAG and Structured dashboards for all companies with payloads

Fill Evaluation:

Review dashboards in src/lab9/evaluation_output/
Fill out src/lab9/EVAL.md with rubric scores (0-10 points per company)
Complete comparison of RAG vs Structured methods

Expected Output: EVAL.md with rubric and scores for all companies with payloads.

Labs Progress

✅ Lab 0: Project Bootstrap

Repository structure with dags/, data/, src/
forbes_ai50_seed.json with all 50 companies

✅ Lab 1: Scrape & Store

Python scraper for homepage, about, product, careers, blog
Stores raw HTML + clean text locally
Uploads to S3: s3://quanta-ai50-data/ai50/raw/

✅ Lab 2: Full Load Airflow DAG

ai50_full_ingest_dag.py - scrapes all 50 companies
Schedule: @once (manual trigger)
Output: data/raw/<company_id>/initial/ + S3

✅ Lab 3: Daily Refresh Airflow DAG

ai50_daily_refresh_dag.py - daily updates
Schedule: 0 3 * * * (3 AM UTC)
Tracks changes with content hashing
Creates dated subfolders per run

✅ Lab 4: Vector DB & RAG Index

Chunk text into 500-1000 tokens
Embed and store in vector DB (ChromaDB)
FastAPI endpoint: /rag/search (Lab 4 API on port 8001)

✅ Lab 5: Structured Extraction (Pydantic)

Use instructor library with LLM
Extract: Company, Event, Snapshot, Product, Leadership, Visibility
Output: data/structured/<company_id>.json

✅ Lab 6: Payload Assembly

Combine all structured data into dashboard payload
Output: data/payloads/<company_id>.json
Validation: All payloads can be loaded by src/structured_pipeline.py

✅ Lab 7: RAG Pipeline Dashboard

Endpoint: POST /dashboard/rag
Vector DB → LLM → 8-section Markdown dashboard
Uses "Not disclosed." for missing data

✅ Lab 8: Structured Pipeline Dashboard

Endpoint: POST /dashboard/structured
Pydantic payload → LLM → Markdown dashboard
More precise and less hallucinatory than RAG

✅ Lab 9: Evaluation & Comparison

Compare RAG vs Structured for 5+ companies
Rubric: factual correctness, schema adherence, hallucination control
Output: EVAL.md with scores and findings

✅ Lab 10: Dockerize FastAPI + Streamlit

docker-compose.yml for app layer
FastAPI: http://localhost:8000
Streamlit: http://localhost:8501

✅ Lab 11: DAG ↔ App Integration

Daily DAG writes to data/payloads/
App reads from payloads for dashboard generation

Key Commands

# Airflow
docker compose up               # Start Airflow
docker compose down             # Stop Airflow
docker compose logs -f          # View logs

# Check S3
aws s3 ls s3://quanta-ai50-data/ai50/raw/

# Check scraped data
ls data/raw/ | wc -l           # Count companies scraped

# Labs 4-9 - Complete Sequence
# Step 1: Lab 4 - Vector DB Index
python src\lab4\index_for_rag_all.py

# Step 2: Lab 5 - Structured Extraction
python src\lab5\structured_extraction.py

# Step 3: Lab 6 - Payload Assembly
python src\lab6\assemble_payloads.py

# Step 4: Lab 7/8 - Start API Server (run in separate terminal)
.\start_api_server.ps1

# Step 5: Lab 9 - Evaluation (in new terminal, after API is running)
.\run_lab9.ps1

Troubleshooting

DLL Errors (Windows):

Install Visual C++ Redistributables: https://aka.ms/vs/17/release/vc_redist.x64.exe
Restart terminal after installation

API Not Starting:

Check if port 8002 is available
Verify .env file has OPENAI_API_KEY set
Ensure virtual environment is activated

Vector DB Empty:

Run python src\index_for_rag_all.py to populate vector DB
Check that data/raw/ contains company directories

Docker Issues:

Ensure Docker Desktop is running
Check docker-compose.yml configuration
View logs: docker compose logs -f

Contributions

Swara: Phase 1 (Labs 0–1), Phase 2 (Lab 5), Phase 3 (Labs 8–9)
Nat: Phase 1 (Labs 2–3), Phase 4 (Labs 10–11)
Kundana: Phase 2 (Labs 4, 6),Phase 3(Lab 7)

Roles & Responsibilities

Swara
- Scraping (Lab 1), structured extraction (Lab 5)
- Evaluation and Documentation (EVAL.md, REFLECTION.md)
- Dashboard generation via Structured pipeline (Lab 8) and evaluation (Lab 9)
Nat
- Airflow DAGs (full-load and daily) and deployment (Labs 2–3)
- Frontend/Streamlit and containerization in Phase 4 (Labs 10–11)
Kundana
- RAG-based pipeline and retrieval experiments (Labs 4, 7)
- Payload assembly integration (Lab 6)

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
airflow_env		airflow_env
dags		dags
data		data
src		src
.DS_Store		.DS_Store
.gitignore		.gitignore
Assignment.md		Assignment.md
CONTRIBUTION_ATTESTATION.txt		CONTRIBUTION_ATTESTATION.txt
Dockerfile		Dockerfile
EVAL.md		EVAL.md
LICENSE		LICENSE
README.md		README.md
REFLECTION.md		REFLECTION.md
RUN_ALL_LABS.md		RUN_ALL_LABS.md
calculate_evaluation_matrix.py		calculate_evaluation_matrix.py
docker-compose.yml		docker-compose.yml
generate_anthropic_dashboard.ps1		generate_anthropic_dashboard.ps1
requirements-docker.txt		requirements-docker.txt
requirements.txt		requirements.txt
run_lab9.ps1		run_lab9.ps1
start_api_server.ps1		start_api_server.ps1

License

natnichalerdwa/pe-dashboard-ai50

Folders and files

Latest commit

History

Repository files navigation

Forbes AI 50 - PE Dashboard (Project ORBIT)

Submission Deliverables

Quick Start

Run Airflow (Docker)

Run App Locally (Dev)

Project Structure

Setup

1. AWS S3 Configuration

2. Seed File

3. Environment Variables

Running Labs 4-9

Prerequisites

Step 1: Lab 4 - Vector DB & RAG Index

Step 2: Lab 5 - Structured Extraction

Step 3: Lab 6 - Payload Assembly

Step 4: Lab 7 & 8 - Dashboard API

Step 5: Lab 9 - Evaluation & Comparison

Labs Progress

✅ Lab 0: Project Bootstrap

✅ Lab 1: Scrape & Store

✅ Lab 2: Full Load Airflow DAG

✅ Lab 3: Daily Refresh Airflow DAG

✅ Lab 4: Vector DB & RAG Index

✅ Lab 5: Structured Extraction (Pydantic)

✅ Lab 6: Payload Assembly

✅ Lab 7: RAG Pipeline Dashboard

✅ Lab 8: Structured Pipeline Dashboard

✅ Lab 9: Evaluation & Comparison

✅ Lab 10: Dockerize FastAPI + Streamlit

✅ Lab 11: DAG ↔ App Integration

Key Commands

Troubleshooting

Contributions

Roles & Responsibilities

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages