A Python-based web search and synthesis API that processes user queries, performs web and YouTube searches, scrapes content, and generates detailed Markdown answers with sources and images. Built for extensibility, robust error handling, and efficient information retrieval using modern async APIs and concurrency.
NEW: Now features an IPC-based embedding model server for optimized GPU resource usage and better scalability!
**Before (one local model per worker):**

```
App Worker 1 ── Local Embedding Model (GPU Memory: ~2GB)
App Worker 2 ── Local Embedding Model (GPU Memory: ~2GB)
App Worker 3 ── Local Embedding Model (GPU Memory: ~2GB)

Total GPU Usage: ~6GB
```

**After (shared IPC embedding server):**

```
App Worker 1 ──┐
App Worker 2 ──┼── IPC ── Embedding Server (GPU Memory: ~2GB)
App Worker 3 ──┘

Total GPU Usage: ~2GB (67% reduction!)
```
The system uses an Inter-Process Communication (IPC) architecture with browser automation and agent pooling to optimize resource usage and enable horizontal scaling:
```mermaid
graph TB
    subgraph "Client Layer"
        A1[App Worker 1<br/>Port: 5000<br/>Async Queue]
        A2[App Worker 2<br/>Port: 5001<br/>Async Queue]
        A3[App Worker N<br/>Port: 500X<br/>Async Queue]
    end

    subgraph "IPC Communication Layer"
        IPC[IPC Manager<br/>BaseManager<br/>Port: 5002]
    end

    subgraph "Model Server Layer"
        ES[Embedding Server<br/>GPU Optimized]
        SAP[Search Agent Pool<br/>Browser Automation]
        PM[Port Manager<br/>Ports: 9000-9999]
    end

    subgraph "Embedding Services"
        ES --> EM[SentenceTransformer<br/>all-MiniLM-L6-v2<br/>ThreadPoolExecutor]
        ES --> CS[Cosine Similarity<br/>Top-K Matching]
    end

    subgraph "Search Agents"
        SAP --> YTA[Yahoo Text Agents<br/>Max 20 tabs/agent]
        SAP --> YIA[Yahoo Image Agents<br/>Max 20 tabs/agent]
        YTA --> P1[Playwright Instance 1<br/>Port: 9XXX]
        YTA --> P2[Playwright Instance 2<br/>Port: 9XXX]
        YIA --> P3[Playwright Instance 3<br/>Port: 9XXX]
        YIA --> P4[Playwright Instance 4<br/>Port: 9XXX]
    end

    subgraph "External Services"
        YS[Yahoo Search Results]
        YI[Yahoo Image Search]
        WEB[Web Scraping]
        YT[YouTube Transcripts<br/>Rate Limited: 20/min]
        LLM[Pollinations LLM API<br/>AI Synthesis]
    end

    subgraph "Request Processing"
        RQ[Request Queue<br/>Max: 100]
        PS[Processing Semaphore<br/>Max: 15 concurrent]
        AR[Active Requests<br/>Tracking & Stats]
    end

    A1 -.->|TCP:5002<br/>authkey| IPC
    A2 -.->|TCP:5002<br/>authkey| IPC
    A3 -.->|TCP:5002<br/>authkey| IPC

    A1 --> RQ
    A2 --> RQ
    A3 --> RQ
    RQ --> PS
    PS --> AR

    IPC <--> ES
    IPC <--> SAP
    SAP <--> PM

    P1 --> YS
    P2 --> YS
    P3 --> YI
    P4 --> YI

    A1 --> WEB
    A2 --> WEB
    A3 --> WEB
    A1 --> YT
    A2 --> YT
    A3 --> YT
    A1 --> LLM
    A2 --> LLM
    A3 --> LLM

    classDef serverNode fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef workerNode fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef modelNode fill:#fff3e0,stroke:#e65100,stroke-width:3px
    classDef externalNode fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef browserNode fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef queueNode fill:#f1f8e9,stroke:#33691e,stroke-width:2px

    class ES,EM,CS modelNode
    class A1,A2,A3 workerNode
    class IPC serverNode
    class YS,YI,WEB,YT,LLM externalNode
    class SAP,YTA,YIA,P1,P2,P3,P4,PM browserNode
    class RQ,PS,AR queueNode
```
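The dashed `TCP:5002 / authkey` edges in the diagram correspond to Python's `multiprocessing.managers.BaseManager`. A minimal sketch of how such a server/client pair can be wired up; the `get_embeddings` method and the `b"secret"` authkey are illustrative, not the actual interface of `modelServer.py`:

```python
from multiprocessing.managers import BaseManager

class EmbeddingService:
    """Stand-in for the object that owns the GPU model."""
    def get_embeddings(self, texts):
        # A real server would run the SentenceTransformer here.
        return [[0.0] * 384 for _ in texts]

class IPCManager(BaseManager):
    pass

def serve() -> None:
    service = EmbeddingService()
    # Every connecting client receives a proxy to this one shared instance,
    # which is what keeps a single embedding model on the GPU.
    IPCManager.register("embedding_service", callable=lambda: service)
    manager = IPCManager(address=("0.0.0.0", 5002), authkey=b"secret")
    manager.get_server().serve_forever()

def connect():
    # Client side: register the same typeid, connect, and fetch the proxy.
    IPCManager.register("embedding_service")
    manager = IPCManager(address=("localhost", 5002), authkey=b"secret")
    manager.connect()
    return manager.embedding_service()

if __name__ == "__main__":
    serve()
```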
- **Request Processing Pipeline** (see the sketch after this list)
  - Async request queue (max 100 pending)
  - Processing semaphore (max 15 concurrent)
  - Active request tracking with statistics
- **Browser Automation Pool**
  - Pre-warmed Playwright agents for immediate use
  - Automatic agent rotation after 20 tabs
  - Dynamic port allocation (9000-9999 range)
  - Separate pools for text and image search
- **IPC Embedding System**
  - Single GPU instance with ThreadPoolExecutor
  - Thread-safe operations with semaphore control
  - Cosine similarity for semantic matching
- **Performance Monitoring**
  - Real-time request statistics
  - Agent pool status tracking
  - Port usage monitoring
  - Health check endpoints
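A minimal sketch of how the pipeline's limits (queue of 100, 15 concurrent, active-request tracking) map onto `asyncio` primitives; the names are illustrative:

```python
import asyncio
import time

REQUEST_QUEUE: asyncio.Queue = asyncio.Queue(maxsize=100)  # max 100 pending
PROCESSING_SEM = asyncio.Semaphore(15)                     # max 15 concurrent
ACTIVE_REQUESTS: dict[str, float] = {}                     # request id -> start time

async def process(request_id: str, handler) -> None:
    async with PROCESSING_SEM:  # cap in-flight work at 15
        ACTIVE_REQUESTS[request_id] = time.monotonic()
        try:
            await handler()
        finally:
            ACTIVE_REQUESTS.pop(request_id, None)
            REQUEST_QUEUE.task_done()

async def dispatcher() -> None:
    # Pull queued requests and fan them out as tasks;
    # the semaphore, not the queue, bounds concurrency.
    while True:
        request_id, handler = await REQUEST_QUEUE.get()
        asyncio.create_task(process(request_id, handler))
```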
- **Single GPU Instance**: Only one embedding model loads on the GPU, reducing memory usage
- **Concurrent Processing**: Multiple app workers can use embeddings simultaneously
- **Load Balancing**: Requests are queued and processed efficiently
- **Cost Optimization**: Significantly reduced GPU memory requirements
- **Horizontal Scaling**: Easy to add more app workers without additional GPU load
- **Fault Isolation**: Embedding server failures don't crash app workers
- **Hot Reloading**: App workers can be restarted without reloading the heavy embedding model
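For the semantic-matching step, the architecture names `all-MiniLM-L6-v2` with cosine similarity and top-K selection. A minimal sketch of that combination, assuming the `sentence-transformers` package; `top_k` is an illustrative helper, not the server's actual API:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # the single shared instance

def top_k(query: str, documents: list[str], k: int = 5) -> list[tuple[str, float]]:
    # normalize_embeddings=True makes the dot product equal cosine similarity.
    doc_vecs = model.encode(documents, normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec
    order = np.argsort(scores)[::-1][:k]  # highest-scoring documents first
    return [(documents[i], float(scores[i])) for i in order]
```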
- Accepts user queries and processes them using web search, YouTube transcript analysis, and AI-powered synthesis.
- Produces comprehensive Markdown responses with inline citations and images.
- Handles complex, multi-step queries with iterative tool use.
- Scrapes main text and images from selected URLs (after evaluating snippets).
- Avoids scraping irrelevant or search result pages.
- Extracts metadata and transcripts from YouTube videos.
- Presents transcripts as clean, readable text.
- Uses Pollinations API for LLM-based planning and synthesis.
- Iteratively calls tools (web search, scraping, YouTube, timezone) as needed.
- Gathers evidence from multiple sources before answering.
- Exposes `/search` (JSON) and `/search/sse` (Server-Sent Events) endpoints.
- Supports both GET and POST requests, including the OpenAI-compatible message format.
- CORS enabled for web front-ends.
- Uses async and thread pools for parallel web scraping and YouTube processing (sketched below).
- Handles multiple requests efficiently.
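Parallel scraping of this kind typically pairs the event loop with a thread pool so blocking I/O never stalls the loop. A minimal sketch, with `fetch` standing in for the real scraper:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> bytes:
    # Blocking I/O: runs inside the thread pool, not on the event loop.
    with urlopen(url, timeout=10) as resp:
        return resp.read()

async def fetch_all(urls: list[str]) -> list[bytes]:
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=8) as pool:
        return await asyncio.gather(
            *(loop.run_in_executor(pool, fetch, u) for u in urls)
        )
```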
- `app.py`: Main Quart API server. Handles the `/search`, `/search/sse`, and OpenAI-compatible `/v1/chat/completions` endpoints. Manages async event streams and JSON responses.
- `searchPipeline.py`: Core pipeline logic. Orchestrates tool calls (web search, scraping, YouTube, timezone), interacts with the Pollinations LLM API, and formats Markdown answers with sources and images.
- `modelServer.py`: The new IPC-based embedding server that runs on port 5002. Handles the SentenceTransformer model, FAISS indexing, and web search with embeddings.
- `embeddingClient.py`: Client module for connecting to the embedding server. Provides thread-safe access with automatic reconnection (see the sketch below).
- `textEmbedModel.py`: Updated legacy module with backward compatibility. Automatically switches between IPC and local models based on configuration.
- `start_embedding_server.py`: Startup script for launching the embedding server with proper monitoring and graceful shutdown.
- `test_embedding_ipc.py`: Test suite for validating the IPC connection and embedding functionality.
- `clean_query.py`, `search.py`, `scrape.py`, `getYoutubeDetails.py`, `tools.py`, `getTimeZone.py`: Tool implementations for query cleaning, web search, scraping, YouTube, and timezone handling.
- `.env`: Environment variables for API tokens and model config.
- `requirements.txt`: Python dependencies.
- `Dockerfile`, `docker-compose.yml`: Containerization and deployment.
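The reconnection behavior described for `embeddingClient.py` can be sketched as a lock-guarded proxy that is dropped and re-created on failure; the class and method names here are illustrative:

```python
import threading

class EmbeddingClient:
    """Illustrative thread-safe client with reconnect-on-failure."""

    def __init__(self, connect):
        self._connect = connect        # callable returning a fresh proxy
        self._lock = threading.Lock()
        self._proxy = None

    def embed(self, texts):
        with self._lock:               # serialize access to the proxy
            for _ in range(2):         # one transparent retry
                try:
                    if self._proxy is None:
                        self._proxy = self._connect()
                    return self._proxy.get_embeddings(texts)
                except (ConnectionError, EOFError):
                    self._proxy = None  # drop the broken proxy and retry
            raise ConnectionError("embedding server unreachable")
```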
- Python 3.12
- Install dependencies: `pip install -r requirements.txt`
- Set up `.env` with the required API tokens.
```bash
# Terminal 1: Start the embedding server
cd search/PRODUCTION
python start_embedding_server.py
```

The embedding server will start on port 5002 and load the SentenceTransformer model onto the available GPU.

```bash
# Terminal 2: Test the embedding server
python test_embedding_ipc.py
```

```bash
# Terminal 3: Start the first app worker
cd src
python app.py

# Terminal 4: Start additional workers on different ports
PORT=5001 python app.py
PORT=5003 python app.py
```

- Embedding Server: Monitor GPU usage and active operations through the logs
- App Workers: Each worker connects independently to the embedding server
- Health Check: Use the test script to verify IPC connectivity
Set environment variables:

```bash
# Enable/disable IPC embedding (default: true)
export USE_IPC_EMBEDDING=true

# Embedding server configuration
export EMBEDDING_SERVER_HOST=localhost
export EMBEDDING_SERVER_PORT=5002
```

If the embedding server is unavailable, the system automatically falls back to local embedding models, ensuring service continuity.
```bash
# Disable IPC and use local models
export USE_IPC_EMBEDDING=false
python app.py
```

- The API is available at `http://127.0.0.1:5000/search`
```bash
curl -X POST http://localhost:5000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the latest trends in AI research? Summarize this YouTube video https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'
```

```bash
curl -X POST http://localhost:5000/search \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Tell me about the history of the internet."}
    ]
  }'
```

```bash
curl -N -X POST http://localhost:5000/search/sse \
  -H "Content-Type: application/json" \
  -d '{"query": "weather in London tomorrow"}'
```
- `/search` (POST/GET)
  - Accepts `{"query": "..."}`
  - Also supports OpenAI-style `{"messages": [...]}`
- `/search/sse` (POST)
  - Streams results as Server-Sent Events (SSE)
- `/v1/chat/completions`
  - OpenAI-compatible chat completions endpoint
Set environment variables in `.env`:

```bash
# Pollinations API
TOKEN=your_pollinations_token
MODEL=your_model_name
REFERRER=your_referrer

# IPC Embedding Configuration
USE_IPC_EMBEDDING=true
EMBEDDING_SERVER_HOST=localhost
EMBEDDING_SERVER_PORT=5002

# Worker Configuration
PORT=5000
MAX_CONCURRENT_OPERATIONS=3
```

- Embedding Server: Adjust `MAX_CONCURRENT_OPERATIONS` in `modelServer.py`
- App Workers: Set different `PORT` values for multiple workers
- Memory Management: Configure batch sizes and GPU memory fractions as needed
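A typical way for the app to read this file is via `python-dotenv` (assuming that package; the defaults below mirror the values above):

```python
import os

from dotenv import load_dotenv  # assumes the python-dotenv package

load_dotenv()  # reads .env from the current working directory

TOKEN = os.getenv("TOKEN")
MODEL = os.getenv("MODEL")
PORT = int(os.getenv("PORT", "5000"))
USE_IPC_EMBEDDING = os.getenv("USE_IPC_EMBEDDING", "true").lower() == "true"
MAX_CONCURRENT_OPERATIONS = int(os.getenv("MAX_CONCURRENT_OPERATIONS", "3"))
```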
- Single embedding model instance shared across all workers
- Automatic GPU memory cleanup after operations
- Configurable batch sizes for large document processing
- Semaphore-based operation limiting
- Thread-safe GPU operations
- Automatic retry logic with exponential backoff (sketched after this list)
- LRU cache for frequently accessed embeddings
- Connection pooling for web requests
- Async processing for I/O operations
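The retry logic flagged above can be as small as a wrapper with exponential backoff and jitter; this is an illustrative version, not the actual code from `embeddingClient.py`:

```python
import random
import time

def with_backoff(call, retries: int = 4, base: float = 0.5):
    """Retry a flaky call, sleeping base * 2**attempt (+ jitter) between tries."""
    for attempt in range(retries):
        try:
            return call()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base * (2 ** attempt) + random.uniform(0, 0.1))
```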
- `/health`: App worker health status
- `/embedding/health`: Embedding server connectivity status
- `/embedding/stats`: Active operations and performance metrics
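Since the app workers are Quart servers, these monitoring endpoints can be plain async routes. A minimal sketch (the payloads shown are illustrative):

```python
from quart import Quart

app = Quart(__name__)

@app.route("/health")
async def health():
    return {"status": "ok"}  # Quart serializes dicts to JSON

@app.route("/embedding/health")
async def embedding_health():
    # A real check would ping the embedding server over IPC here.
    return {"embedding_server": "reachable"}

@app.route("/embedding/stats")
async def embedding_stats():
    return {"active_operations": 0, "requests_served": 0}
```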
```bash
# Build and run with docker-compose
docker-compose up --build

# Scale app workers
docker-compose up --scale search-app=3
```

```yaml
# Example scaling configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: search-embedding-server
spec:
  replicas: 1  # Single embedding server
  selector:
    matchLabels:
      app: embedding-server
  template:
    metadata:
      labels:
        app: embedding-server
    spec:
      containers:
        - name: embedding-server
          image: search-embedding-server:latest  # placeholder image name
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: search-app-workers
spec:
  replicas: 5  # Multiple app workers
  selector:
    matchLabels:
      app: search-app
  template:
    metadata:
      labels:
        app: search-app
    spec:
      containers:
        - name: search-app
          image: search-app:latest  # placeholder image name
```
- **Embedding Server Connection Failed**

  ```bash
  # Check if the server is running
  netstat -tulpn | grep 5002
  # Test the connection
  python test_embedding_ipc.py
  ```

- **GPU Out of Memory**

  ```bash
  # Reduce the batch size in modelServer.py
  # Lower MAX_CONCURRENT_OPERATIONS
  # Check GPU memory:
  nvidia-smi
  ```

- **High Latency**

  ```bash
  # Monitor active operations
  # Scale up app workers if needed
  # Check network latency between workers and the embedding server
  ```
- Embedding server logs: Check the `modelServer.py` output
- App worker logs: Check the individual `app.py` instances
- System metrics: Monitor GPU usage, memory, and CPU
- Connection health: Run the test scripts regularly
- Backup Current Setup
- Install New Dependencies: `pip install loguru`
- Start Embedding Server: `python start_embedding_server.py`
- Test Connection: `python test_embedding_ipc.py`
- Update Environment: Set `USE_IPC_EMBEDDING=true`
- Restart App Workers: They will automatically use IPC
- Monitor Performance: Check logs and resource usage

Set `USE_IPC_EMBEDDING=false` to roll back to local embedding models.
```bash
cd search/PRODUCTION
python service_manager.py --workers 3 --port 5000
```

```powershell
cd search/PRODUCTION
.\start_services.ps1 -Workers 3 -BasePort 5000
```

- Start Embedding Server:

  ```bash
  cd search/PRODUCTION
  python start_embedding_server.py
  ```

- Test Connection:

  ```bash
  python test_embedding_ipc.py
  ```

- Start App Workers:

  ```bash
  cd src
  PORT=5000 python app.py &
  PORT=5001 python app.py &
  PORT=5003 python app.py &
  ```
- Search API: `http://localhost:5000/search`
- Health Check: `http://localhost:5000/health`
- Embedding Health: `http://localhost:5000/embedding/health`
- Embedding Stats: `http://localhost:5000/embedding/stats`
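A quick way to smoke-test these endpoints from Python (assuming the `requests` package):

```python
import requests

resp = requests.post(
    "http://localhost:5000/search",
    json={"query": "weather in London tomorrow"},
    timeout=120,  # synthesis can take a while
)
resp.raise_for_status()
print(resp.json())

print(requests.get("http://localhost:5000/health", timeout=5).json())
```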
- Relies on Pollinations API for LLM responses (subject to their rate limits).
- Requires internet connectivity for search and scraping.
- YouTube transcript extraction depends on third-party services.
- NEW: Embedding server requires sufficient GPU memory for optimal performance.
