<IN DEVELOPMENT - NOT CURRENTLY WORKING>
WikiTalk is an offline, conversational AI assistant that allows users to talk about history, science, and culture using a local copy of Wikipedia. It understands natural questions, supports follow-up questions and contextual discussion, and speaks answers aloud via Piper TTS.
- Offline knowledge: Runs entirely from a local Wikipedia dataset (FineWiki Parquet files)
- Natural conversation: Multi-turn dialogue with context retention and topic continuity
- Voice interaction: Speaks answers using Piper TTS
- Grounded knowledge: Uses retrieved, cited chunks from Wikipedia to reduce hallucinations
- Mac-native: Optimized for Apple Silicon performance
Installation:

```bash
# Install dependencies
python setup.py

# Or manually install the requirements
pip install -r requirements.txt
```

Data processing:

```bash
# This will create SQLite and FAISS indexes from the parquet files
python data_processor.py
```

Note: this step can take several hours depending on your system. The process will (see the sketch below):
- Parse all parquet files in `finewiki/data/enwiki/`
- Create text chunks with overlap
- Build a SQLite FTS5 index for BM25 search
- Generate embeddings and build a FAISS index for dense retrieval
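A minimal sketch of that indexing pass, assuming the FineWiki parquet files expose `title` and `text` columns and using `sentence-transformers` for embeddings. The file name, column names, chunk sizes, and embedding model here are all assumptions, not necessarily what `data_processor.py` does:

```python
# Illustrative indexing pass -- not the actual data_processor.py.
import sqlite3

import faiss
import pyarrow.parquet as pq
from sentence_transformers import SentenceTransformer

CHUNK_SIZE, OVERLAP = 1000, 200  # characters; assumed values

def chunk_text(text):
    """Split text into overlapping character windows."""
    if not text:
        return []
    step = CHUNK_SIZE - OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, max(len(text) - OVERLAP, 1), step)]

# 1. SQLite FTS5 table for BM25 (lexical) search
db = sqlite3.connect("wikitalk.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING fts5(title, body)")

chunks = []
table = pq.read_table("finewiki/data/enwiki/enwiki-00000.parquet")  # hypothetical file name
for title, text in zip(table["title"].to_pylist(), table["text"].to_pylist()):
    for piece in chunk_text(text):
        db.execute("INSERT INTO chunks VALUES (?, ?)", (title, piece))
        chunks.append(piece)
db.commit()

# 2. Embeddings + FAISS index for dense (semantic) search.
# A real pass over all of Wikipedia would encode in batches, not in one call.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
emb = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on normalized vectors
index.add(emb)
faiss.write_index(index, "wikitalk.faiss")
```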
You need a local LLM server running. Options:

Option A: LM Studio
- Download LM Studio from https://lmstudio.ai/
- Load a model such as `Qwen2.5-14B-Instruct` or `Llama-3.1-8B-Instruct`
- Start the local server (usually on port 1234)
Option B: llama.cpp

```bash
# Download and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Download a model (example with Qwen2.5-14B)
# Start the server
./server -m models/qwen2.5-14b-instruct.gguf --port 1234
```
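Either server exposes an OpenAI-compatible HTTP API, which is what WikiTalk's LLM client talks to. A quick sanity check from Python (a sketch; the model name must match whatever you loaded):

```python
import requests

# Ask the local OpenAI-compatible endpoint for a completion
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "qwen2.5-14b-instruct",  # must match the loaded model
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```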
For voice output, download Piper TTS voices:

```bash
# Create voices directory
mkdir -p voices

# Download a voice (example)
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/amy/medium/en_US-amy-medium.onnx -O voices/en_US-amy-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/amy/medium/en_US-amy-medium.onnx.json -O voices/en_US-amy-medium.onnx.json
```
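To confirm the voice works before launching WikiTalk, you can pipe text through the `piper` CLI, which reads from stdin and writes a WAV file (assuming `piper` is on your PATH):

```bash
echo "WikiTalk voice check" | piper --model voices/en_US-amy-medium.onnx --output_file check.wav
afplay check.wav  # macOS built-in audio player
```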
Run WikiTalk:

```bash
python wikitalk.py
```

Once running, you can:
- Ask questions: "Tell me about the Meiji Restoration"
- Follow up: "And how did it affect Korea?"
- Get voice responses (if TTS is configured)
- Type `clear` to start a new conversation
- Type `quit` to exit
```
User Input → Query Rewriter → Hybrid Retrieval → LLM → Response → TTS → Audio
                                      ↓
                 Conversation Memory ← SQLite + FAISS Indexes
```
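In code, that loop looks roughly like the sketch below. It defines a minimal `ask_llm()` helper around the chat-completions call shown earlier and assumes the `hybrid_search()` helper sketched after the component list; everything here is illustrative, not the actual module layout:

```python
import requests

LLM_URL = "http://localhost:1234/v1/chat/completions"  # see config.py
history = []  # conversation memory: (user_turn, assistant_turn) pairs

def ask_llm(prompt):
    """One round-trip to the local OpenAI-compatible server."""
    resp = requests.post(LLM_URL, json={
        "model": "qwen2.5-14b-instruct",  # must match the loaded model
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=120)
    return resp.json()["choices"][0]["message"]["content"]

def answer(question):
    # Query rewriting: fold recent turns into a standalone search query
    search_query = question
    if history:
        recent = " | ".join(user for user, _ in history[-3:])
        search_query = ask_llm(
            f"Earlier questions: {recent}\n"
            f"Rewrite this follow-up as a standalone search query: {question}")
    # Hybrid retrieval (see the sketch after the component list below)
    chunks = hybrid_search(search_query)
    reply = ask_llm("Answer using only these Wikipedia excerpts:\n"
                    + "\n---\n".join(chunks) + f"\n\nQuestion: {question}")
    history.append((question, reply))
    return reply
```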
Components:
- Data Processor: Parses the parquet files and creates the search indexes
- Hybrid Retriever: Combines BM25 (SQLite FTS5) and dense (FAISS) search (see the sketch after this list)
- LLM Client: Interfaces with the local LLM for response generation
- TTS Client: Converts text to speech using Piper or macOS `say`
- Conversation Manager: Handles multi-turn dialogue context
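One common way to combine the two rankers is reciprocal rank fusion; the sketch below uses it for illustration, assuming the index layout from the data-processing sketch above (the real retriever may fuse scores differently):

```python
# Hybrid retrieval via reciprocal rank fusion (RRF) -- illustrative only.
import sqlite3

import faiss
from sentence_transformers import SentenceTransformer

db = sqlite3.connect("wikitalk.db")
index = faiss.read_index("wikitalk.faiss")
model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model

def hybrid_search(query, top_k=5, k_rrf=60):
    # Lexical ranking: FTS5's bm25() scores are lower for better matches
    rows = db.execute(
        "SELECT rowid FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT 50",
        (query,)).fetchall()
    bm25_rank = {rowid: r for r, (rowid,) in enumerate(rows)}

    # Dense ranking: FAISS ids follow insertion order, assumed to be rowid - 1
    q = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(q, 50)
    dense_rank = {int(i) + 1: r for r, i in enumerate(ids[0]) if int(i) >= 0}

    # RRF: score(chunk) = sum over rankers of 1 / (k + rank)
    scores = {}
    for rank_map in (bm25_rank, dense_rank):
        for cid, r in rank_map.items():
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k_rrf + r)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [db.execute("SELECT body FROM chunks WHERE rowid = ?", (cid,)).fetchone()[0]
            for cid in best]
```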
Edit `config.py` to customize:
- Data paths and model settings
- Retrieval parameters (top-k, chunk size)
- LLM server URL and model
- TTS voice settings
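The exact contents of `config.py` are not reproduced here; a plausible shape, with every name except `LLM_URL` (referenced under Troubleshooting) hypothetical:

```python
# config.py -- illustrative values only
DATA_DIR = "finewiki/data/enwiki/"     # parquet input
DB_PATH = "wikitalk.db"                # SQLite FTS5 index (hypothetical name)
FAISS_PATH = "wikitalk.faiss"          # dense index (hypothetical name)

CHUNK_SIZE = 1000                      # characters per chunk (assumed)
CHUNK_OVERLAP = 200                    # overlap between chunks (assumed)
TOP_K = 5                              # retrieved chunks per query (assumed)

LLM_URL = "http://localhost:1234/v1"   # local LLM server
LLM_MODEL = "qwen2.5-14b-instruct"

TTS_VOICE = "voices/en_US-amy-medium.onnx"
```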
Requirements:
- Python 3.8+
- 8GB+ RAM (for embeddings and the FAISS index)
- 30GB+ disk space (for the full English Wikipedia)
- Local LLM server (LM Studio or llama.cpp)
- Optional: Piper TTS for voice output
Troubleshooting:

Data processing issues:
- Ensure the parquet files are in `finewiki/data/enwiki/`
- Check available disk space (30GB+ needed)
- Monitor memory usage during processing

LLM connection issues:
- Verify the LLM server is running on the correct port
- Check `LLM_URL` in `config.py`
- Test with: `curl http://localhost:1234/v1/models`

Voice output issues:
- Check the Piper installation: `which piper`
- Verify the voice files are in the `voices/` directory
- WikiTalk falls back to the macOS `say` command if Piper is unavailable (see the sketch below)
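The fallback logic can be as simple as checking for the `piper` binary at runtime; a minimal sketch (the function name and file paths are hypothetical):

```python
import shutil
import subprocess

def speak(text, voice="voices/en_US-amy-medium.onnx"):
    """Speak text via Piper if installed, else via the macOS say command."""
    if shutil.which("piper"):
        # Piper reads text on stdin and writes a WAV file
        subprocess.run(
            ["piper", "--model", voice, "--output_file", "/tmp/wikitalk.wav"],
            input=text.encode(), check=True)
        subprocess.run(["afplay", "/tmp/wikitalk.wav"], check=True)
    else:
        subprocess.run(["say", text], check=True)
```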
Performance:
- Retrieval: <1 second for most queries
- Total response time: <10 seconds
- Memory usage: ~8GB for the full English Wikipedia
- Storage: ~30GB for the complete dataset
This project uses Wikipedia data under the CC BY-SA 4.0 license.