Real-time voice agent with voice cloning. It listens on a virtual PulseAudio device, transcribes with Whisper, talks back through an LLM, and clones any voice with Qwen TTS. You can interrupt it mid-sentence and it'll stop and listen.

You'll need Nix with flakes (it handles CUDA, PulseAudio, and all native deps), an NVIDIA GPU, and either Ollama running locally or an OpenRouter API key.
Enter the dev shell, start Ollama, and pull a model:

```bash
nix develop
ollama serve &
ollama pull qwen3:8b
```

Each profile lives in `profiles/<name>/` and needs two files: a `reference.wav` with about 10 seconds of the target voice, and a `profile.toml` with the transcript and personality.
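For example, a profile named `myvoice` (the name used in the prepare script below) looks like:

```
profiles/
  myvoice/
    reference.wav   # ~10 seconds of the target voice
    profile.toml    # voice_transcript + personality_prompt
```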
If you have an audio or video clip of someone talking, the prepare script will convert it to the right format and transcribe it for you:
```bash
python scripts/prepare_reference.py clip.mp4 --profile myvoice
```

This converts the clip to mono 24 kHz WAV, runs Whisper on it, and prints the transcript to paste into your `profile.toml`:
```toml
voice_transcript = "<transcript from prepare script>"
personality_prompt = """<how the agent should behave>"""
```

The pipeline reads from a `BotSpeaker` monitor and writes to a `BotMic` sink. Set these up however you want: PipeWire virtual devices, `pactl load-module`, whatever. The names are configurable in `config.toml`.
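One quick way to do it with plain PulseAudio null sinks (a sketch; the sink names assume the defaults, so match whatever you put in `config.toml`):

```bash
# Sink the agent listens to: route call audio into BotSpeaker,
# and the agent captures it from BotSpeaker.monitor
pactl load-module module-null-sink sink_name=BotSpeaker

# Sink the agent speaks into: point other apps at BotMic.monitor as their mic
pactl load-module module-null-sink sink_name=BotMic
```

You can sanity-check the routing with `paplay -d BotSpeaker something.wav` on one side and `parecord -d BotMic.monitor check.wav` on the other.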
Then run the agent:

```bash
orange-agent --profile <name>
```

You can pass a scenario prompt to set the context for the conversation:

```bash
orange-agent --profile <name> --scenario "casual voice call with friends"
```

You can test a profile's voice directly without running the full pipeline. Pass text to speak, or let the LLM generate something in character:
```bash
python scripts/test_tts.py --profile <name> "text to speak"
python scripts/test_tts.py --profile <name> --generate --scenario "casual voice call"
python scripts/test_tts.py --profile <name> -o output.wav "save to file"
```

Everything else lives in `config.toml`: LLM provider and model, audio device names, VAD sensitivity, timing thresholds. To use OpenRouter instead of Ollama:
```toml
[llm]
provider = "openrouter"
model = "google/gemini-2.0-flash-001"
```

Then set your key in the environment:

```bash
export OPENROUTER_API_KEY="sk-..."
```
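The audio and timing settings live in the same file. A rough sketch of the sort of thing to expect (section and key names here are illustrative, not the real schema; check the shipped `config.toml` for the actual keys):

```toml
# Illustrative only -- the actual key names are defined in the repo's config.toml
[audio]
speaker_monitor = "BotSpeaker.monitor"  # device the pipeline reads from
mic_sink = "BotMic"                     # device the pipeline writes to

[vad]
sensitivity = 0.6            # how aggressively speech is detected
silence_timeout_ms = 800     # pause length that ends a turn
```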