
flowcode

The voice of Claude Code — a native macOS app that reads replies aloud and takes dictation, fully on-device.

Solo: concept, architecture, Swift + Python, design · 2026 · Open source (MIT) · public beta

Swift 6 · macOS · Metal · Kokoro TTS · Whisper STT · Core Audio · Accessibility API

GitHub: flowcode-app · voicemode fork · upstream voicemode

flowcode is a tiny native macOS menu-bar app that gives Claude Code a voice. The moment a new reply lands, it reads the prose aloud with a local text-to-speech engine — so you hear when Claude finishes, asks a question, or needs your call, without babysitting the terminal.

When it's your turn, you hold Right Option (⌥) and talk: your words are transcribed locally and pasted into the focused app, and you press Enter yourself. It also reads the Claude desktop app (Chat, Cowork, Code). Everything runs on-device: no account, no cloud, no telemetry.

⌥ · hold-to-talk dictation

MIT · open source · public beta

the problem

Living in Claude Code for long sessions means two kinds of waiting: watching the terminal for a reply to finish, and typing the next prompt when your hands or eyes are busy. Voice for coding agents has been built many times — but it tends to feel turn-based and clunky, a loop bolted onto a terminal rather than a finished product.

I wanted something that felt ambient: Claude simply speaks when it has something to say, and I can answer out loud without breaking flow — while staying completely private and never letting a voice command do anything irreversible on its own.

how it works

flowcode never touches Claude Code's internals. Claude Code already writes a live JSONL transcript of every session to disk — flowcode just tails that file, extracts each new assistant message, strips tool-call narration and code blocks so only the prose is spoken, and sends it to a local Kokoro TTS service. For the Claude desktop app, which has no transcript file, it reads the on-screen text through the macOS Accessibility tree instead.
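The tail-and-filter step above can be sketched in a few lines. This is an illustrative Python sketch, not flowcode's Swift code. It assumes each transcript line is a JSON object with a `type` field and a `message.content` list of typed blocks (`text`, `tool_use`, and so on); that schema and the helper names `read_new_lines` and `speakable_text` are assumptions made for the example.

```python
import json
import re
from typing import List, Optional, Tuple

# Fenced code blocks, including an unterminated fence at the end of the text.
FENCE = re.compile(r"```.*?(?:```|\Z)", re.DOTALL)
INLINE_CODE = re.compile(r"`([^`\n]*)`")


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Return the complete lines appended after byte `offset`, plus the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n") + 1  # stop before a partially written last line
    lines = [raw.decode("utf-8") for raw in data[:end].split(b"\n") if raw.strip()]
    return lines, offset + end


def speakable_text(jsonl_line: str) -> Optional[str]:
    """Return the prose of one assistant record, or None if there is nothing to speak."""
    try:
        record = json.loads(jsonl_line)
    except json.JSONDecodeError:
        return None
    # Assumed schema: {"type": "assistant", "message": {"content": [...]}}
    if not isinstance(record, dict) or record.get("type") != "assistant":
        return None
    content = (record.get("message") or {}).get("content", [])
    if isinstance(content, str):
        content = [{"type": "text", "text": content}]
    # Keep only text blocks; tool calls and other block types are skipped.
    parts = [
        block.get("text", "")
        for block in content
        if isinstance(block, dict) and block.get("type") == "text"
    ]
    prose = FENCE.sub(" ", "\n".join(parts))   # drop code blocks
    prose = INLINE_CODE.sub(r"\1", prose)      # keep inline code words, drop backticks
    prose = re.sub(r"\s+", " ", prose).strip()
    return prose or None
```

A watcher would call `read_new_lines` whenever the file changes, keep the returned offset, and pass each line through `speakable_text` before handing the result to the TTS service.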

Dictation is the mirror image: holding ⌥ captures the mic, a local Whisper service transcribes it, and the text is pasted into whatever app is focused. The mic is released the instant you let go. Both engines are localhost services, so nothing ever leaves the Mac.

$data-flow

Claude Code JSONL ─tail─┐
                        ├─▶ flowcode ─HTTP─▶ Kokoro :8880 ─▶ 🔊 + orb
Claude Desktop  AX ─────┘
        ⌥ hold ──▶ mic ──HTTP──▶ Whisper :2022 ──paste──▶ focused app
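The two HTTP hops in the diagram can be sketched as plain request builders. The ports (8880 and 2022) come from the diagram. Everything else here is an assumption for illustration: that both services expose OpenAI-style endpoints (`/v1/audio/speech` and `/v1/audio/transcriptions`), the `kokoro` model name, the `af_sky` voice, and the helper names. The sketch only builds the requests and checks that they target loopback; it does not send them.

```python
import json
import urllib.parse
import urllib.request
import uuid

# Ports are from the data-flow diagram; the URL paths are assumptions.
KOKORO_URL = "http://127.0.0.1:8880/v1/audio/speech"
WHISPER_URL = "http://127.0.0.1:2022/v1/audio/transcriptions"


def tts_request(text: str, voice: str = "af_sky") -> urllib.request.Request:
    """Build a JSON POST asking the local TTS service to speak `text`."""
    body = json.dumps(
        {"model": "kokoro", "input": text, "voice": voice, "response_format": "wav"}
    ).encode("utf-8")
    return urllib.request.Request(
        KOKORO_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stt_request(wav_bytes: bytes, filename: str = "dictation.wav") -> urllib.request.Request:
    """Build a multipart POST uploading recorded audio to the local STT service."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        WHISPER_URL,
        data=head + wav_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def is_loopback(req: urllib.request.Request) -> bool:
    """True when the request targets this machine only."""
    host = urllib.parse.urlsplit(req.full_url).hostname
    return host in {"127.0.0.1", "localhost", "::1"}
```

Passing either request to `urllib.request.urlopen` would perform the call. A guard like `is_loopback` is one cheap way to assert the "nothing leaves the Mac" property in tests.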

engineering decisions

The product is split into two tiers. Model B — what ships today — is read-aloud plus dictation, written entirely in Swift with no Python core and no socket: fast, small, dependency-light. Model A is an experimental real-time layer (barge-in, streaming TTS, semantic endpointing) that runs a forked voicemode Python core; it is wired in but default-off.

Audio-reactive orb: rendered live with a runtime-compiled Metal shader rather than a bitmap, with Reduce Motion and Reduce Transparency fallbacks.
Clause splitting: long replies are chunked into sentences and the next one is synthesized while the current one plays, so the first sentence starts speaking sooner instead of waiting for the whole reply to buffer.
Barge-in: a pure state machine (WebRTC VAD + an energy gate tuned to the assistant's own TTS), so you can talk over it with ~100–150 ms kill latency and no echo-cancellation hardware.
Claude Desktop support: reads the Accessibility tree, anchoring on per-turn headings and de-duplicating reflowed text so nothing is read twice.
Upstream-friendly fork: every change to voicemode sits behind a feature flag, so it rebases cleanly on the original and stays byte-for-byte identical when off.
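
The clause-splitting decision in the list above amounts to a one-ahead prefetch. Here is a minimal Python sketch of that idea, assuming a naive punctuation-based splitter and caller-supplied `synthesize` and `play` functions. All names are hypothetical, and the shipping implementation is Swift.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Naive splitter: break after ., ! or ? followed by whitespace.
# Abbreviations such as "e.g." will be split too; a real splitter needs more care.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def speak_pipelined(
    text: str,
    synthesize: Callable[[str], bytes],
    play: Callable[[bytes], None],
) -> None:
    """Play sentence N while sentence N+1 is synthesized in the background."""
    sentences = split_sentences(text)
    if not sentences:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(synthesize, sentences[0])
        for upcoming in sentences[1:] + [None]:
            audio = pending.result()        # wait for the current sentence's audio
            if upcoming is not None:
                pending = pool.submit(synthesize, upcoming)  # prefetch the next one
            play(audio)                     # blocks while the prefetch runs
```

With this shape, time-to-first-audio is the synthesis time of the first sentence rather than of the whole reply, and later sentences are usually ready by the time the previous one finishes playing.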

privacy & safety

Two principles are non-negotiable. First, everything is local: TTS and STT are localhost services, so there is no account and no request ever leaves the machine. Second, voice can only propose, never commit: flowcode never presses Enter for you. Higher-risk actions go through a confirmation gate that requires a real, non-voice gesture (a click, a hotkey, or Touch ID), so ambient sound, a colleague's voice, or the app's own TTS can never trigger anything, and a timeout always defaults to deny.
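
The "propose, never commit" rule can be expressed as a small gate object. This is a hedged Python sketch of the idea, not the app's code. The `Source` enum, the `ConfirmationGate` class, and the 10-second default timeout are all hypothetical choices for the example.

```python
import time
from enum import Enum
from typing import Callable, Optional


class Source(Enum):
    VOICE = "voice"
    CLICK = "click"
    HOTKEY = "hotkey"
    TOUCH_ID = "touch_id"


# Only physical gestures may commit an action; voice is never in this set.
PHYSICAL = {Source.CLICK, Source.HOTKEY, Source.TOUCH_ID}


class ConfirmationGate:
    def __init__(self, timeout_s: float = 10.0, clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._action: Optional[str] = None
        self._deadline = 0.0

    def propose(self, action: str) -> None:
        """Register a pending action; anything (including voice) may propose."""
        self._action = action
        self._deadline = self._clock() + self._timeout_s

    def confirm(self, source: Source) -> bool:
        """Return True only for a physical gesture made before the deadline."""
        if self._action is None:
            return False
        if self._clock() >= self._deadline:
            self._action = None          # timeout defaults to deny
            return False
        if source not in PHYSICAL:
            return False                 # voice cannot commit; proposal stays pending
        self._action = None
        return True
```

Injecting the clock keeps the timeout path testable without sleeping, and leaving a proposal pending after a voice "confirm" means a stray utterance neither approves nor cancels it.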

status & what's next

flowcode is open source under MIT and in public beta (v0.3.0). It installs in an unusual way — agent-driven: you point Claude Code at the repo and it runs the setup end-to-end (voice services, build, signing, permissions). The next big lever is a notarized one-click download, which is what would take it from a few hundred build-from-source users to a broad audience. Built on top of voicemode by Mike Bailey.
