Voice mode and the face you give Mirror

How the listening, processing, and speaking states work, and how to give Mirror a face that mirrors your own using your phone camera.

4 min read · updated Apr 19, 2026

Mirror is voice-first by default. When you tap into voice mode, the agent moves through four discrete states (idle, listening, processing, speaking), each with its own explicit visual cue. The avatar's lip-sync is driven by the frequency content of the same audio that the TTS engine streams to your speakers, so the mouth movement matches the sound rather than running on a generic loop.

The four states in detail

Idle: the avatar breathes (subtle scale animation), eyes occasionally drift and blink (random 2–6s intervals), and a faint particle field drifts around the head. No audio capture. No network activity.
Listening: microphone permission has been granted and the SpeechRecognition session is active. The avatar leans slightly forward, the particle field pulses with the input volume, and a soft cyan ring around the head mirrors your speaking amplitude in real time.
Processing: the transcript has been sent to Mirror and the agent is composing a response. The avatar relaxes, the particle field slows, and a thin shimmer moves around the perimeter of the head to indicate "thinking" without committing to a fake progress bar.
Speaking: the TTS engine is streaming synthesised audio and the lip-sync engine is driving the mouth morph targets from the AnalyserNode's frequency bins. As soon as the audio queue empties, the state returns to idle.
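
The four states above can be read as a small state machine. The following is a minimal sketch, not Mirror's actual implementation: the event names (micStarted, transcriptSent, audioStarted, audioQueueEmpty, cancelled) are hypothetical and chosen only to match the transitions described in the list.

```typescript
// The four states named in the article.
type VoiceState = "idle" | "listening" | "processing" | "speaking";

// Hypothetical event names; the article does not name its events.
type VoiceEvent =
  | "micStarted"       // recognition session became active
  | "transcriptSent"   // transcript handed to the agent
  | "audioStarted"     // synthesised audio began playing
  | "audioQueueEmpty"  // playback finished
  | "cancelled";       // user left voice mode, or an error occurred

// Allowed transitions. Any event not listed for a state is ignored.
const transitions: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  idle: { micStarted: "listening" },
  listening: { transcriptSent: "processing", cancelled: "idle" },
  processing: { audioStarted: "speaking", cancelled: "idle" },
  speaking: { audioQueueEmpty: "idle", cancelled: "idle" },
};

// Pure reducer: returns the next state, or the current one if the
// event does not apply in this state.
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  return transitions[state][event] ?? state;
}
```

Keeping the transition table as data makes it straightforward to attach one visual cue per state and to ignore events that arrive out of order, for example a late audioQueueEmpty while already idle.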

Speech-to-text path

Audio capture and transcription stay in your browser when possible. Browsers that ship the Web Speech API's SpeechRecognition interface (Safari, Chromium) transcribe through the browser's built-in recognizer in continuous mode with interim results, so the transcript appears as you speak. Note that some browsers' built-in recognizers forward audio to the browser vendor's own speech service rather than processing it on the device. Browsers that do not ship SpeechRecognition fall back to recording audio and sending the blob to a server-side transcription endpoint; a small "server STT" badge appears in the voice toolbar so the change in latency is not surprising.
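
The choice between the two paths comes down to feature detection. Here is a minimal sketch, assuming the check is made against the page's global object; the function name chooseSttPath and the SttPath type are illustrative, not Mirror's API. The webkitSpeechRecognition name is the vendor-prefixed constructor that some browsers expose instead of the unprefixed one.

```typescript
type SttPath = "browser" | "server";

// Minimal shape of the global object we inspect. In a page this would be
// `window`; it is a parameter here so the logic can run anywhere.
interface SpeechGlobals {
  SpeechRecognition?: unknown;
  webkitSpeechRecognition?: unknown;
}

// Returns "browser" when a SpeechRecognition constructor is available,
// otherwise "server" (the fallback that shows the "server STT" badge).
function chooseSttPath(globals: SpeechGlobals): SttPath {
  const ctor = globals.SpeechRecognition ?? globals.webkitSpeechRecognition;
  return typeof ctor === "function" ? "browser" : "server";
}

// Settings the article describes for the in-browser path. Both are
// standard properties of a SpeechRecognition instance.
const recognitionOptions = {
  continuous: true,      // keep listening across pauses
  interimResults: true,  // emit partial transcripts while you speak
};
```

In a page, the call would be chooseSttPath(window as SpeechGlobals), with recognitionOptions copied onto the recognizer instance before it is started.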

Text-to-speech and lip-sync

Mirror streams its response as text and as audio in parallel. The text appears in the transcript immediately, and the audio is synthesised by a low-latency TTS provider and played back through the Web Audio API. An AnalyserNode in the graph exposes the live frequency spectrum, which the lip-sync engine reduces to morph-target weights for the visible face (mouthOpen, jawOpen, lipPucker, lipSpread) every frame based on the audio that is currently playing.
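
To make the reduction from spectrum to morph targets concrete, here is a rough sketch. The four weight names come from the article, but the band split and the formulas are an illustrative heuristic, not Mirror's actual mapping. The input is assumed to be the byte array that AnalyserNode.getByteFrequencyData fills, with values from 0 to 255.

```typescript
interface MouthWeights {
  mouthOpen: number;
  jawOpen: number;
  lipPucker: number;
  lipSpread: number;
}

// Average of bins[start, end), normalised from 0..255 to 0..1.
function bandAverage(bins: Uint8Array, start: number, end: number): number {
  const hi = Math.min(end, bins.length);
  if (hi <= start) return 0;
  let sum = 0;
  for (let i = start; i < hi; i++) sum += bins[i] ?? 0;
  return sum / (hi - start) / 255;
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Illustrative heuristic (an assumption, not the product's formula):
// low-band energy opens the jaw, overall energy opens the mouth,
// low-dominant sound puckers the lips, high-dominant sound spreads them.
function binsToMouthWeights(bins: Uint8Array): MouthWeights {
  const n = bins.length;
  const low = bandAverage(bins, 0, Math.floor(n / 4));
  const mid = bandAverage(bins, Math.floor(n / 4), Math.floor(n / 2));
  const high = bandAverage(bins, Math.floor(n / 2), n);
  const overall = (low + mid + high) / 3;
  return {
    mouthOpen: clamp01(overall * 1.5),
    jawOpen: clamp01(low),
    lipPucker: clamp01(low - high),
    lipSpread: clamp01(high - low * 0.5),
  };
}
```

In a page, this would typically run once per animation frame: allocate a Uint8Array of length analyser.frequencyBinCount, call analyser.getByteFrequencyData on it, and pass the result to binsToMouthWeights. Silence yields all-zero weights, so the mouth closes when no audio is playing.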

Giving Mirror your face

On a first visit, Mirror shows you a generic point-cloud face. If you want, you can capture your own face from the avatar settings panel: Mirror asks for camera access, a standard on-device face-landmark model detects facial landmarks in real time, and the captured landmark cloud becomes the geometry that drives the avatar from then on. The capture step typically takes a few seconds while you look straight at the camera, so that the front-facing landmarks are stable.

The captured face is stored in your browser's local storage. On subsequent visits the platform reads from local storage and reconstructs the avatar geometry without any network round-trip. There is no server-side copy of your face data anywhere.
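
The save-and-restore round trip can be sketched as follows. This is an assumption-laden illustration: the storage key mirror.faceLandmarks, the Landmark tuple format, and the function names are hypothetical, since the article does not document the stored format. The KeyValueStore interface mirrors the getItem and setItem methods of the browser's localStorage so the same code works against either.

```typescript
// Hypothetical format: one [x, y, z] triple per landmark.
type Landmark = [number, number, number];

// Subset of the Web Storage interface that this sketch needs.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const FACE_KEY = "mirror.faceLandmarks"; // hypothetical key name

function saveFace(store: KeyValueStore, landmarks: Landmark[]): void {
  store.setItem(FACE_KEY, JSON.stringify(landmarks));
}

// Returns the stored landmarks, or null when nothing usable is stored,
// in which case the caller keeps the generic default face.
function loadFace(store: KeyValueStore): Landmark[] | null {
  const raw = store.getItem(FACE_KEY);
  if (raw === null) return null;
  try {
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed)) return null;
    const valid = parsed.every(
      (p: unknown) =>
        Array.isArray(p) &&
        p.length === 3 &&
        p.every((v: unknown) => typeof v === "number" && Number.isFinite(v)),
    );
    return valid ? (parsed as Landmark[]) : null;
  } catch {
    return null; // corrupted JSON: fall back to the default face
  }
}
```

In a page, window.localStorage can be passed directly as the store. Validating on load means a corrupted or outdated entry degrades to the default face instead of producing broken geometry.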

Reduced motion support

When reduced motion is enabled in your operating system (which browsers expose through the prefers-reduced-motion media query), Mirror disables the breathing animation, the eye drift, the brow raises, and the particle field. Lip-sync stays on because it is functional rather than ornamental. The four states still display, but transitions become instant rather than animated.
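
As a sketch of how that preference could translate into avatar settings: the MotionConfig shape, the function name, and the 300 ms default transition are assumptions made for illustration, and only the on/off behaviour comes from the description above.

```typescript
interface MotionConfig {
  breathing: boolean;
  eyeDrift: boolean;
  browRaises: boolean;
  particleField: boolean;
  lipSync: boolean;
  transitionMs: number; // 0 means state changes are instant
}

// Hypothetical default; the article does not give a transition duration.
const DEFAULT_TRANSITION_MS = 300;

// Derives the avatar's motion settings from the user's preference.
function motionConfig(prefersReducedMotion: boolean): MotionConfig {
  const ornamental = !prefersReducedMotion;
  return {
    breathing: ornamental,
    eyeDrift: ornamental,
    browRaises: ornamental,
    particleField: ornamental,
    lipSync: true, // functional, so it stays on either way
    transitionMs: prefersReducedMotion ? 0 : DEFAULT_TRANSITION_MS,
  };
}
```

In a page, the input would typically come from window.matchMedia("(prefers-reduced-motion: reduce)").matches, re-evaluated when the media query's change event fires so that toggling the system setting takes effect without a reload.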

Failure modes and what to do

Microphone permission denied: voice mode shows a permission prompt with a deep link to your browser's site settings. Text mode remains fully usable.
Camera permission denied during face capture: the capture step fails gracefully and the default face is retained.
Browser without in-browser speech recognition: the platform automatically uses the server-side fallback path.
Voice provider outage: voice mode degrades to text-only with a non-blocking toast. The conversation continues in real time.
Slow connection: the platform buffers briefly before resuming playback rather than letting the audio stutter.
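
The list above amounts to a lookup from failure to fallback. A minimal sketch follows, in which the Failure identifiers, the Fallback shape, and the notice strings are all hypothetical labels for the behaviours just described.

```typescript
// Hypothetical identifiers for the five failure modes listed above.
type Failure =
  | "micDenied"
  | "cameraDenied"
  | "noBrowserStt"
  | "voiceProviderDown"
  | "slowConnection";

interface Fallback {
  mode: "voice" | "text"; // which input/output mode stays in use
  notice: string;         // what the user is shown or what happens
}

const fallbacks: Record<Failure, Fallback> = {
  micDenied: { mode: "text", notice: "permission prompt with link to site settings" },
  cameraDenied: { mode: "voice", notice: "default face retained" },
  noBrowserStt: { mode: "voice", notice: "server STT badge" },
  voiceProviderDown: { mode: "text", notice: "non-blocking toast" },
  slowConnection: { mode: "voice", notice: "brief buffering before playback resumes" },
};

function fallbackFor(failure: Failure): Fallback {
  return fallbacks[failure];
}
```

Typing the table as Record<Failure, Fallback> makes the compiler flag any failure mode that is added later without a matching fallback.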
