Work / mobile

Voicifire

React Native app that listens to short speech clips and tells you what's actually wrong with your delivery. Pace, clarity, filler density, tone.

Mobile · AI · Audio · Speech
Role
Creator & Lead Engineer
Date
2025-02-01
Read time
2 min

I started Voicifire because I was practicing a talk and realized something: I had no idea how I actually sounded. Recording myself helped, but listening back was slow and I was a terrible judge of my own pace. I'd hit a section and think it was fine; a friend would listen and tell me I'd rushed it.

The premise of the app is small: a user records a short clip, the app extracts useful prosodic features, and a coaching layer surfaces three or four concrete things to change. Not "your tone could improve". Specifically: "your pace jumped from 165 words per minute to 220 between the second and third paragraph; the listener feels rushed there".

What's actually in the pipeline

A React Native UI for recording, playback, and session review. The recording side does on-device pre-processing for noise floor and silence trimming, because uploading two minutes of raw audio from a phone on cellular is a bad experience.
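Roughly, the trimming step looks like this. A minimal sketch, assuming 16 kHz mono PCM decoded into a Float32Array; the frame size, noise-floor percentile, and 6 dB margin are illustrative choices, not the app's actual constants:

```typescript
// Frame-level RMS energy of a PCM buffer.
function rms(frame: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

function trimSilence(
  samples: Float32Array,
  sampleRate = 16_000,
  frameMs = 20,
): Float32Array {
  const frameLen = Math.floor((sampleRate * frameMs) / 1000);
  const energies: number[] = [];
  for (let start = 0; start + frameLen <= samples.length; start += frameLen) {
    energies.push(rms(samples.subarray(start, start + frameLen)));
  }
  if (energies.length === 0) return samples;

  // Estimate the noise floor as the 10th-percentile frame energy,
  // then treat frames within ~6 dB of it as silence.
  const sorted = [...energies].sort((a, b) => a - b);
  const noiseFloor = sorted[Math.floor(sorted.length * 0.1)];
  const threshold = noiseFloor * 2; // 2x amplitude ≈ +6 dB over the floor

  // Drop leading and trailing silent frames, keep everything between.
  const first = energies.findIndex((e) => e > threshold);
  let last = energies.length - 1;
  while (last >= 0 && energies[last] <= threshold) last--;
  if (first === -1 || last < first) return new Float32Array(0);

  return samples.slice(first * frameLen, (last + 1) * frameLen);
}
```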

The inference layer extracts prosodic features: pace in words per minute over rolling windows, pitch range, filler-word density (counted with a small fine-tuned model), and a few clarity proxies derived from spectral entropy. A coaching engine maps deltas in those features to short, specific prompts. The user sees three or four bullets per session, not a wall of metrics.
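As a sketch of the pace feature, assuming the recognizer emits per-word timestamps (window and hop sizes here are illustrative):

```typescript
// One recognized word with its timing in the clip.
interface Word {
  text: string;
  startSec: number;
  endSec: number;
}

// Pace in words per minute over overlapping rolling windows.
function paceWpm(words: Word[], windowSec = 10, hopSec = 2.5): number[] {
  if (words.length === 0) return [];
  const clipEnd = words[words.length - 1].endSec;
  const series: number[] = [];
  for (let t = 0; t + windowSec <= clipEnd; t += hopSec) {
    const inWindow = words.filter(
      (w) => w.startSec >= t && w.startSec < t + windowSec,
    ).length;
    series.push((inWindow * 60) / windowSec); // words/window -> words/minute
  }
  return series;
}
```

Filler density can come out of the same loop, counting only model-flagged fillers per window instead of all words.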

History matters more than single-session scores. People plateau. The progress view shows trends across sessions, so you see whether your filler density is actually trending down or whether you just had a good day.
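The trend itself can be as simple as a least-squares slope over per-session values. A sketch, assuming one filler-density number per session:

```typescript
// Least-squares slope of a per-session metric over session index.
// A slope near zero over recent sessions reads as a plateau;
// one good session barely moves it.
function trendSlope(values: number[]): number {
  const n = values.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (values[i] - meanY);
    den += (i - meanX) ** 2;
  }
  return num / den; // change in fillers-per-minute per session
}

// e.g. trendSlope([4.1, 3.8, 4.0, 3.2, 3.0]) < 0 -> actually trending down
```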

What was hard

Latency. The first version ran inference server-side and the round-trip felt sluggish. We moved most of the feature extraction on-device, kept the heavier classification layer in the cloud, and got the session-review screen down to roughly 800ms after recording stops. Not instant, but well under the user's patience threshold.
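The shape of that split, sketched below: the device ships a compact feature payload instead of raw audio. The endpoint, field names, and types here are hypothetical, not the app's actual API:

```typescript
// Hypothetical feature payload extracted on-device.
interface FeaturePayload {
  sessionId: string;
  paceWpm: number[];              // rolling-window pace series
  pitchRangeHz: [number, number]; // min/max fundamental frequency
  fillerPerMin: number;
  spectralEntropy: number[];      // clarity proxies per window
}

// Send features to the cloud classification layer (endpoint is made up).
async function classifyRemote(payload: FeaturePayload): Promise<unknown> {
  const res = await fetch("https://api.example.com/v1/classify", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`classify failed: ${res.status}`);
  return res.json();
}
```

A payload like this is a few kilobytes where the raw clip would be megabytes, which is where most of the round-trip savings come from.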

The filler-word model was the surprise. Generic speech-to-text under-counts "uh" and "um" because they're not lexically informative. We needed a small model trained specifically to catch them, and even then it's calibrated per accent. A French speaker's filler density looks different from a North American one. The current model handles both reasonably well; it's not perfect.
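One way to make that calibration concrete: score a speaker's filler density against a baseline for their accent group rather than a single global threshold. A sketch with illustrative numbers, not the app's actual baselines:

```typescript
// Stored per-accent-group baseline for filler density.
interface Baseline {
  mean: number; // typical fillers per minute for this group
  std: number;
}

const baselines: Record<string, Baseline> = {
  "en-NA": { mean: 4.5, std: 1.8 }, // illustrative values
  "fr-FR": { mean: 2.9, std: 1.2 },
};

// Report how unusual a speaker's filler rate is *for their accent group*.
function calibratedFillerScore(fillersPerMin: number, accent: string): number {
  const b = baselines[accent] ?? baselines["en-NA"];
  return (fillersPerMin - b.mean) / b.std; // z-score against the group baseline
}
```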

What's open

The coaching prompts are still rule-based. They work, but a learned coaching layer (which prompts produce the most user improvement?) is the next thing I'd build. The data is there. The feedback loop just isn't tightened yet.
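For concreteness, the rule-based layer can be as simple as a list of functions from feature deltas to optional prompts; the thresholds and wording below are illustrative:

```typescript
// Features the coaching rules inspect for one session.
interface SessionFeatures {
  paceWpm: number[];
  fillerPerMin: number;
  prevFillerPerMin: number;
}

// A rule either fires with a prompt or stays silent.
type Rule = (f: SessionFeatures) => string | null;

const rules: Rule[] = [
  (f) => {
    if (f.paceWpm.length === 0) return null;
    const max = Math.max(...f.paceWpm);
    const min = Math.min(...f.paceWpm);
    return max - min > 40
      ? `Your pace swung from ${Math.round(min)} to ${Math.round(max)} wpm; the fastest stretch will feel rushed.`
      : null;
  },
  (f) =>
    f.fillerPerMin > f.prevFillerPerMin * 1.25
      ? "Filler words are up noticeably from your last session; try pausing instead of bridging."
      : null,
];

// Run every rule, keep the first few prompts that fired.
function coach(f: SessionFeatures, maxPrompts = 4): string[] {
  return rules
    .map((rule) => rule(f))
    .filter((p): p is string => p !== null)
    .slice(0, maxPrompts);
}
```

A learned layer would replace the hand-set thresholds with whatever the improvement data says actually moves people.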

Stack
React Native · Audio ML · TypeScript · AI