Voicifire
React Native app that listens to short speech clips and tells you what's actually wrong with your delivery. Pace, clarity, filler density, tone.
I started Voicifire because I was practicing a talk and realized something: I had no idea how I actually sounded. Recording myself helped, but listening back was slow and I was a terrible judge of my own pace. I'd hit a section and think it was fine; a friend would listen and tell me I'd rushed it.
The premise of the app is small: a user records a short clip, the app extracts useful prosodic features, and a coaching layer surfaces three or four concrete things to change. Not "your tone could improve". Specifically: "your pace dropped from 165 words per minute to 220 between the second and third paragraph; the listener feels rushed there".
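To make that output concrete, here's a rough sketch of the shape a session result could take. The type names and fields are illustrative only, not the app's actual schema.

```ts
// Hypothetical shapes for a session's output, named for illustration only.
type CoachingPrompt = {
  metric: "pace" | "fillerDensity" | "pitchRange" | "clarity";
  // Where in the clip the issue was observed, in seconds.
  window: { startSec: number; endSec: number };
  // Human-readable, specific advice, e.g. "pace jumped from 165 to 220 wpm here".
  message: string;
};

type SessionResult = {
  recordedAtIso: string;
  durationSec: number;
  // The UI shows at most three or four of these per session.
  prompts: CoachingPrompt[];
};
```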
What's actually in the pipeline
A React Native UI for recording, playback, and session review. The recording side does on-device pre-processing for noise floor and silence trimming, because uploading two minutes of raw audio from a phone on cellular is a bad experience.
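Here's a minimal sketch of what that trim step can look like, assuming mono PCM samples are already available as a Float32Array. The frame size, percentile, and threshold multiplier are placeholder values, not the app's tuned numbers.

```ts
// Sketch of the on-device trim: estimate a noise floor from frame energy,
// then drop leading and trailing frames that stay near it.
function trimSilence(samples: Float32Array, sampleRate: number): Float32Array {
  const frameLen = Math.floor(sampleRate * 0.02); // 20 ms frames
  const rms: number[] = [];
  for (let i = 0; i + frameLen <= samples.length; i += frameLen) {
    let sum = 0;
    for (let j = i; j < i + frameLen; j++) sum += samples[j] * samples[j];
    rms.push(Math.sqrt(sum / frameLen));
  }
  // Noise floor as a low percentile of frame energy; speech sits well above it.
  const sorted = [...rms].sort((a, b) => a - b);
  const noiseFloor = sorted[Math.floor(sorted.length * 0.1)] ?? 0;
  const threshold = noiseFloor * 3;

  const first = rms.findIndex((r) => r > threshold);
  if (first === -1) return samples; // no speech detected; leave the clip alone
  let last = rms.length - 1;
  while (last > 0 && rms[last] <= threshold) last--;

  return samples.slice(first * frameLen, (last + 1) * frameLen);
}
```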
The inference layer extracts prosodic features: pace in words per minute over rolling windows, pitch range, filler-word density (counted with a small fine-tuned model), and a few clarity proxies derived from spectral entropy. A coaching engine maps deltas in those features to short, specific prompts. The user sees three or four bullets per session, not a wall of metrics.
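A rough sketch of the pace feature and one delta-based rule, assuming word timestamps from transcription are already available. The window length and the 25% jump threshold are illustrative, not the shipped values.

```ts
// Pace in words per minute over rolling windows, plus one rule that turns a
// pace delta into a specific prompt.
type WordStamp = { word: string; startSec: number };

function paceWpm(words: WordStamp[], windowSec = 15, stepSec = 5): number[] {
  if (words.length === 0) return [];
  const end = words[words.length - 1].startSec;
  const wpm: number[] = [];
  for (let t = 0; t <= end; t += stepSec) {
    const count = words.filter(
      (w) => w.startSec >= t && w.startSec < t + windowSec
    ).length;
    wpm.push((count / windowSec) * 60);
  }
  return wpm;
}

function pacePrompt(wpm: number[]): string | null {
  for (let i = 1; i < wpm.length; i++) {
    if (wpm[i - 1] > 0 && wpm[i] > wpm[i - 1] * 1.25) {
      return `Your pace jumped from ${Math.round(wpm[i - 1])} to ${Math.round(
        wpm[i]
      )} wpm partway through; the listener feels rushed there.`;
    }
  }
  return null;
}
```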
History matters more than single-session scores. People plateau. The progress view shows trends across sessions, so you see whether your filler density is actually trending down or whether you just had a good day.
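The trend check behind that view doesn't need to be fancy. A sketch: an ordinary least-squares slope over per-session filler densities, where a negative slope means the metric is genuinely trending down rather than reflecting one good day.

```ts
// Least-squares slope of a metric across sessions (session index as x).
function trendSlope(values: number[]): number {
  const n = values.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (values[i] - meanY);
    den += (i - meanX) ** 2;
  }
  return num / den;
}
```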
What was hard
Latency. The first version ran inference server-side and the round-trip felt sluggish. We moved most of the feature extraction onto the device, kept the heavier classification layer in the cloud, and got the session-review screen down to roughly 800 ms after recording stops. Not instant, but well under the user's patience threshold.
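The split looks roughly like this in practice. The endpoint URL, payload shape, and helper name below are placeholders for illustration, not the real service.

```ts
// On-device: cheap prosodic extraction. Cloud: the heavier classification.
type Features = { wpm: number[]; pitchRangeHz: number; fillerCount: number };

async function reviewSession(samples: Float32Array, sampleRate: number) {
  // Extract locally so we upload a small feature payload instead of
  // two minutes of raw audio over cellular.
  const features: Features = extractFeaturesOnDevice(samples, sampleRate);

  const res = await fetch("https://api.example.com/classify", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(features),
  });
  return res.json(); // prompts rendered by the session-review screen
}

// Hypothetical on-device extractor; implementation not shown here.
declare function extractFeaturesOnDevice(
  samples: Float32Array,
  sampleRate: number
): Features;
```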
The filler-word model was the surprise. Generic speech-to-text under-counts "uh" and "um" because they're not lexically informative. We needed a small model trained specifically to catch them, and even then it's calibrated per accent. A French speaker's filler density looks different from a North American one. The current model handles both reasonably; it's not perfect.
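One way to express that calibration, sketched here with a hypothetical baseline table rather than the model's actual numbers: score filler density relative to a locale baseline instead of one global threshold.

```ts
// Relative filler score: >1 means more fillers than typical for this accent
// group. Baselines would come from the model's calibration data; none are
// hard-coded here.
function fillerScore(
  fillersPerMinute: number,
  locale: string,
  baselines: Record<string, number>
): number {
  const baseline = baselines[locale] ?? baselines["default"];
  return fillersPerMinute / baseline;
}
```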
What's open
The coaching prompts are still rule-based. They work, but a learned coaching layer (which prompts actually produce the most user improvement?) is the next thing I'd build. The data is there; the feedback loop just isn't closed yet.