A voice assistant that holds natural, phone-style conversations: it listens, understands, replies in lifelike speech, and remembers context across the conversation. It is built on a split architecture, so it stays fast without giving up memory. Built solo, end to end.
The goal was to handle real conversations by voice, not text. But off-the-shelf voice bots fail in one of two ways: they're too slow to feel human, or they're fast but forget everything the moment the topic moves on. Neither is usable for a real assistant.
Split the work. A fast path handles the live conversation so replies feel instant, while a separate path maintains memory and context in the background. The caller gets speed and continuity at the same time, without the trade-off.
The caller speaks naturally. Their audio is streamed and converted to text in real time so the system can start responding without waiting for them to finish.
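A minimal sketch of the streaming idea, assuming a hypothetical audio source where each chunk is already one recognized word. `fake_audio_stream` and `stream_transcripts` are illustrative names, not the project's actual code or any real speech API. The point is only that a partial transcript is available after every chunk, so downstream steps can start before the caller stops talking.

```python
import asyncio
from typing import AsyncIterator


async def fake_audio_stream(words: list[str]) -> AsyncIterator[str]:
    """Stand-in for a live audio feed: each 'chunk' is one recognized word."""
    for word in words:
        await asyncio.sleep(0)  # hand control back to the loop, as a network read would
        yield word


async def stream_transcripts(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Emit a growing partial transcript after every incoming chunk."""
    heard: list[str] = []
    async for chunk in chunks:
        heard.append(chunk)
        yield " ".join(heard)


async def collect_partials(words: list[str]) -> list[str]:
    """Gather every partial transcript, in the order it was produced."""
    return [p async for p in stream_transcripts(fake_audio_stream(words))]
```

In a real pipeline the consumer would act on each partial as it arrives rather than collecting them into a list; the list is only there to make the behavior easy to inspect.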
A low-latency language model generates the reply for the live turn. This path is optimized purely for speed, so the conversation never feels like it's lagging or buffering.
The response is converted back into natural, lifelike speech and played to the caller, completing the turn in a way that feels like talking to a person.
Separately and asynchronously, the system stores and retrieves longer-term context and knowledge, so the assistant remembers earlier points without ever slowing the live reply.
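One way to picture the two paths, as a sketch under stated assumptions: `MemoryStore`, `fast_reply`, and `handle_turn` are hypothetical stand-ins, the store is an in-process list, and the model call is a placeholder string. The live turn reads whatever memory already holds and schedules the slow write as a background task, so the reply never waits on storage.

```python
import asyncio


class MemoryStore:
    """Hypothetical long-term store; a real one would be a database or vector index."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    async def save(self, fact: str) -> None:
        await asyncio.sleep(0.01)  # simulate a slow write
        self.facts.append(fact)

    def snapshot(self) -> list[str]:
        return list(self.facts)  # cheap read of what is already stored


async def fast_reply(utterance: str, context: list[str]) -> str:
    """Placeholder for the low-latency model call."""
    return f"reply to {utterance!r} with {len(context)} remembered facts"


async def handle_turn(utterance: str, memory: MemoryStore, background: set) -> str:
    context = memory.snapshot()
    task = asyncio.create_task(memory.save(utterance))  # write happens off the live path
    background.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(background.discard)
    return await fast_reply(utterance, context)


async def demo() -> tuple[list[str], list[str]]:
    memory, background = MemoryStore(), set()
    replies = []
    for utterance in ["my name is Ada", "what is my name"]:
        replies.append(await handle_turn(utterance, memory, background))
        await asyncio.sleep(0.05)  # gap between turns; the background write lands here
    await asyncio.gather(*background)
    return replies, memory.facts
```

The trade-off in this design is that a fact saved during one turn is only visible to later turns, which is usually acceptable because the current utterance is already in the live prompt.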
The whole flow (capture, response, speech, and memory) is orchestrated end to end, with reliable routing, retries, and integration into the surrounding systems.
Speed and memory at once. The live conversation runs on a fast path while context is maintained in parallel, so the assistant is both quick and aware, with no trade-off between the two.
Voice only feels natural under a tight latency budget. Streaming transcription and a speed-tuned response path keep replies fast enough to feel like a real conversation.
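A latency budget is simple arithmetic: add up the per-stage delays on the live path and compare the total to a target. The sketch below shows the bookkeeping only. The stage names, the millisecond figures, and the 800 ms target are placeholder assumptions for illustration, not measurements from this project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int


def check_budget(stages: list[Stage], budget_ms: int) -> tuple[int, int]:
    """Return (total latency, remaining headroom); negative headroom means over budget."""
    total = sum(stage.latency_ms for stage in stages)
    return total, budget_ms - total


# Placeholder numbers, chosen only to make the arithmetic concrete.
LIVE_PATH = [
    Stage("streaming transcription", 150),
    Stage("model first token", 300),
    Stage("speech synthesis first audio", 200),
    Stage("network round trips", 100),
]

total_ms, headroom_ms = check_budget(LIVE_PATH, budget_ms=800)
```

Tracking the budget per stage makes it clear which step to tune when the total drifts past the target.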
Instead of forcing a choice between speed and memory, I split the system into a fast live path and an async memory path. That split was the key insight that made it both responsive and context-aware.
The system detects when the caller has finished speaking and handles pauses and interruptions, so the assistant responds at the right moment instead of talking over them.
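A common baseline for end-of-turn detection is a silence timer: treat the turn as finished once the caller has spoken and then stayed quiet for long enough. The sketch below assumes per-frame energy values and uses a hypothetical threshold, frame size, and silence window. It is a swapped-in illustration of the general technique, not the detector used here, and it does not cover interruptions.

```python
from typing import Optional, Sequence


def find_end_of_turn(
    energies: Sequence[float],
    threshold: float = 0.1,      # assumed energy level separating speech from silence
    frame_ms: int = 20,          # assumed frame duration
    min_silence_ms: int = 600,   # assumed pause length that counts as "finished"
) -> Optional[int]:
    """Return the frame index at which the turn is judged complete, or None."""
    frames_needed = min_silence_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for index, energy in enumerate(energies):
        if energy >= threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the timer, so short pauses are tolerated
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None  # still talking, or never started
```

With the defaults, a 200 ms mid-sentence pause does not end the turn, while 600 ms of silence after speech does. Leading silence before the caller says anything is ignored.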
Speech and AI services fail and throttle. Retries and fallbacks keep a live call from breaking when one step has a bad moment.
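The retry-then-fallback pattern can be sketched as follows. `call_with_fallback` and its parameters are hypothetical names for illustration; each provider is assumed to be a plain callable that either returns a result or raises. Each provider is retried with exponential backoff before the next one is tried.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    text: str,
    retries: int = 2,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying with backoff before moving on."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider(text)
            except Exception as exc:  # a real system would catch narrower error types
                last_error = exc
                if attempt < retries:
                    sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x the base delay
    raise RuntimeError("all providers failed") from last_error
```

On a live call the delays need to stay small, since every retry spends part of the latency budget; falling back quickly is often better than retrying the same service many times.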
The assistant connects to the surrounding tools and workflows so conversations actually do something, not just talk.
Voice pipeline, response logic, memory layer, and orchestration: all designed, built, and delivered by one engineer as a working MVP.
I turn messy business problems into reliable AI systems (voice agents, content platforms, RAG, and automation), designed and shipped solo.