Antor

Voxly (internal venture)

Shipped a sub-second-latency Bangla voice-to-text product with 92% word accuracy in 14 weeks.

Voxly — Bangla voice-to-text MVP launch

  • Services: AI Solutions, Web Development, Brand Identity
  • Timeline: 14 weeks (concept → public beta)
  • Year: 2024
  • Role: Founder + Engineering lead
  • Team size: 4 (1 ML, 2 engineers, 1 designer)

The Challenge

What needed solving.

The major AI labs have systematically under-invested in Bangla voice processing. In 2024, the accuracy available from off-the-shelf APIs was well below what English speakers experience. For 230 million Bangla speakers, that gap is a source of daily friction.

The challenge wasn't whether the technology could work: by 2024 the underlying acoustic models were strong enough. It was whether we could build a real-time streaming product around them with the latency, reliability, and developer experience needed for it to actually get used.

The Approach

Four-step playbook for this engagement.

  1. Eval harness first

    Before any model selection, we built an evaluation pipeline: 200 hours of transcribed Bangla speech across regional accents, real-world noise conditions, and conversational cadences. Every model we tested ran against this corpus before we made any decisions.

  2. Streaming architecture

    Designed a streaming pipeline targeting under 800 ms latency on cold start and under 300 ms when warm. Rejected three architecture sketches before settling on one that hit the budget without sacrificing accuracy.

  3. Productize the API

    Built developer SDKs for JavaScript, Python, and Go. Sample apps for the three highest-value use cases. Documentation that a new developer can ship against in under 10 minutes.

  4. Closed beta

    Recruited 12 developer beta users across podcasting, journalism, and accessibility. Their feedback shaped the second iteration of the streaming protocol and the auth flow.
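
To make the eval-harness step concrete, here is a minimal sketch of how a word-accuracy score could be computed over a transcribed corpus. It assumes "word accuracy" means 1 minus word error rate (WER), which the case study does not state; the function names and the romanized sample sentences are illustrative placeholders, not Voxly's actual harness or data.

```python
def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level word accuracy, assumed here to be 1 - WER.

    `pairs` holds (reference transcript, model output) strings.
    The result can drop below 0 if the model inserts many extra words.
    """
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        errors += word_edit_distance(ref, hypothesis.split())
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("reference corpus is empty")
    return 1.0 - errors / ref_words


# Hypothetical two-utterance corpus (romanized placeholders).
sample = [
    ("ami bhat khai", "ami bhat khai"),
    ("tumi kemon acho", "tumi kemon"),
]
score = word_accuracy(sample)
```

Scoring every candidate model with the same function over the same corpus is what makes the numbers comparable. A real harness would likely also normalize punctuation and Unicode forms before splitting into words.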

Visual Journey

Selected frames from the project.

Real images land in Phase 12 / 13 with the Sanity content load. Placeholders below preserve the layout.

  • Desktop transcription UI
  • Developer dashboard
  • Streaming architecture diagram
  • Beta user research

Results

What changed.

92% Bangla word accuracy
<800ms Streaming latency
14 weeks Concept → beta
12 Beta partners onboarded
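
As a rough illustration of how latency figures like these could be checked against the stated budgets (under 800 ms cold start, under 300 ms warm), here is a minimal sketch. The case study does not say which statistic the targets apply to; the 95th percentile used here, and all function and constant names, are assumptions made for the example.

```python
import math

# Budgets from the streaming targets in the case study, in milliseconds.
COLD_START_BUDGET_MS = 800
WARM_BUDGET_MS = 300


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample list (0 < pct <= 100)."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def within_budget(samples_ms: list[float], budget_ms: float, pct: float = 95.0) -> bool:
    """True if the chosen percentile (assumed p95) is strictly under budget."""
    return percentile(samples_ms, pct) < budget_ms


# Hypothetical measurements, not real Voxly data.
warm_ok = within_budget([120, 180, 250, 290], WARM_BUDGET_MS)
cold_ok = within_budget([700, 820], COLD_START_BUDGET_MS)
```

Checking a tail percentile rather than the mean keeps a few slow requests from hiding behind many fast ones, which matters for a real-time product.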

Tech Stack

Tools used.

PyTorch, FastAPI, Next.js, WebSockets, AWS, Postgres