Quick Summary

Drives the model serving layer for voice workloads, optimizing latency and throughput for production-grade inference engines like Whisper, Parakeet, and Kokoro.

Required Skills

ML Engineering Excellence Voice AI Inference Stacks Model Serving GPU Optimization Streaming Audio Time-aware Quality Evaluation Voice UI Batching Strategies Model Partnerships Next-generation Model Support

Job Description

About the Role

Together AI is building the best inference infrastructure for voice applications. Our Voice AI platform powers production-grade, real-time voice agents and applications — serving speech-to-text and text-to-speech models with best-in-class latency and reliability.

We're looking for a Staff ML Engineer to drive the model serving layer for voice workloads. You'll work hands-on with inference engines like TRT-LLM and SGLang to optimize how we serve models like Whisper, Parakeet, Orpheus, and Kokoro — pushing latency and throughput to the frontier. You'll profile GPU utilization, design batching strategies for streaming audio, and ensure new model architectures can go from research to production quickly.

This is a foundational hire on a small, high-impact team. Voice inference has unique challenges — streaming audio, tokenization, real-time latency budgets — that require dedicated ML engineering focus. You'll shape how Together serves voice models as the industry moves from pipeline architectures (ASR → LLM → TTS) toward end-to-end speech-to-speech.

• Own the model serving stack that powers Together's voice platform across STT, TTS, and speech-to-speech.

• Work directly with state-of-the-art accelerators (H100s, H200s, B200s) to optimize voice model inference.

• Collaborate with model partners (Cartesia, Deepgram, Rime, and others) to bring their models to production on Together's infrastructure.

• Build quality evaluation frameworks that guide model selection for customers and inform the roadmap.

• Join a small, early-stage team with outsized impact on a fast-growing product area.

Responsibilities

• Own the voice inference roadmap end-to-end — define and execute the technical strategy for optimizing STT, TTS, and speech-to-speech models across Together's infrastructure, with a clear-eyed view of where the field is heading and how to position the platform ahead of it.

• Drive best-in-class inference performance — architect and implement systems targeting leading TTFB, throughput, and GPU utilization for voice workloads; set the performance bar others in the industry measure against, not just catch up to.

• Lead productionization of voice models at scale — design the serving architecture for serverless and dedicated endpoints, including batching strategies, streaming inference pipelines, and memory management tailored to real-time audio; own reliability and latency SLAs.

• Build the voice evaluation platform — design a rigorous, extensible evaluation framework covering WER across accents, languages, and noise conditions for STT; naturalness, latency, and pronunciation fidelity for TTS; establish the internal benchmark methodology that informs model selection and roadmap decisions.

• Shape the architecture for next-generation model support — anticipate and enable emerging model paradigms — audio-native LLMs, codec-based architectures (SNAC, Encodec), and end-to-end speech-to-speech systems — before they're mainstream, not after.

• Serve as the technical DRI for model partner integrations — lead deep collaboration with partners such as Cartesia, Deepgram, and Rime; own the full lifecycle from integration to optimization to ongoing performance accountability.

• Diagnose and resolve the hardest performance problems in the stack — conduct systematic profiling and root-cause analysis from GPU kernel behavior to framework-level bottlenecks; drive shipped improvements with documented, measurable impact.

• Influence platform architecture across the organization — partner with platform engineering leadership to ensure the serving layer is built for the latency and reliability demands of real-time voice APIs; your technical decisions should raise the ceiling for the whole team.

• Define and scale voice fine-tuning capabilities — lead the technical direction for enabling customers to fine-tune STT and TTS models on Together's infrastructure, establishing the primitives for differentiated voice experiences.

• Lay technical foundations for a category-defining product surface — architect systems with enough foresight that they support multiple new voice products with minimal rework; think in terms of platforms, not point solutions.

Requirements

• 8+ years of ML engineering experience, with a demonstrated focus on model serving, inference optimization, or ML infrastructure at production scale — including systems you've owned from design through live traffic.

• Deep, practical expertise in LLM serving engines (vLLM, SGLang, TensorRT-LLM, or equivalent) — you've modified engine internals, debugged edge cases under load, and contributed improvements back; you don't stop at the API surface.

• Expert-level Python and PyTorch proficiency, with a strong command of GPU optimization — CUDA kernels, memory hierarchies, profiling toolchains — and a track record of turning that knowledge into shipped latency or throughput wins.

• Proven system design judgment — you've made architectural decisions that held up at scale and influenced how a team or platform evolved; you can articulate the tradeoffs you made and why.

• Strong technical leadership — you operate with high autonomy, define the right problems before solving them, and raise the bar for engineering quality around you without requiring process overhead.

• Sharp product intuition for developer tooling — you understand what voice application developers actually need to ship great products, and you let that shape your technical priorities, not just the other way around.

• Proven ability to move fast in ambiguous environments — you've thrived on early-stage or platform teams where scope is wide, ownership is deep, and the roadmap you build is the one you execute.

• Strong foundation in speech and audio ML (ASR/TTS architectures, audio signal processing) — directly relevant experience is strongly preferred; exceptional ML engineering fundamentals with genuine curiosity about the domain is also considered.

• Familiarity with audio codec and tokenization schemes (SNAC, Encodec, DAC) is a meaningful plus at this level.

• Experience training or fine-tuning speech models at scale is a significant advantage.

• Bachelor's or Master's in Computer Science, Electrical Engineering, or related field — or equivalent depth demonstrated through your work.

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancement such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers in our journey in building the next generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $220,000 - $280,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

Staff Machine Learning Engineer, Voice AI

Interested in this position?

Quick Summary

Required Skills

Job Description

Why Apply Through MisuJob?

Frequently Asked Questions

How do I apply for this position?

Is MisuJob free for job seekers?

How does AI matching work?

Can I apply to jobs in other countries?

Ready to Apply?