Staff Machine Learning Engineer, Voice AI

Together AI

San Francisco (permanent)

Posted: May 19, 2026

Quick Summary

Drives the model serving layer for voice workloads, optimizing latency and throughput for production-grade serving of models like Whisper, Parakeet, and Kokoro.

Job Description

About the Role

Together AI is building the best inference infrastructure for voice applications. Our Voice AI platform powers production-grade, real-time voice agents and applications — serving speech-to-text and text-to-speech models with best-in-class latency and reliability.

We're looking for a Staff ML Engineer to drive the model serving layer for voice workloads. You'll work hands-on with inference engines like TRT-LLM and SGLang to optimize how we serve models like Whisper, Parakeet, Orpheus, and Kokoro — pushing latency and throughput to the frontier. You'll profile GPU utilization, design batching strategies for streaming audio, and ensure new model architectures can go from research to production quickly.

This is a foundational hire on a small, high-impact team. Voice inference has unique challenges — streaming audio, tokenization, real-time latency budgets — that require dedicated ML engineering focus. You'll shape how Together serves voice models as the industry moves from pipeline architectures (ASR → LLM → TTS) toward end-to-end speech-to-speech.

• Own the model serving stack that powers Together's voice platform across STT, TTS, and speech-to-speech.

• Work directly with state-of-the-art accelerators (H100s, H200s, B200s) to optimize voice model inference.

• Collaborate with model partners (Cartesia, Deepgram, Rime, and others) to bring their models to production on Together's infrastructure.

• Build quality evaluation frameworks that guide model selection for customers and inform the roadmap.

• Join a small, early-stage team with outsized impact on a fast-growing product area.

Responsibilities

• Own the voice inference roadmap end-to-end — define and execute the technical strategy for optimizing STT, TTS, and speech-to-speech models across Together's infrastructure, with a clear-eyed view of where the field is heading and how to position the platform ahead of it.

• Drive best-in-class inference performance — architect and implement systems targeting leading TTFB, throughput, and GPU utilization for voice workloads; set the performance bar others in the industry measure against, not just catch up to.

• Lead productionization of voice models at scale — design the serving architecture for serverless and dedicated endpoints, including batching strategies, streaming inference pipelines, and memory management tailored to real-time audio; own reliability and latency SLAs.

• Build the voice evaluation platform — design a rigorous, extensible evaluation framework covering WER across accents, languages, and noise conditions for STT; naturalness, latency, and pronunciation fidelity for TTS; establish the internal benchmark methodology that informs model selection and roadmap decisions.

• Shape the architecture for next-generation model support: anticipate and enable emerging model paradigms (audio-native LLMs, codec-based architectures such as SNAC and Encodec, and end-to-end speech-to-speech systems) before they're mainstream, not after.

• Serve as the technical DRI for model partner integrations — lead deep collaboration with partners such as Cartesia, Deepgram, and Rime; own the full lifecycle from integration to optimization to ongoing performance accountability.

• Diagnose and resolve the hardest performance problems in the stack — conduct systematic profiling and root-cause analysis from GPU kernel behavior to framework-level bottlenecks; drive shipped improvements with documented, measurable impact.

• Influence platform architecture across the organization — partner with platform engineering leadership to ensure the serving layer is built for the latency and reliability demands of real-time voice APIs; your technical decisions should raise the ceiling for the whole team.

• Define and scale voice fine-tuning capabilities — lead the technical direction for enabling customers to fine-tune STT and TTS models on Together's infrastructure, establishing the primitives for differentiated voice experiences.

• Lay technical foundations for a category-defining product surface — architect systems with enough foresight that they support multiple new voice products with minimal rework; think in terms of platforms, not point solutions.
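For readers unfamiliar with the WER metric named in the evaluation responsibility above, the following is a minimal sketch of how word error rate could be broken down by recording condition. It is an illustration only, not Together AI's evaluation framework: the function names (`word_edits`, `wer_by_condition`), the condition labels, and the lowercase-and-split normalization are all assumptions made for the example.

```python
from collections import defaultdict


def word_edits(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def wer_by_condition(samples):
    """Corpus-level WER per condition.

    `samples` is an iterable of (condition, reference, hypothesis) tuples;
    the condition labels (accent, language, noise level) are hypothetical.
    """
    edits = defaultdict(int)
    words = defaultdict(int)
    for condition, reference, hypothesis in samples:
        # Assumed normalization: lowercase and split on whitespace.
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        edits[condition] += word_edits(ref_words, hyp_words)
        words[condition] += len(ref_words)
    # Conditions with no reference words are skipped to avoid division by zero.
    return {c: edits[c] / words[c] for c in words if words[c] > 0}


if __name__ == "__main__":
    samples = [
        ("clean", "the cat sat", "the cat sat"),
        ("noisy", "the cat sat on the mat", "the cat sit on mat"),
    ]
    print(wer_by_condition(samples))
```

Summing edits and reference words per condition before dividing gives a corpus-level rate, so long utterances weigh more than short ones; a real framework would also need a fuller text normalization step (punctuation, numerals, casing rules per language).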

Requirements

• 8+ years of ML engineering experience, with a demonstrated focus on model serving, inference optimization, or ML infrastructure at production scale — including systems you've owned from design through live traffic.

• Deep, practical expertise in LLM serving engines (vLLM, SGLang, TensorRT-LLM, or equivalent) — you've modified engine internals, debugged edge cases under load, and contributed improvements back; you don't stop at the API surface.

• Expert-level Python and PyTorch proficiency, with a strong command of GPU optimization — CUDA kernels, memory hierarchies, profiling toolchains — and a track record of turning that knowledge into shipped latency or throughput wins.

• Proven system design judgment — you've made architectural decisions that held up at scale and influenced how a team or platform evolved; you can articulate the tradeoffs you made and why.

• Strong technical leadership — you operate with high autonomy, define the right problems before solving them, and raise the bar for engineering quality around you without requiring process overhead.

• Sharp product intuition for developer tooling — you understand what voice application developers actually need to ship great products, and you let that shape your technical priorities, not just the other way around.

• Proven ability to move fast in ambiguous environments — you've thrived on early-stage or platform teams where scope is wide, ownership is deep, and the roadmap you build is the one you execute.

• Strong foundation in speech and audio ML (ASR/TTS architectures, audio signal processing). Directly relevant experience is strongly preferred; exceptional ML engineering fundamentals paired with genuine curiosity about the domain are also considered.

• Familiarity with audio codec and tokenization schemes (SNAC, Encodec, DAC) is a meaningful plus at this level.

• Experience training or fine-tuning speech models at scale is a significant advantage.

• Bachelor's or Master's in Computer Science, Electrical Engineering, or related field — or equivalent depth demonstrated through your work.

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance and other competitive benefits. The US base salary range for this full-time position is: $220,000 - $280,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy
