ARCHIVED
This job listing has been archived and is no longer accepting applications.

Data Scientist

Sciforium

San Francisco, California, United States (permanent)

Posted: January 7, 2026

Job Description

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.

Role Overview

Sciforium is seeking a highly technical and visionary Data Scientist to lead the strategy, creation, and curation of the massive datasets that power our foundation models. We believe that in the era of LLMs, data is the primary competitive advantage. In this role, you will own the end-to-end data lifecycle—from raw web-scale crawling to the fine-grained human-alignment datasets that define model behavior.

This position is ideal for a scientist who views data as a high-scale engineering challenge and an analytical puzzle. You will not just "provide" data; you will design the taxonomies, filtering heuristics, and post-training pipelines that ensure our models are world-class in reasoning, safety, and multimodal understanding.

Key Responsibilities

• Foundation Dataset Strategy: Own the end-to-end creation of pre-training datasets for LLMs. This includes defining the mix of web data, code, books, and technical papers to optimize for downstream model performance.

• Petabyte-Scale Curation: Design and implement sophisticated pipelines for data cleaning, exact/fuzzy deduplication, and high-quality signal extraction from petabytes of raw, unstructured data.

• Post-Training & Alignment Data: Lead the development of high-quality post-training datasets, including Supervised Fine-Tuning (SFT) instructions, multi-turn dialogues, and preference modeling data (RLHF/DPO).

• Multimodal Expansion: Drive the acquisition and processing of vision and video data, navigating the complexities of multimodal alignment, video compression, and temporal data consistency.

• High-Performance Engineering: Develop high-throughput data processing scripts using Python, leveraging multiprocessing and multithreading to handle massive-scale ingestion and transformation without bottlenecks.

• Data Profiling & Analysis: Conduct deep-dive statistical analysis on training corpora to identify biases, gaps in knowledge, and quality regressions, ensuring the "diet" of the model is mathematically balanced.

• Synthetic Data Generation: (Added Value) Design pipelines to generate high-reasoning synthetic data to fill gaps in natural datasets, utilizing existing models for data labeling and refinement.

Must-Haves

• 5+ years of industry experience in Data Science or Machine Learning, with a proven track record of building and managing datasets for foundation models.

• Deep Proficiency in Python: Expert-level skills with a focus on high-performance code, including multiprocessing, multithreading, and efficient memory management for large-scale data tasks.

• Petabyte-Scale Experience: Demonstrated experience working with petabyte-scale datasets that have been directly used to train production-grade LLMs or Large Vision Models.

• Dataset Reconstruction: Experience building massive LLM training sets from scratch, including raw web crawls (e.g., Common Crawl) and specialized domain data.

• Post-Training Expertise: Hands-on experience building datasets for RLHF, DPO, and multi-turn instruction following, including the management of human-labeling workflows and quality gold-sets.

• Data Tooling: Mastery of data-at-scale frameworks such as Spark, Ray, or high-performance data-loading formats (e.g., WebDataset, Parquet).

Nice-to-Haves

• Computer Vision (CV) Curation: Experience building large-scale image or video datasets from scratch (e.g., LAION-style pipelines).

• Multimodal Crawling: Familiarity with large-scale crawling of multimodal data and the associated challenges of video processing, codecs, and compression.

• Taxonomy Design: Experience in designing complex labeling schemas for reasoning, coding, and mathematical benchmarks.

• Research Background: A Master’s or PhD in a quantitative field with a focus on data-centric AI or information retrieval.

Benefits include

• Medical, dental, and vision insurance

• 401k plan

• Daily lunch, snacks, and beverages

• Flexible time off

• Competitive salary and equity

Equal opportunity

Sciforium is an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability status.
