Member of Engineering – Pre-training, Data Engineering
Job Description:
- Build and maintain high-performance pipelines for trillions of tokens.
- Deliver diverse, high-quality datasets for pre-training foundation models.
- Work closely with other teams such as Pretraining, Posttraining, Evals and Product to ensure alignment on the quality of the models delivered.
Requirements:
- Strong background in building production-grade, distributed data systems for machine learning, with experience in:
  - Orchestration: Slurm, Airflow, or Dagster
  - Observability & Reliability: CI/CD, Grafana, Prometheus, etc.
  - Infra: Git, Docker, k8s, cloud managed services
  - Batched inference (e.g., vLLM)
- Performance obsession, especially with large-scale GPU clusters and distributed pipelines
- Expert-level Python knowledge and ability to write clean and maintainable code
- Strong algorithmic foundations
- Proficiency with libraries like Polars, Dask, or PySpark
- Nice to have:
  - Experience in building trillion-scale SOTA pretraining datasets
  - Experience translating research to production at scale
  - Experience with OCR, web crawling, or evals
  - Prior experience pre-training LLMs
Benefits:
- Fully remote work & flexible hours
- 37 days/year of vacation & holidays
- Health insurance allowance for you and dependents
- Company-provided equipment
- Wellbeing, always-be-learning and home office allowances
- Frequent team get-togethers
- Great, diverse & inclusive people-first culture