Senior ML Systems Engineer, Frameworks & Tooling

Cohere AI

Reposted 14 Days Ago

In-Office or Remote

Hiring Remotely in Toronto, ON, CAN

Senior level

The Senior ML Systems Engineer will build and maintain the training framework for large-scale language models, focusing on distributed training and performance optimization.

The summary above was generated by AI

Who are we?

Cohere is the leading security-first enterprise AI company. We build cutting-edge foundation AI models and end-to-end products that are designed to solve real-world business problems.

We’re training and deploying frontier models for enterprises that are building AI systems. We believe that our work is instrumental to the widespread adoption of AI, and we are looking for folks who want to be part of that.

We obsess over what we build. Each one of us is responsible for helping increase the capabilities of our models and the value they drive for our customers. Cohere is a team of researchers, engineers, designers, and more, who are all passionate about their craft.

We are a global technology company co-headquartered in Toronto and San Francisco, with key offices in London, New York City, Montreal, Seoul, Germany and Paris. Join us!

We’re looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs.

If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

What You’ll Work On

Build and own the training framework responsible for large-scale LLM training.
Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).
Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training.
Investigate and resolve performance bottlenecks across the ML systems stack.
Build robust systems that ensure reproducible, debuggable, large-scale runs.

You Might Be a Good Fit If You Have

Strong engineering experience in large-scale distributed training or HPC systems.
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
Experience working with containerized environments (Docker, Singularity/Apptainer).
A track record of building tools that increase developer velocity for ML teams.
Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
Strong collaboration skills — you’ll work closely with infra, research, and deployment teams.

Nice to Have

Experience with training LLMs or other large transformer architectures.
Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).
Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).
Experience with data pipeline optimization, sharded datasets, or caching strategies.
Background in performance engineering, profiling, or low-level systems.

Bonus: papers at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

Why Join Us

You’ll work on some of the most challenging and consequential ML systems problems today.
You’ll collaborate with a world-class team working fast and at scale.
You’ll have end-to-end ownership over critical components of the training stack.
You’ll shape the next generation of infrastructure for frontier-scale models.
You’ll build tools and systems that directly accelerate research and model quality.

Sample Projects:

Build a high-performance data loading and caching pipeline.
Implement performance profiling across the ML systems stack.
Develop internal metrics and monitoring for training runs.
Build reproducibility and regression testing infrastructure.
Develop a performant fault-tolerant distributed checkpointing system.

Full-Time Employees at Cohere enjoy these Perks:

A weekly lunch stipend of $75/£75 or the equivalent in your local currency.
Full health and dental benefits, including a separate budget for mental health.
RRSP matching, 401K, Pension Scheme.
100% Parental Leave top-up for up to 6 months, for either parent.
Annual enrichment benefits:
Arts & culture, fitness/wellness, quality time, and a workspace improvement credit.
Education & learning stipend for conferences, courses, and coaching.
6 weeks of paid vacation (30 working days!)
Budget for traveling to other offices if you are remote, plus an annual company offsite.

How and Where We Work:

Cohere is remote-friendly. We have offices in Toronto, San Francisco, New York City, London, Paris, Montreal, and more coming soon.
For those in the office: a daily lunch program, plenty of snacks, and regular community and social events.
For those not near an office: a co-working benefit so you can work alongside others in your city.
Everyone receives a $500 home office stipend to set up their workspace properly.

If any of the above doesn’t line up exactly with your experience, we still encourage you to apply.

We strive to create an inclusive work environment for all; we welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

We may use AI-enabled tools to screen and assess applicants against the criteria for this position. This helps our recruiters identify potentially qualified candidates, but it doesn't limit the applications our recruiters may review or consider.

Beware of Scams: Cohere will never ask for payment or third-party services (e.g., CV writing) as part of our hiring process. All legitimate roles are listed on the Cohere careers page and LinkedIn only, with all communications from Cohere employees coming from an @cohere.com or @cw.cohere email alias. If you see a role listed on another site, please verify it through our official careers page.
