Hyphen Connect Limited

LLM Pre-training & Distributed Engineer (AI Infrastructure)

Posted 6 Days Ago
In-Office or Remote
Hiring Remotely in CA
Senior level
Design, orchestrate, and optimize large-scale LLM pre-training across 1,000+ GPUs. Implement 3D parallelism, manage GPU clusters (SLURM/Kubernetes), optimize InfiniBand/RDMA networking and memory, and automate checkpointing and failure recovery for long training runs.
The summary above was generated by AI

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to ensure efficient and reliable training processes.

Responsibilities:

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Required Skills:

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
  • Experience managing SLURM or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).
