Hyphen Connect Limited
LLM Pre-training & Distributed Engineer (AI Infrastructure)
Be an Early Applicant
Design, orchestrate, and optimize large-scale LLM pre-training across 1,000+ GPUs. Implement 3D parallelism, manage GPU clusters (SLURM/Kubernetes), optimize InfiniBand/RDMA networking and memory, and automate checkpointing and failure recovery for long training runs.
We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive experience in system engineering to ensure efficient and reliable training processes.
Responsibilities:
- Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
- Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
- Automate checkpointing and failure recovery during month-long training runs.
Required Skills:
- Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
- Experience managing SLURM or Kubernetes-based GPU clusters.
- Strong systems engineering background (C++, CUDA, Python).
Similar Jobs
eCommerce • Fintech • Hardware • Payments • Software • Financial Services
Lead end-to-end enterprise sales for Square9s upmarket business: craft deal strategy, manage complex technical integrations and multi-stakeholder negotiations, partner with Solutions Engineering, align internal teams, represent the company to executives, and close high-value contracts while influencing product and go-to-market strategy.
Top Skills:
Ai ToolsAPIsPaymentsSaaSSquare
Artificial Intelligence • Cloud • HR Tech • Information Technology • Productivity • Software • Automation
Lead customer retention and adoption for ServiceNow customers by identifying churn risk, partnering with Sales on adoption/retention plans, advising on governance and SLA issues, and improving customer satisfaction through consulting, project oversight, and executive engagement.
Top Skills:
AIAi-Powered ToolsServicenow
HR Tech • Information Technology • Professional Services • Sales • Software
Design, develop, and maintain scalable backend systems for the Payroll product using a microservices architecture. Own the full development lifecycle from technical design to deployment and monitoring, collaborate with product and front-end teams, build and optimize APIs, and work in a continuous delivery environment with automated QA and testing practices.
Top Skills:
APIsAutomated QaAWSContinuous DeliveryJavaKotlinMicroservicesMockingMonitoringMySQLPostgresScalaTddUnit Testing
What you need to know about the Toronto Tech Scene
Although home to some of the biggest names in tech, including Google, Microsoft and Amazon, Toronto has established itself as one of the largest startup ecosystems in the world. And with over 2,000 startups — more than 30 percent of the country's total startups — Toronto continues to attract new businesses. Be it helping entrepreneurs manage their finances, simplifying business operations by automating payroll or assisting pharmaceutical companies in launching new drugs, the city's tech scene is just getting started.

.png)

