Senior System Administrator / Site Reliability Engineer

Posted 8 Days Ago
Toronto, ON
Hybrid
Senior level
Artificial Intelligence • Machine Learning
The Role
As a Senior System Administrator / Site Reliability Engineer, you will manage the operation of a high-end GPU datacenter, including deploying and maintaining hardware systems, configuring network tools, and automating Linux systems using infrastructure-as-code practices. You will also be responsible for learning new tools and optimizing environments for machine learning workloads.

Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola, Mu Li) and a team of deep learning, optimization, NLP, AutoML and statistics scientists and engineers are working on high-quality generative AI models for language and beyond.


About The Role


We are looking for a Senior Infrastructure Engineer / System Administrator to help us operate our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, OPNsense, networking and related tools is a big plus. You should be comfortable performing some amount of hardware configuration.


You will have the opportunity to work with the latest NVIDIA H100 GPUs, many petabytes of storage, terabit networking and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

  • Manage large, private, high-end GPU clusters
  • Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage and troubleshooting
  • Configure and maintain network switches (Tomahawk TH3, Mellanox InfiniBand)
  • Configure and maintain MAAS (Metal as a Service), Ceph, and Slurm
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
  • Configure and maintain network and security tools, including VPN, VLAN, DHCP, SSO, MFA
  • Learn about new tools and deploy them

You might be a great fit if you have:

  • Strong background in system operations, including Slurm, Ansible, MAAS, Ceph, OPNsense and Kubernetes
  • Experience with on-premises data center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates, e.g. on Supermicro hardware

The ability to solve problems and to learn new techniques is key.

Top Skills

Python
The Company
Toronto
21 Employees
On-site Workplace
Year Founded: 2023

What We Do

We are transforming how stories are told, knowledge is learned, and insights are gathered
