Boson AI

Senior System Administrator / Site Reliability Engineer

Posted 3 Days Ago
Hybrid
Toronto, ON
Senior level
The Senior System Administrator / Site Reliability Engineer will manage high-end GPU clusters and handle the full lifecycle of physical systems. Responsibilities include configuring and maintaining network switches, automating Linux systems, and learning and deploying new tools. Strong problem-solving skills and experience with various infrastructure technologies are essential.
The summary above was generated by AI

Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola and Mu Li) and a team of Deep Learning, Optimization, NLP, AutoML and Statistics scientists and engineers are working on high-quality generative AI models for language and beyond.


About The Role


We are looking for a Senior Infrastructure Engineer / System Administrator to help us operate our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, OPNsense, networking and related tools is a big plus. You should be comfortable performing some hardware configuration.


You will have the opportunity to work with the latest NVIDIA H100 GPUs, many petabytes of storage, terabit networking and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

  • Manage large, private, high-end GPU clusters
  • Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage and troubleshooting
  • Configure and maintain network switches (Tomahawk TH3, Mellanox InfiniBand)
  • Configure and maintain MAAS (Metal as a Service), Ceph, and Slurm
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
  • Configure and maintain network and security tools, including VPN, VLAN, DHCP, SSO, MFA
  • Learn about new tools and deploy them

You might be a great fit if you have:

  • Strong background in system operations, including Slurm, Ansible, MAAS, Ceph, OPNsense and Kubernetes
  • Experience with on-premises data center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates, e.g. on Supermicro hardware

The ability to solve problems and to learn new techniques is key.

Top Skills

Python

Boson AI Toronto, Ontario, CAN Office

Toronto, Canada
