Boson AI

High Performance Computing Engineer

Posted 13 Days Ago
Hybrid
Toronto, ON
Senior level
As a Senior High Performance Computing Engineer, you will manage GPU clusters, oversee data center operations, and deploy high-end infrastructure technologies. You'll troubleshoot and maintain hardware systems and optimize network configurations, while coding and automating processes.
The summary above was generated by AI

Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola, Mu Li) and a team of Deep Learning, Optimization, NLP, AutoML and Statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.


About The Role


We are looking for a Senior High Performance Computing Engineer to help us operate the GPUs, network, and filesystem in our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, InfiniBand, NVIDIA DeepOps, Ethernet networking, and related tools is a big plus. You should be comfortable performing some amount of hardware configuration.


You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 20 PB of storage, terabit networking, and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

  • Manage large, private, high-end GPU clusters
  • Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage, and troubleshooting
  • Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
  • Configure and maintain MAAS, Ceph, Slurm and Kubernetes
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
  • Configure and maintain the network, e.g. Layer 3 networking
  • Learn about new tools and deploy them

You might be a great fit if you have:

  • Strong background in high performance computing
  • Experience with on-premises data center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates, e.g. on SuperMicro hardware

The ability to solve problems and to learn new techniques is key.

Top Skills

Ceph
Ethernet Networking
InfiniBand
Kubernetes
Linux
MAAS
NVIDIA DeepOps
Python
Slurm

Boson AI Toronto, Ontario, CAN Office

Toronto, Canada
