Design, build, and optimize high-performance network infrastructure for AI/ML operations, manage network components, and troubleshoot issues.
About The Role
We're seeking an experienced Network Engineer to design, build, and optimize the high-performance networking infrastructure powering our AI/ML operations in Toronto. You'll work at the cutting edge of network technology, managing InfiniBand and ultra-high-speed Ethernet fabrics that connect NVIDIA H100 and A100 GPUs, over 20 PB of Ceph storage, and hundreds of servers.
You'll be hands-on with the full lifecycle of our network infrastructure: planning, building, testing, deploying, and keeping everything running at peak performance. That means troubleshooting issues as they arise, monitoring network performance and throughput, developing automation to streamline operations, and working closely with HPC and ML teams to ensure they have the bandwidth they need. You'll also help us plan for future capacity and evaluate emerging network technologies as we scale to meet increasingly demanding workloads.
Responsibilities
- Configure and maintain InfiniBand and high-speed Ethernet fabrics
- Optimize network performance for RDMA and GPU-to-GPU communication
- Manage network switches (Mellanox, NVIDIA, Micas Networks)
- Troubleshoot network bottlenecks and latency issues
- Plan and execute network upgrades and expansions
- Implement network security (firewalls, VLANs, ACLs)
- Collaborate on storage network optimization
- Infrastructure monitoring
Minimum Qualifications
- 4+ years of network engineering experience in production environments
- Strong understanding of L2/L3 networking protocols (TCP/IP, BGP, OSPF, VLANs)
- Hands-on experience with high-speed networking (100Gb+ Ethernet and InfiniBand)
- Hands-on experience with network security (firewalls, ACLs, network segmentation)
- Knowledge of HPC network topologies
- Experience with InfiniBand fabrics and related technologies (RDMA, RoCE, IPoIB)
- Strong troubleshooting and problem-solving skills
Preferred Qualifications
- Experience in data center environments or AI/ML infrastructure
- Hands-on experience with high-performance Ethernet switches (e.g., Broadcom Tomahawk) and the latest InfiniBand switches (e.g., NVIDIA/Mellanox)
- Experience optimizing networks for GPU-to-GPU communication
- Experience with open-source firewall solutions (OPNsense, pfSense, or similar)
- Experience with network automation tools
- Understanding of distributed storage networking (Ceph cluster networks)
- Familiarity with network monitoring and observability tools (Prometheus, Grafana)
- Knowledge of multi-site network connectivity and WAN optimization
- Familiarity with cloud networking on at least one platform (AWS, GCP, or Azure), including VPC design, site-to-site VPN configuration, Direct Connect/ExpressRoute/Cloud Interconnect, hybrid cloud connectivity, and cloud-to-datacenter network integration
If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.
Top Skills
AWS
Azure
BGP
Ceph
Ethernet
GCP
Grafana
InfiniBand
IPoIB
OSPF
Prometheus
RDMA
RoCE
TCP/IP
VLANs
Boson AI Toronto, Ontario, CAN Office
Toronto, Canada
