ML Engineer - Infrastructure

Sorry, this job was removed at 04:18 p.m. (EST) on Friday, May 16, 2025

In-Office

7 Locations

ML Engineer (Infrastructure)

Location: San Francisco - Bay Area Hybrid

About the Role:

WitnessAI is a leader in providing innovative networking solutions designed to enhance security, performance, and reliability for businesses of all sizes. We are seeking an ML Infrastructure Engineer to optimize, deploy and scale machine learning models in production environments. You will play a critical role in scaling GPU resources, building continuous learning pipelines, and integrating a variety of inference frameworks. Your expertise in model quantization, pruning, and other optimization techniques will ensure our models run efficiently and effectively.

You will contribute to our mission through the following:

Develop and Optimize: Design and manage scalable GPU infrastructure for model training and inference. Build automated pipelines that accelerate ML workflows, implement feedback loops for continuous learning, and improve model efficiency in resource-constrained environments.
Implement Advanced Inference Solutions: Evaluate and integrate inference platforms like NVIDIA Triton and vLLM to ensure high availability, scalability, and reliability of deployed models.
Collaborate for Impact: Work closely with applied scientists, software engineers, and DevOps professionals to deploy models that drive our company's mission forward. Document best practices to support team knowledge sharing and improve code quality and reproducibility.

The ideal candidate will have expertise in designing, developing, and maintaining scalable ML infrastructure components, including data pipelines and deployment systems. You should have a demonstrated track record of optimizing ML workflows for performance and resource utilization, and stay up to date on best practices for model management and reproducibility. Strong communication skills and the ability to collaborate across functions to execute complex projects are essential.

Qualifications

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

2+ years of experience building and scaling machine learning systems.
Proven experience in scaling GPU resources for machine learning applications.
Experience with inference platforms like NVIDIA Triton, vLLM, or similar.
Demonstrated expertise in model quantization, pruning, and other optimization techniques with frameworks such as TensorRT, ONNX or others.

Skilled in automating data collection, preprocessing, model retraining, and deployment.
Proficient with cloud platforms such as AWS (preferred), GCP, or Azure, especially in deploying and managing GPU instances.
Strong skills in Python; familiarity with other scripting languages is a plus.
Experience with CUDA packages.
Experience with PyTorch, TensorFlow, or similar frameworks.
Proficient in Docker and Kubernetes.
Experience with Jenkins, GitHub CI/CD, or similar tools.
Experience with Prometheus, Grafana, or similar monitoring solutions.

Soft Skills

Strong problem-solving and analytical abilities.
Excellent communication and teamwork skills.
Ability to work independently and manage multiple tasks effectively.
Proactive attitude toward learning and adopting new technologies.

Benefits:

Hybrid work environment.
Competitive salary.
Health, dental, and vision insurance.
401(k) plan.
Opportunities for professional development and growth.
Generous vacation policy.

Salary range:

$140,000-$170,000
