The Lead Site Reliability Engineer will oversee the Infrastructure SRE team, focusing on system reliability, automation, and mentoring while collaborating with product engineering.
JOB DESCRIPTION
We are seeking a Lead Site Reliability Engineer (Infrastructure) to act as technical lead for our Infrastructure SRE team in a fast-moving VSaaS engineering organization. In this role, you will own the team's technical direction and execution across reliability, scalability, and operability of our shared platform and production systems, combining hands-on technical leadership with responsibility for team outcomes.
You will define SRE strategy and guide architecture across our GCP and Kubernetes ecosystem, setting standards for reliability, scalability, GitOps, and observability. You will also mentor senior and staff engineers, and lead incident response and high-impact operational work, contributing hands-on when needed.
Role Overview
Site Reliability Engineer - Infrastructure
In this role, you will translate product and business needs into scalable infrastructure and clear technical direction. With a system-wide view of the platform, you will guide architectural decisions, surface non-obvious risks, and drive long-term improvements to system reliability and operability.
Working closely with product and platform teams, you will shape the developer experience and ensure engineering teams can ship with speed and confidence. You will set engineering standards and continuously evolve our GitOps and observability practices.
This role requires strong expertise in cloud infrastructure, distributed systems, and CI/CD, along with hands-on experience in Golang and/or Python to support automation and long-term system reliability.
Responsibilities
As a Lead Site Reliability Engineer, you will:
- Team Leadership & Execution Ownership: Own technical direction and execution of the Infrastructure SRE team. Translate platform goals into actionable plans, ensuring alignment on priorities, reliability outcomes, and operational excellence across production systems.
- Production Operations & Incident Management: Operate and evolve large-scale distributed systems in production, proactively identifying failure modes and mitigating risk. Own day-to-day operations including monitoring, alerting, incident response, coordination, post-incident analysis, and continuous improvement.
- Architecture, Standards & Platform Governance: Provide architectural leadership across platform and infrastructure changes, identifying scalability constraints, system design risks, and long-term reliability gaps. Define and enforce engineering standards for GCP, Kubernetes, and ArgoCD, ensuring consistent, secure, GitOps-based delivery.
- Reliability Engineering & Observability: Lead strategy for monitoring, alerting, and system observability, driving a shift from reactive incidents to proactive reliability engineering.
- Enablement, CI/CD & Collaboration: Guide CI/CD and cloud-native delivery practices at scale to ensure safe, scalable releases. Mentor senior and staff engineers, conduct high-impact design and code reviews (Golang/Python), and partner with product and engineering teams to embed system-level thinking across development.
- Hands-on Technical Contribution: Provide hands-on technical contribution where needed, including debugging production issues, reviewing and contributing to code, and supporting critical incident resolution to ensure system reliability and team effectiveness.
- Other duties as assigned are absorbed into the above ownership and operational responsibilities.
Minimum Qualifications
- Leadership & Experience: 10+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering, including demonstrated experience leading technical engineering teams, driving roadmaps, and owning delivery of large-scale production systems.
- Cloud & Distributed Systems Expertise: Deep experience with cloud-native architectures and distributed systems at scale, particularly in GCP and Kubernetes environments. Ability to reason about system design, identify failure modes, and evaluate scalability and reliability risks.
- GitOps & Delivery Engineering: Strong experience with GitOps-based delivery workflows, particularly ArgoCD, and CI/CD pipeline design. Ability to ensure safe, repeatable, and observable production deployments.
- Infrastructure & Automation: Strong hands-on background in infrastructure-as-code (Terraform preferred), automation, and operational tooling. Proficiency in Golang and/or Python for building and reviewing production systems. Strong Linux systems knowledge and production troubleshooting experience.
- Observability & Reliability Engineering: Experience designing or operating observability systems (logging, monitoring, alerting) and applying SRE principles such as SLOs, incident management, postmortems, and reliability engineering practices.
- Technical Oversight & Engineering Quality: Ability to review and critique system design and production code, ensuring engineering quality across backend systems and infrastructure components.
- Communication & Leadership Influence: Ability to influence technical direction, communicate trade-offs to stakeholders, and drive alignment across product and engineering teams on reliability and platform priorities.
Why Milestone?
Milestone offers not only great benefits but also a great culture. Employees here have flexible work environments, opportunities for further education, and the ability to effect change in our organization directly.
The annual salary for this position ranges from $160,000 to $180,000. Pay is based on the level, location, complexity, responsibility, and job duties of the specific position and is just one component of Milestone's total compensation package. Additionally, we offer an attractive benefits package that includes medical/dental benefits, FSA or HSA, 401(k) with 6% Safe Harbor employer match, paid parental leave, generous PTO (20 days of vacation, 10 days of paid sick time, and 12 company holidays), a fully paid short-term disability policy, a fully paid long-term disability policy, and life insurance. If you are selected for an interview, please feel welcome to speak to our Talent Partner about our compensation philosophy.
All employees must complete a background check. Employees in fiscal roles are also required to undergo a credit check. All information obtained during these checks is handled confidentially and shared only with authorized personnel.
Milestone is committed to creating a diverse and inclusive workplace and is proud to be an equal opportunity employer.
Contact and application
Please apply at our website: www.milestonesys.com
We look forward to receiving your application.