The Site Reliability Engineer ensures operational excellence, reliability, and uptime of production systems through monitoring, incident response, and automation. Responsibilities include defining SLAs, leading incident management, implementing observability, and collaborating on system designs.
Who we are!
- Speer Technologies is a dynamic technology hub based in Toronto, partnered with some of the largest technology incubators in the Greater Toronto Area. We are a team of passionate innovators and open-minded thinkers, dedicated to building groundbreaking technologies. Our products are on the path to receiving FDA and ADA approvals or provisional patents, with partnerships spanning Italy, Germany, California, and France.
- As a startup, we thrive on creativity, collaboration, and the drive to push boundaries. Our fast-paced environment offers exposure to a variety of programming languages, software, and work environments, ensuring a rich learning experience. We provide ample opportunities for personal and professional growth, all while fostering an inclusive and barrier-free workplace.
- Speer is an equal opportunity employer and is committed to providing an inclusive and barrier-free recruitment process. We will accommodate the needs of applicants under the Ontario Human Rights Code and the Accessibility for Ontarians with Disabilities Act (AODA) throughout all stages of the recruitment and selection process.
- Please advise Speer of any accommodations you may require to ensure your equal participation in the recruitment and selection process. Information received relating to accommodation measures will be addressed confidentially.
- Growth Opportunities: We offer the chance to grow with the company and take on new responsibilities as we expand.
- Dynamic Environment: Our fast-paced startup environment ensures no two days are the same.
- Innovation: Be part of a team that's pushing the boundaries of technology and making a real impact.
- Inclusive Workplace: We are committed to creating an inclusive environment where all employees can thrive.
The Site Reliability Engineer (SRE) is responsible for ensuring the availability, reliability, and operational excellence of production systems. This role bridges infrastructure engineering and operations by applying software engineering principles to infrastructure, monitoring, incident response, and continuous improvement. The SRE ensures systems meet defined uptime, performance, and resiliency targets as they scale.
Key Responsibilities
Reliability & Availability
- Define and enforce SLAs, SLOs, and error budgets for infrastructure and applications.
- Design and maintain monitoring, alerting, and health checks across network, hardware, and application layers.
- Identify reliability risks and implement preventative controls.
Incident Management
- Lead or participate in incident response and on-call rotations.
- Reduce MTTR (Mean Time to Recovery) through automation, runbooks, and alert tuning.
- Conduct post-incident reviews and drive corrective actions.
Observability & Monitoring
- Implement centralized logging, metrics, and tracing.
- Monitor system health including connectivity, latency, hardware status, and application availability.
- Ensure alerts are actionable and aligned to business impact.
Automation & Tooling
- Automate repetitive operational tasks.
- Build self-healing mechanisms where possible.
- Improve deployment and rollback reliability.
Collaboration & Architecture
- Partner with Infrastructure Architects to validate resiliency and failover designs.
- Support Systems Implementation Engineers during go-lives and major changes.
- Provide operational feedback to improve future designs.
Documentation & Operational Readiness
- Create and maintain runbooks, escalation paths, and recovery procedures.
- Ensure operational readiness before systems enter production.
- Continuously improve reliability through data-driven analysis.
- Experience supporting production infrastructure with uptime or SLA requirements.
- Strong understanding of networking concepts (latency, packet loss, redundancy).
- Experience with monitoring and observability tools.
- Familiarity with incident management and on-call practices.
- Ability to automate operational workflows.
- Strong troubleshooting and root-cause analysis skills.
- Clear written and verbal communication.
- Experience supporting multi-site or distributed systems.
- Exposure to IoT, access control, cameras, or physical infrastructure.
- Experience in regulated or compliance-heavy environments.
- Background in systems engineering or infrastructure architecture.
- Fluency in French is an asset.
- Job Type: Full-Time
- Pay: $70,000–$120,000 a year
- Language requirement: French is not required
- Schedule: Monday to Friday
- Dental care
- Paid time off
- Vision care
- Wellness program
Top Skills
Automation
Monitoring Tools
Observability Tools
Speer Toronto, Ontario, CAN Office
379 Shuter St, Toronto, ON, Canada, M5A 1X3