Speer

Site Reliability Engineer

Reposted 14 Days Ago

In-Office

Toronto, ON

Mid level

The Site Reliability Engineer ensures operational excellence, reliability, and uptime of production systems through monitoring, incident response, and automation. Responsibilities include defining SLAs, leading incident management, implementing observability, and collaborating on system designs.

Who we are!

Speer Technologies is a dynamic technology hub based in Toronto, partnered with some of the largest technology incubators in the Greater Toronto Area. We are a team of passionate innovators and open-minded thinkers, dedicated to building groundbreaking technologies. Our products are on the path to receiving FDA and ADA approvals or provisional patents, with partnerships spanning Italy, Germany, California, and France.
As a startup, we thrive on creativity, collaboration, and the drive to push boundaries. Our fast-paced environment offers exposure to a variety of programming languages, software, and work environments, ensuring a rich learning experience. We provide ample opportunities for personal and professional growth, all while fostering an inclusive and barrier-free workplace.
Speer is an equal opportunity employer and is committed to providing an inclusive and barrier-free recruitment process. We will accommodate the needs of applicants under the Ontario Human Rights Code and the Accessibility for Ontarians with Disabilities Act (AODA) throughout all stages of the recruitment and selection process.
Please advise Speer of any accommodations you may require to ensure your equal participation in the recruitment and selection process. Information received relating to accommodation measures will be addressed confidentially.

Why Speer Technologies?

Growth Opportunities: We offer the chance to grow with the company and take on new responsibilities as we expand.
Dynamic Environment: Our fast-paced startup environment ensures no two days are the same.
Innovation: Be part of a team that's pushing the boundaries of technology and making a real impact.
Inclusive Workplace: We are committed to creating an inclusive environment where all employees can thrive.

Role Summary

The Site Reliability Engineer (SRE) is responsible for ensuring the availability, reliability, and operational excellence of production systems. This role bridges infrastructure engineering and operations by applying software engineering principles to infrastructure, monitoring, incident response, and continuous improvement. The SRE ensures systems meet defined uptime, performance, and resiliency targets as they scale.

Key Responsibilities

Reliability & Availability

Define and enforce SLAs, SLOs, and error budgets for infrastructure and applications.
Design and maintain monitoring, alerting, and health checks across network, hardware, and application layers.
Identify reliability risks and implement preventative controls.

Incident Management

Lead or participate in incident response and on-call rotations.
Reduce MTTR (Mean Time to Recovery) through automation, runbooks, and alert tuning.
Conduct post-incident reviews and drive corrective actions.

Observability & Monitoring

Implement centralized logging, metrics, and tracing.
Monitor system health including connectivity, latency, hardware status, and application availability.
Ensure alerts are actionable and aligned to business impact.

Automation & Tooling

Automate repetitive operational tasks.
Build self-healing mechanisms where possible.
Improve deployment and rollback reliability.

Collaboration & Architecture

Partner with Infrastructure Architects to validate resiliency and failover designs.
Support Systems Implementation Engineers during go-lives and major changes.
Provide operational feedback to improve future designs.

Documentation & Operational Readiness

Create and maintain runbooks, escalation paths, and recovery procedures.
Ensure operational readiness before systems enter production.
Continuously improve reliability through data-driven analysis.

Required Skills & Experience

Experience supporting production infrastructure with uptime or SLA requirements.
Strong understanding of networking concepts (latency, packet loss, redundancy).
Experience with monitoring and observability tools.
Familiarity with incident management and on-call practices.
Ability to automate operational workflows.
Strong troubleshooting and root-cause analysis skills.
Clear written and verbal communication.

Nice to Have

Experience supporting multi-site or distributed systems.
Exposure to IoT, access control, cameras, or physical infrastructure.
Experience in regulated or compliance-heavy environments.
Background in systems engineering or infrastructure architecture.
Fluency in French is an asset.

Job Details

Job Type: Full-Time
Pay: $70,000–$120,000 a year
Flexible language requirement: French not required
Schedule: Monday to Friday

Benefits

Dental care
Paid time off
Vision care
Wellness program

Top Skills

Automation

Monitoring Tools

Observability Tools

379 Shuter St, Toronto, ON, Canada, M5A 1X3
