The Site Reliability Engineer ensures operational excellence, reliability, and uptime of production systems through monitoring, incident response, and automation. Responsibilities include defining SLAs, leading incident management, implementing observability, and collaborating on system designs.
Who we are!
- Speer Technologies is a dynamic technology hub based in Toronto, partnered with some of the largest technology incubators in the Greater Toronto Area. We are a team of passionate innovators and open-minded thinkers, dedicated to building groundbreaking technologies. Our products are on the path to receiving FDA and ADA approvals or provisional patents, with partnerships spanning Italy, Germany, California, and France.
- As a startup, we thrive on creativity, collaboration, and the drive to push boundaries. Our fast-paced environment offers exposure to a variety of programming languages, software, and work environments, ensuring a rich learning experience. We provide ample opportunities for personal and professional growth, all while fostering an inclusive and barrier-free workplace.
- Speer is an equal opportunity employer and is committed to providing an inclusive and barrier-free recruitment process. We will accommodate the needs of applicants under the Ontario Human Rights Code and the Accessibility for Ontarians with Disabilities Act (AODA) throughout all stages of the recruitment and selection process.
- Please advise Speer of any accommodations you may require to ensure your equal participation in the recruitment and selection process. Information received relating to accommodation measures will be addressed confidentially.
- Growth Opportunities: We offer the chance to grow with the company and take on new responsibilities as we expand.
- Dynamic Environment: Our fast-paced startup environment ensures no two days are the same.
- Innovation: Be part of a team that's pushing the boundaries of technology and making a real impact.
- Inclusive Workplace: We are committed to creating an inclusive environment where all employees can thrive.
The Site Reliability Engineer (SRE) is responsible for ensuring the availability, reliability, and operational excellence of production systems. This role bridges infrastructure engineering and operations by applying software engineering principles to infrastructure, monitoring, incident response, and continuous improvement. The SRE ensures systems meet defined uptime, performance, and resiliency targets as they scale.
Key Responsibilities
Reliability & Availability
- Define and enforce SLAs, SLOs, and error budgets for infrastructure and applications.
- Design and maintain monitoring, alerting, and health checks across network, hardware, and application layers.
- Identify reliability risks and implement preventative controls.
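For candidates less familiar with error budgets, the arithmetic behind an availability SLO can be sketched as follows. This is a minimal illustration, not a description of Speer's tooling: the function names, the 30-day window, and the 99.9% target are all assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO permits over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about 75% of the budget remains.
    print(round(budget_remaining(0.999, 10.8), 2))
```

A team would typically slow risky releases as the remaining fraction approaches zero, though the exact policy varies by organization.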
Incident Management
- Lead or participate in incident response and on-call rotations.
- Reduce MTTR (Mean Time to Recovery) through automation, runbooks, and alert tuning.
- Conduct post-incident reviews and drive corrective actions.
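As a rough illustration of the MTTR metric mentioned above, the mean can be computed from incident detection and resolution timestamps. The sketch below assumes a hypothetical list of (detected, resolved) pairs; real incident data would come from whatever incident tracker is in use.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the recovery duration over (detected_at, resolved_at) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    for detected_at, resolved_at in incidents:
        if resolved_at < detected_at:
            raise ValueError("resolved_at must not precede detected_at")
    total = sum(
        (resolved_at - detected_at for detected_at, resolved_at in incidents),
        timedelta(),
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents lasting 30 and 90 minutes.
    sample = [
        (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
        (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 30)),
    ]
    print(mean_time_to_recovery(sample))
```

Tracking this value per month or per service is one common way to check whether automation, runbooks, and alert tuning are actually shortening recovery.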
Observability & Monitoring
- Implement centralized logging, metrics, and tracing.
- Monitor system health including connectivity, latency, hardware status, and application availability.
- Ensure alerts are actionable and aligned to business impact.
Automation & Tooling
- Automate repetitive operational tasks.
- Build self-healing mechanisms where possible.
- Improve deployment and rollback reliability.
Collaboration & Architecture
- Partner with Infrastructure Architects to validate resiliency and failover designs.
- Support Systems Implementation Engineers during go-lives and major changes.
- Provide operational feedback to improve future designs.
Documentation & Operational Readiness
- Create and maintain runbooks, escalation paths, and recovery procedures.
- Ensure operational readiness before systems enter production.
- Continuously improve reliability through data-driven analysis.
Qualifications
- Experience supporting production infrastructure with uptime or SLA requirements.
- Strong understanding of networking concepts (latency, packet loss, redundancy).
- Experience with monitoring and observability tools.
- Familiarity with incident management and on-call practices.
- Ability to automate operational workflows.
- Strong troubleshooting and root-cause analysis skills.
- Clear written and verbal communication.
- Experience supporting multi-site or distributed systems.
- Exposure to IoT, access control, cameras, or physical infrastructure.
- Experience in regulated or compliance-heavy environments.
- Background in systems engineering or infrastructure architecture.
- Fluency in French is an asset.
- Job Type: Full-Time
- Pay: $70,000–$120,000 a year
- Flexible language requirement: French not required
- Schedule: 8-hour shift, Monday to Friday, overtime
- Dental care
- Paid time off
- Vision care
- Wellness program
Top Skills
Automation
Monitoring Tools
Observability Tools
Speer Toronto, Ontario, CAN Office
379 Shuter St, Toronto, ON, Canada, M5A 1X3


