BuildOps

Site Reliability Engineer

Posted 20 Days Ago

Hybrid

Toronto, ON

Mid level

Improve and protect production reliability and performance by implementing SRE practices (SLIs/SLOs, error budgets), building observability, evolving AWS infrastructure with Terraform, contributing automation and code, participating in incident response, and documenting runbooks and standards across teams.

The summary above was generated by AI

At BuildOps, we’re building a software platform that empowers today’s commercial contractors. From service management to project execution, we’re reimagining how our customers operate. Our team thrives on ambition, innovation, and collaboration – qualities we look for in every new hire.

You will join our cloud infrastructure and reliability engineering team as a Site Reliability Engineer (SRE). Your primary responsibility will be to improve and protect the reliability, performance, and operability of our production systems while helping evolve our AWS-based infrastructure. We’re looking for someone with a strong SRE mindset, solid software engineering fundamentals, and deep observability expertise who can work effectively in a distributed team environment.

Reporting to the DevOps and SRE Manager, this is a hands-on role where you will influence reliability strategy, build tooling and automation, and contribute directly to day-to-day operations in a fast-moving, industry-defining company.

What You’ll Do

Drive and refine modern SRE practices across services, including SLIs/SLOs, error budgets, and reliability reviews
Design and maintain end-to-end observability (metrics, logs, traces, dashboards, and alerts) so teams can quickly detect, debug, and prevent issues
Partner with product and engineering teams to design reliable services—reviewing architectures, failure modes, rollout strategies, and capacity/latency considerations
Help evolve and operate our AWS infrastructure (networking, compute, data stores) using Infrastructure as Code (Terraform)
Contribute code to services, tooling, and automation (for example, reliability libraries, deployment and incident tooling, health checks)
Define, implement, and iterate on SLIs, SLOs, and error budgets with service owners, and use them to guide reliability work and release decisions
Participate in incident response for infrastructure-related production issues, including learning-focused post-incident reviews and follow-through on action items
Develop runbooks, safeguards, and automation that reduce manual work, improve time-to-diagnosis, and standardize responses to recurring scenarios
Advocate for and implement security and compliance best practices in production environments
Document standards, playbooks, and best practices so reliability improvements scale across teams
Collaborate closely with software engineers, product managers, and other stakeholders to plan and deliver reliability-focused initiatives

What We Look For

3+ years of professional experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering, working on production systems and reliability-focused initiatives

Thorough understanding of and hands-on experience with modern SRE practices, such as:

- Defining and implementing SLIs/SLOs and error budgets
- Reducing toil through automation
- Safe deployment and rollout patterns
- Structured post-incident reviews and continuous improvement
Some software engineering experience required: you’ve written and maintained production-quality code and can work comfortably in at least one modern language (for example, Python or Node.js/TypeScript)
Interest in using LLMs to assist with your work, and at least some experience doing so

Strong observability skills:

- Designing metrics, logging, and tracing for multi-service systems
- Building actionable dashboards and alerts with clear runbooks
- Correlating metrics, logs, and traces to debug complex issues
Experience with tools such as Datadog, Prometheus, Grafana, Honeycomb, or New Relic (we use Datadog, but vendor-agnostic experience is welcome)
Experience working with AWS in production and with core platform primitives such as Terraform-based Infrastructure as Code and container/orchestration platforms (for example, Docker with ECS, EKS, or Kubernetes)

Incident management experience is a strong plus, including:

- Participating in or coordinating incident response
- Working within an incident management tool (for example, incident.io, PagerDuty, Opsgenie, or similar)
- Helping teams implement durable, high-leverage follow-ups
Strong communication skills and the ability to explain complex technical topics to both technical and non-technical audiences
CS degree or equivalent experience running production systems; we are equally interested in people from non-traditional backgrounds who have spent time operating real-world environments
Ability to work a hybrid schedule – Monday/Friday WFH; Tuesday–Thursday in-office

Compensation

$116,000 - $150,000 CAD base salary range + annual bonus

What we offer:

Generous equity grant: become an owner in our company!
MacBook computer provided
A comprehensive benefits package
Flexible PTO and hybrid work schedules
Work from home stipend
Hubs in Los Angeles, San Francisco, Toronto, and Raleigh with hybrid work schedules and lunch provided for in-office days
Company events like BBQs and team-building activities, both in-person and virtual
Fast-paced, collaborative, and dynamic work environment
Opportunities for growth and career advancement
Chance to work with cutting-edge technology and innovative solutions
The chance to get in on the ground floor and build something truly groundbreaking for ourselves and our amazing customers

We welcome applicants from across the U.S. where we are registered to do business and able to support employment. Currently, this excludes the following states: Alaska, Hawaii, Kentucky, Mississippi, Nebraska, New Mexico, North Dakota, Rhode Island, South Dakota, West Virginia, and Wyoming. This list is based solely on operational and compliance considerations and is reviewed from time to time as our footprint grows.

About BuildOps

Join BuildOps, the largest commercial trade platform in the country, as we transform the multi-billion dollar commercial contracting industry!

We’re not just talking incremental improvements; we’re talking a full-scale revolution, empowering the hardworking heroes who build and maintain the infrastructure that keeps our world running.

This is your chance to be part of a rocketship. We’re fresh off a $1 billion valuation and a $127M Series C funding round (part of over $275M raised to date) led by industry-leading investors like Meritech Capital, BOND, and SE Ventures, backed by Schneider Electric (Reuters, TechCrunch, LA Business Journal). Our latest investors join our team of industry heavyweights like Next47, former Twitter CEO Dick Costolo, former Salesforce President Gavin Patterson, and Boost Mobile CEO Stephen Stokols. Their investment is fueling our aggressive growth and our commitment to equipping contractors with AI-driven tools to conquer chaos, boost efficiency, skyrocket profitability, and, ultimately, deliver exceptional service.

At BuildOps, we’re changing the game and doing the best work of our careers. You’ll be a key player in a company that’s truly making a difference for the backbone of our economy. If you’re ready to tackle big challenges, work with a passionate team, and build something extraordinary, BuildOps is the place for you. 🚀

BuildOps is an equal opportunity employer. We consider all qualified applicants without regard to race, color, religion, sex (including pregnancy, gender identity, and sexual orientation), national origin, age, disability, genetic information, veteran status, or any other status protected by applicable federal, state, or local law.

BuildOps will consider qualified applicants with a criminal history pursuant to the California Fair Chance Act and applicable local and state laws.

Top Skills

AWS, Datadog, Docker, ECS, EKS, Grafana, Honeycomb, incident.io, Kubernetes, LLMs, New Relic, Node.js, Opsgenie, PagerDuty, Prometheus, Python, Terraform, TypeScript

325 Front Street W, Toronto, Ontario, Canada, M5V 2Y1
