TechInsights

Site Reliability Engineer/Developer

Posted Yesterday

In-Office

Ottawa, ON

Senior level

The Site Reliability Engineer/Developer designs and maintains scalable cloud infrastructure, ensures system reliability, and develops automation tools for TechInsights' applications.

OUR STORY
TechInsights is the information platform for the semiconductor industry.
Regarded as the most trusted source of actionable, in-depth intelligence related to semiconductor innovation and surrounding markets, TechInsights’ content informs decision makers and professionals whose success depends on accurate knowledge of the semiconductor industry—past, present, or future.
Over 650 companies and 150,000 users access the TechInsights Platform, the world’s largest vertically integrated collection of unmatched reverse engineering, teardown, and market analysis in the semiconductor industry. This collection includes detailed circuit analysis, imagery, semiconductor process flows, device teardowns, illustrations, costing and pricing information, forecasts, market analysis, and expert commentary. TechInsights’ customers include the most successful technology companies, which rely on TechInsights’ analysis to make informed business, design, and product decisions faster and with greater confidence. For more information, visit www.techinsights.com.
WHY WORK WITH US

Company-sponsored training and development opportunities
Comprehensive benefits package (health, dental, vision, wellness, RRSP Matching, annual fitness reimbursement)
Flexible vacation policy
Bring your own device program
Community involvement opportunities through charitable alliances: https://www.techinsights.com/community-involvement
Wellness resources and support
Inclusive environment that prioritizes diversity, equity, and accessibility
High-growth company driven by high performance
Expected salary range: $109,600 – $116,100 CAD

THE OPPORTUNITY
The Site Reliability Engineer/Developer is responsible for designing, implementing, and maintaining the reliable, scalable cloud infrastructure that powers TechInsights' semiconductor intelligence applications. This role sits at the intersection of software engineering and systems operations: building automation tools, establishing infrastructure patterns, and ensuring production environments consistently meet availability and performance standards across a multi-region AWS environment.
Working within the cloud operations team, the Site Reliability Engineer/Developer brings advanced technical expertise to complex infrastructure challenges, applies site reliability engineering best practices, and partners with development teams to enable efficient, reliable software delivery at scale. This is a role for an engineer who can independently drive solutions to complex problems with a meaningful impact on operational and service-delivery outcomes.
The ideal candidate brings exceptional depth and breadth in reliability engineering and cloud infrastructure. They are equally adept at interpreting business and technical challenges, recommending improvements to infrastructure and processes, and delivering innovative solutions that raise the bar for operational excellence across the organization.
WHAT YOU’LL DO

Design, implement, and maintain highly available, scalable infrastructure systems across multi-region AWS deployments, ensuring production environments consistently meet availability and performance requirements.
Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) in collaboration with development teams, using metrics to quantify and continuously improve system reliability.
Monitor system performance, availability, and resource utilization using CloudWatch, Datadog, and Prometheus, proactively identifying optimization opportunities and conducting root cause analysis for outages and degradations.
Implement capacity planning strategies using historical data analysis and growth projections to ensure infrastructure scales ahead of demand, balanced against cost optimization using AWS Cost Explorer and Kubecost.
Create comprehensive infrastructure-as-code solutions using Terraform and GitOps methodologies to manage AWS resources consistently, securely, and repeatably.
Develop and maintain CI/CD pipelines using Jenkins, GitLab CI, or GitHub Actions to automate deployment processes with built-in testing and validation.
Implement and maintain containerization platforms using Docker and Kubernetes, establishing standards for container orchestration, cluster management, and reusable infrastructure patterns.
Build automation tools and scripts in Python, Go, or Java to eliminate manual operational tasks, reduce toil, and automate routine maintenance procedures including patching, backups, and resource cleanup.
Lead incident response for critical system outages and performance issues, coordinating cross-functional teams to diagnose and resolve problems with speed and precision.
Implement comprehensive observability solutions — including logging, monitoring, distributed tracing, and intelligent alerting via Grafana and PagerDuty — that ensure rapid response to genuine issues while minimizing alert fatigue.
Conduct blameless post-mortems and thorough post-incident reviews, documenting lessons learned and driving implementation of preventive measures and updated runbooks.
Develop and maintain disaster recovery procedures and business continuity plans, including regular testing, and collaborate with Security and Compliance teams to ensure monitoring systems meet audit and regulatory requirements.

WHAT YOU’LL BRING

Technical Requirements

Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience
5–7 years of experience in Site Reliability Engineering, DevOps, or cloud operations
Strong AWS expertise (EC2, ECS/EKS, RDS, S3, Lambda, VPC) and experience with hybrid cloud environments
Proficiency in Python, Go, or Java; experience with Docker, Kubernetes, and container orchestration
Expertise in infrastructure-as-code (Terraform, Ansible, CloudFormation) and CI/CD pipeline development
Experience with observability tools (Prometheus, Grafana, Datadog, CloudWatch, PagerDuty)
Solid foundation in Linux/Unix administration, networking, security, and database systems

Professional Skills

Independently solves complex problems and drives innovative infrastructure solutions with minimal guidance
Translates business challenges into infrastructure and process improvements
Communicates technical concepts effectively across technical and non-technical audiences
Leads projects and mentors junior engineers

Preferred Qualifications

Experience in semiconductor or technology industry environments
AWS certifications (Solutions Architect, DevOps Engineer) or Kubernetes certifications (CKA, CKAD)
Experience with microservices architecture and distributed systems design
Knowledge of security frameworks and compliance requirements (SOC 2, ISO 27001)
Experience with database administration, performance tuning, and Agile/Scrum methodologies
Familiarity with service mesh technologies (Istio, Linkerd)
Contributions to open-source infrastructure projects

As part of the recruitment process for this position, you will be required to submit your latest citizenship and/or permanent residency information. This information will be used to comply with U.S. Export Control Laws and Regulations.
WORKING ARRANGEMENT

This is a remote position for candidates based in Canada
Occasional travel may be required

Top Skills

AWS, CloudWatch, Datadog, Docker, GitHub Actions, GitLab CI, Grafana, Java, Jenkins, Kubernetes, Prometheus, Python, Terraform
