IKO North America

Infrastructure Reliability Engineer

Posted 2 Days Ago
In-Office
Mississauga, ON, CAN
Senior level
The IT Infrastructure Reliability Engineer ensures enterprise technology systems' availability and performance, focusing on monitoring, observability, application performance, and change management. This role involves collaborating on architecture and governance decisions while mentoring junior engineers and maintaining operational runbooks.
The summary above was generated by AI

IKO Industries Ltd. is a market leader in the manufacturing of roofing and building materials. IKO is a Canadian owned and operated business with production facilities worldwide and has many years of unparalleled success in the roofing materials industry. Quality, integrity, and trustworthiness are the values that underlie this success, and we have built this company by hiring people who hold these values. People like you!
 

Job Description

IT Infrastructure Reliability Engineer

Department: Infrastructure & Operations
Reports To: Global Director, Infrastructure & Operations
Employment Type: Full-Time
Location: On-Site
Compensation: $110,000–$129,000

Position Summary

The IT Infrastructure Reliability Engineer plays a critical role in ensuring the availability, performance, and resilience of enterprise technology systems across a complex, globally distributed environment. Reporting to the Global Director of Infrastructure and Operations, this individual will serve as a subject matter expert in observability, monitoring, alerting, and application performance, while actively contributing to governance and architectural decisions through membership on the Architecture Review Board.

Key Responsibilities

Monitoring, Observability & Alerting

  • Design, implement, and maintain comprehensive monitoring solutions across on-premises, cloud, and hybrid infrastructure environments.
  • Develop observability frameworks leveraging metrics, logs, and distributed tracing to provide end-to-end visibility into system health and performance.
  • Define and manage alerting thresholds, escalation policies, and on-call runbooks to enable rapid incident detection and response.
  • Continuously evaluate and improve monitoring tooling (e.g., SolarWinds, Prometheus, Grafana, Splunk, Dynatrace) to align with organizational needs.
  • Establish SLOs, SLIs, and error budgets to measure and communicate reliability targets to business and technical stakeholders.

Application Performance Monitoring (APM)

  • Lead the deployment and optimization of APM tools to monitor application response times, throughput, error rates, and resource utilization.
  • Collaborate with development teams to instrument applications where applicable and integrate performance monitoring into development pipelines.
  • Conduct proactive performance analysis to identify bottlenecks, regressions, and optimization opportunities before they impact end users.
  • Develop dashboards and reports that surface actionable insights for engineering, operations, and leadership teams.
  • Participate in post-incident reviews to identify root causes and drive improvements to application reliability and observability.

Change Management Coordination

  • Serve as a technical liaison in the Change Advisory Board (CAB) process, evaluating infrastructure and platform changes for reliability risk.
  • Evaluate and improve change management standards, including pre-change testing, rollback planning, and post-change validation procedures.
  • Coordinate scheduled maintenance windows and communicate impact assessments to stakeholders and service owners.
  • Maintain change records and audit trails in the ITSM platform (ServiceNow) to support compliance and reporting.
  • Champion a culture of disciplined, risk-aware change practices across the I&O team.

Architecture Review Board (ARB) Membership

  • Participate as a standing member of the Architecture Review Board, providing reliability, observability, and operational readiness input on proposed solutions.
  • Review and assess new infrastructure designs, cloud services, and technology platforms for alignment with reliability engineering standards.
  • Contribute to the development and maintenance of architecture principles, infrastructure reference architectures, and technology standards.
  • Work cross-functionally with Enterprise Architects, Security, and Development teams to ensure new capabilities are designed for operability and resilience.
  • Document ARB decisions and provide post-implementation feedback loops to inform future architectural guidance.

Additional Responsibilities

  • Develop and maintain infrastructure-as-code (IaC) for monitoring configurations, ensuring consistency and version control.
  • Support capacity planning efforts by analyzing trends in resource consumption and forecasting future infrastructure requirements.
  • Mentor junior engineers in reliability engineering principles, tooling, and best practices.
  • Contribute to the development of disaster recovery and business continuity plans, including regular DR testing.
  • Maintain up-to-date documentation for all monitoring, alerting, and operational runbooks.

Qualifications

Required

  • 5+ years of experience in IT infrastructure, site reliability engineering (SRE), or a related operations role.
  • Demonstrated expertise in monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, Dynatrace, New Relic, or Splunk).
  • Solid understanding of APM concepts and hands-on experience instrumenting applications in enterprise environments.
  • Experience with ITSM and change management processes (ITIL certification preferred).
  • Proficiency with cloud platforms (AWS, Azure, GCP, OCI) and hybrid infrastructure architectures.
  • Familiarity with containerization and orchestration technologies (Docker, Kubernetes).
  • Experience with scripting or automation languages (Python, PowerShell…) and infrastructure-as-code tools (Ansible, Terraform).
  • Strong communication skills with the ability to convey complex technical information to both technical and non-technical audiences.

Preferred

  • Experience in a formal Site Reliability Engineering (SRE) function with ownership of SLOs and error budgets.
  • Background in enterprise architecture governance or participation in architecture review processes.
  • Certifications such as AWS Solutions Architect, Google Professional Cloud Architect, ITIL v4, or CKA/CKAD.
  • Familiarity with observability frameworks such as OpenTelemetry.
  • Experience in regulated industries with compliance-driven change controls.

Core Competencies

Technical Excellence

  • Deep infrastructure expertise
  • Systems‑level thinking
  • Automation‑first mindset
  • Security and compliance awareness

Collaboration & Influence

  • Cross‑functional partnership
  • Stakeholder communication
  • Architecture governance participation
  • Mentorship and knowledge sharing

Operational Mindset

  • Reliability and availability focus
  • Incident ownership
  • Continuous improvement
  • Risk‑aware change management

Working Conditions

This role may require participation in an on‑call rotation and availability outside standard business hours for critical incidents. Occasional travel may be required to support multi‑site operations.

Benefits of Employment: IKO recognizes that its success is due to the strength of its employees. A primary goal of IKO is to promote each employee's sense of accomplishment and contribution so that employees enjoy their association with IKO. The Company invests in its employees so that they are the most knowledgeable in the industry, and makes great efforts to nurture loyalty to, and teamwork at, IKO. We are pleased to offer competitive compensation, health care, a progressive and challenging workplace, and a commitment to teamwork and integrity.
 

Diversity and Equal Opportunity Employment: IKO Industries Ltd. is an equal opportunity employer. We are committed to diversity and inclusion and are pleased to consider all qualified applicants for employment without regard to race, religion, creed, color, national origin, age, gender, sexual orientation, marital status, veteran status or disability. IKO Industries Ltd. encourages and welcomes applications from people with disabilities. Accommodations are available on request for candidates taking part in all aspects of the selection process.

Top Skills

Ansible
APM Tools
AWS
Azure
Datadog
Docker
Dynatrace
GCP
Grafana
ITSM
Kubernetes
Monitoring Solutions
Observability Frameworks
OCI
OpenTelemetry
PowerShell
Prometheus
Python
ServiceNow
SolarWinds
Splunk
Terraform

IKO North America Brampton, Ontario, CAN Office

71 Orenda Rd, Brampton, ON, Canada, L6W 1V8
