Citi

Site Reliability Engineering Lead

Reposted Yesterday
In-Office
Mississauga, ON, CAN
Senior level
Lead a team to ensure the stability and reliability of AI and DevOps platforms by improving operational efficiencies, incident management, and collaborating with development teams. Assist in capacity management and automation initiatives, and oversee production platform health.
The summary above was generated by AI
We are seeking an experienced and motivated team member to support our AI and DevOps Platform Support team in North America. This role is responsible for contributing to the stability, reliability, and performance of our critical AI and DevOps platforms. The team supports a wide range of services, including multiple AI applications, developer tools, and CI/CD pipeline technologies used across the organization. The ideal candidate will help lead a team of SRE and Support engineers, facilitate incident and problem resolution, and collaborate with engineering and development teams to enhance platform services and supportability. The role includes short‑term planning and coordination of actions and resources within the team.

Responsibilities
• Demonstrates a strong understanding of how application support contributes to the overall technology function and organizational objectives.
• Assist with vendor relationship management, including coordination with offshore managed services.
• Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices.
• Partner with development teams to guide improvements in application stability and supportability.
• Contribute to frameworks for managing capacity, throughput, and latency.
• Assist in defining and implementing application onboarding guidelines and standards.
• Support team members by fostering a collaborative environment and encouraging skill development.
• Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training.
• Participate in business review meetings to help align technology tools and strategies with business requirements.
• Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program.
• Perform other duties and functions as assigned.
• Support platform leadership in defining the platform roadmap and partnering with engineering teams and business stakeholders.
• Assist in executing resilience activities such as wargaming scenarios, chaos engineering tests, and disaster recovery drills.
• Contribute to automation initiatives aimed at reducing manual toil and improving platform efficiency.
• Support the enterprise‑wide observability strategy, including monitoring, logging, tracing, and alerting.
• Maintain hands‑on familiarity with platform architecture and services as needed for operational support.
• Assist in overseeing the operational health of production platforms (including OpenShift, ECS, CI/CD), ensuring SLAs are supported and incident processes are followed.
• Help implement and operate effective monitoring and observability strategies to support proactive issue detection and system health assessments.

Qualifications
• 6+ years of relevant experience in a hands‑on technical or support leadership role.
• Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability.
• Experience working with senior stakeholders or technology partners.
• Demonstrated experience supporting IT service improvements or platform stability initiatives.
• Strong communication and presentation skills, with the ability to convey technical concepts clearly.
• Experience supporting or contributing to technical roadmaps or operational workstreams.
• Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing.
• Ability to collaborate with cross‑functional support teams and technology groups.
• Strong organizational and workload‑planning skills.
• Consistently demonstrates clear and concise written and verbal communication skills.
• Ability to communicate appropriately with relevant stakeholders.
• Working knowledge of Generative AI concepts preferred.
• Experience with CI/CD and configuration management tools preferred.
• Experience with Red Hat OpenShift or similar Kubernetes technologies preferred.
• Experience working with databases such as Postgres, Oracle, MongoDB, or Redis preferred.
• Experience writing or maintaining code in Java, Python, Go, or similar languages preferred.
• Hands‑on experience with modern observability and monitoring tools (e.g., Prometheus, Grafana, Splunk, ELK) preferred.

Education
• Bachelor’s/University degree required; Master’s degree preferred.

------------------------------------------------------

Job Family Group:

Technology

------------------------------------------------------

Job Family:

Applications Support

------------------------------------------------------

Time Type:

Full time

------------------------------------------------------

Primary Location Full Time Salary Range:

$120,800.00 - $170,800.00

------------------------------------------------------

Most Relevant Skills

Please see the requirements listed above.

------------------------------------------------------

Other Relevant Skills

For complementary skills, please see above and/or contact the recruiter.

------------------------------------------------------

Automated Processing and AI

We use automated processing, including artificial intelligence, for our legitimate business interests (or our reasonable and appropriate business purposes) to identify and align the candidate's skills and abilities with a specific job opening. Additionally, if you so choose, or consent, we can match your skills and abilities to other suitable roles at Citi.

Importantly, all our hiring processes and decisions, including determining your suitability for a role, are conducted, checked, and decided by individuals. Our automated processing and AI do not involve relying on automatic or autonomous decision-making. Please refer to any Jurisdictional Considerations, with specific provisions for your country (where relevant) for further details.

------------------------------------------------------

This job opening is for an existing job vacancy.

------------------------------------------------------

Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.

If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.
View Citi’s EEO Policy Statement and the Know Your Rights poster.

Top Skills

AI Applications
CI/CD
DevOps
ECS
ELK
Go
Grafana
Java
MongoDB
OpenShift
Oracle
Postgres
Prometheus
Python
Redis
Splunk
