Citi Jobs

Site Reliability Engineering Lead

Citi

Reposted 7 Days Ago

In-Office

Mississauga, ON, CAN

Senior level

Lead a team that ensures the stability and reliability of AI and DevOps platforms by improving operational efficiency, strengthening incident management, and collaborating with development teams. Assist with capacity management and automation initiatives, and oversee production platform health.

The summary above was generated by AI

We are seeking an experienced and motivated team member to support our AI and DevOps Platform Support team in North America. This role is responsible for contributing to the stability, reliability, and performance of our critical AI and DevOps platforms. The team supports a wide range of services, including multiple AI applications, developer tools, and CI/CD pipeline technologies used across the organization. The ideal candidate will help lead a team of SRE and Support engineers, facilitate incident and problem resolution, and collaborate with engineering and development teams to enhance platform services and supportability. The role includes short‑term planning and coordination of actions and resources within the team.

Responsibilities

• Demonstrate a strong understanding of how application support contributes to the overall technology function and organizational objectives.
• Assist with vendor relationship management, including coordination with offshore managed services.
• Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices.
• Partner with development teams to guide improvements in application stability and supportability.
• Contribute to frameworks for managing capacity, throughput, and latency.
• Assist in defining and implementing application onboarding guidelines and standards.
• Support team members by fostering a collaborative environment and encouraging skill development.
• Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training.
• Participate in business review meetings to help align technology tools and strategies with business requirements.
• Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program.
• Perform other duties and functions as assigned.
• Support platform leadership in defining the platform roadmap and partnering with engineering teams and business stakeholders.
• Assist in executing resilience activities such as wargaming scenarios, chaos engineering tests, and disaster recovery drills.
• Contribute to automation initiatives aimed at reducing manual toil and improving platform efficiency.
• Support the enterprise‑wide observability strategy, including monitoring, logging, tracing, and alerting.
• Maintain hands‑on familiarity with platform architecture and services as needed for operational support.
• Assist in overseeing the operational health of production platforms (including OpenShift, ECS, CI/CD), ensuring SLAs are supported and incident processes are followed.
• Help implement and operate effective monitoring and observability strategies to support proactive issue detection and system health assessments.

Qualifications

• 6+ years of relevant experience in a hands‑on technical or support leadership role.
• Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability.
• Experience working with senior stakeholders or technology partners.
• Demonstrated experience supporting IT service improvements or platform stability initiatives.
• Strong communication and presentation skills, with the ability to convey technical concepts clearly.
• Experience supporting or contributing to technical roadmaps or operational workstreams.
• Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing.
• Ability to collaborate with cross‑functional support teams and technology groups.
• Strong organizational and workload‑planning skills.
• Consistently demonstrates clear and concise written and verbal communication skills.
• Ability to communicate appropriately with relevant stakeholders.
• Working knowledge of Generative AI concepts preferred.
• Experience with CI/CD and configuration management tools preferred.
• Experience with Red Hat OpenShift or similar Kubernetes technologies preferred.
• Experience working with databases such as Postgres, Oracle, MongoDB, or Redis preferred.
• Experience writing or maintaining code in Java, Python, Go, or similar languages preferred.
• Hands‑on experience with modern observability and monitoring tools (e.g., Prometheus, Grafana, Splunk, ELK) preferred.

Education

• Bachelor’s/University degree required; Master’s degree preferred.

------------------------------------------------------

Job Family Group: Technology

------------------------------------------------------

Job Family: Applications Support

------------------------------------------------------

Time Type: Full time

------------------------------------------------------

Primary Location Full Time Salary Range: $120,800.00 - $170,800.00

------------------------------------------------------

Most Relevant Skills: Please see the requirements listed above.

------------------------------------------------------

Other Relevant Skills: For complementary skills, please see above and/or contact the recruiter.

------------------------------------------------------

Automated Processing and AI

We use automated processing, including artificial intelligence, for our legitimate business interests (or our reasonable and appropriate business purposes) to identify and align the candidate's skills and abilities with a specific job opening. Additionally, if you so choose, or consent, we can match your skills and abilities to other suitable roles at Citi.

Importantly, all our hiring processes and decisions, including determining your suitability for a role, are conducted, checked, and decided by individuals. Our automated processing and AI do not involve relying on automatic or autonomous decision-making. Please refer to any Jurisdictional Considerations, with specific provisions for your country (where relevant) for further details.

------------------------------------------------------

This job opening is for an existing job vacancy.

------------------------------------------------------

Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.

If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.
View Citi’s EEO Policy Statement and the Know Your Rights poster.
