The Site Reliability Engineer will scale cloud services, manage caching infrastructure, and improve service reliability and performance. Responsibilities include building monitoring into the code, defining alerts, and automating tasks. Programming expertise, particularly in backend languages, and strong communication skills are essential as the role involves collaborating with both technical and non-technical audiences.
We are looking for an engineer who is passionate about scaling cloud services to join our growing SRE team. The SRE team owns the caching infrastructure, tooling, and automation that support Atlassian's suite of Cloud products.
We'd love it if you had an understanding of modern cloud infrastructure, programming expertise, operational experience and a desire to change the status quo. We're looking for an engineer who can analyze and help improve our services and processes to get us to an even higher level of reliability, performance, scalability, and cost efficiency.
On your first day, we'll expect you to have:
- 1+ years of experience operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring into your code, tweaking dashboards, defining alerts, writing runbooks, etc.
- 1+ years of hands-on experience with public cloud offerings (AWS components like EC2, CloudFormation, RDS / Aurora, Caches, SQS - or equivalents, e.g. in GCP / Azure).
- Familiarity with Unix / Linux operating systems.
- A strong drive to debug, improve code, and automate routine tasks.
- Backend engineering experience in one or more prominent languages such as Java, Go or Python.
- Strong written and verbal communication skills, and an ability to explain complex technical issues to a range of technical and non-technical audiences (management, peers, clients).
It would be great, but not mandatory, if you had:
- Experience implementing caching solutions, strategies, and best practices.
- Experience in microservice architecture.
- Experience building web services and clients using REST/GraphQL.
Top Skills
Go
Java
Python