Job description
IT
Site Reliability Engineer (SRE)
Quebec
Simons Campus - IT
Full time
Are you looking to join our Information Technology team in a unique role that contributes to the optimal maintenance of our production environment? Join the Simons family as a Site Reliability Engineer (SRE).
The person in this role plays a key part in ensuring the smooth operation of our production environment by adopting a proactive, software-engineering-oriented approach. Reporting to the Director of Solution Architecture and Software Engineering, the SRE is responsible for ensuring the continuous availability of large-scale distributed software applications while maintaining high levels of performance and reliability.
Key Responsibilities:
- Provide primary operational support for multiple large-scale distributed software applications.
- Collect and analyze metrics from operating systems and applications to support performance optimization and incident troubleshooting.
- Measure and optimize system performance.
- Deliver infrastructure services using Infrastructure as Code (IaC).
- Maintain services that use the Operator Framework.
- Maintain and enhance continuous integration and continuous deployment (CI/CD) tools using ArgoCD and GitHub Actions.
- Automate IT operations tasks using Ansible.
- Participate in system design consultations, platform management, and capacity planning.
- Balance feature development velocity against reliability using well-defined service-level objectives (SLOs).
- Collaborate with development teams to improve services through rigorous testing procedures.
- Build sustainable systems and services through automation and continuous improvement.
- Develop software and systems to manage platform infrastructure and applications.
Desired Profile:
- Bachelor’s degree in computer science, software engineering, IT engineering, electrical engineering, or another relevant field.
- At least two (2) years of experience in a role related to DevOps, SRE, platform engineering, or software engineering.
- Experience with Kubernetes, preferably Red Hat OpenShift.
- Experience with full-stack observability platforms such as Datadog or New Relic.
- Practical coding knowledge beyond simple scripting.
- Strong understanding of cloud-native approaches.
- Advanced programming skills (structured and object-oriented) in one or more high-level languages such as Java, Python, C/C++, Go, or JavaScript.
- Proactive approach to identifying issues, performance bottlenecks, and areas for improvement.
- Strong teamwork abilities and communication skills to work effectively with diverse stakeholders in a constantly evolving environment.
- Ability to communicate effectively in French and English, both spoken and written; English is required to use systems and tools and to carry out various tasks.
Benefits Available:
- Telemedicine service and an Employee and Family Assistance Program.
- Group insurance plan and RRSP.
- Up to 40% off Simons purchases.
- Fitness area with changing rooms, group classes, and kinesiology services.
- Cafeteria service offering an extensive and affordable menu.