Reliability Engineer

January 31, 2026
400,000 / month
Application ends: February 28, 2026

Job Description

A reputable technology-driven organisation is seeking a skilled Reliability Engineer to strengthen platform stability, scalability, and performance. This role focuses on building reliable systems, improving observability, automating operations, and ensuring high availability across cloud-based infrastructure.

Key Responsibilities

  • Monitor system health, performance, and availability using tools such as Grafana, Prometheus, Datadog, or New Relic, and respond to incidents promptly.
  • Lead and document post-incident reviews, identify root causes, and implement preventive actions.
  • Develop scripts (Python, Bash) and use configuration management tools to automate deployments, operational tasks, and recovery procedures.
  • Build and maintain internal platforms and tools that enable self-healing systems, automated canary analysis, and large-scale performance tracing.
  • Collaborate with software teams to define Service Level Objectives (SLOs) and Error Budgets, and implement improvements to reduce manual toil and improve resilience.
  • Manage and optimise cloud resources on AWS, Google Cloud, or Azure using Infrastructure as Code (IaC), with a focus on performance, scalability, and cost efficiency.
  • Design and implement chaos engineering practices, disaster recovery automation, and capacity planning initiatives.

Requirements

  • 3–5 years of experience in DevOps, SRE, Linux System Administration, or Backend Engineering roles.
  • Proficiency in at least one scripting or programming language such as Python or Go.
  • Hands-on experience with cloud platforms including AWS, Google Cloud, or Azure.
  • Practical experience with containerisation and orchestration tools such as Docker and Kubernetes.
  • Working knowledge of monitoring and observability tools.
  • Experience building and maintaining CI/CD pipelines with tools such as GitLab CI, Jenkins, or GitHub Actions.

Core Skills

  • Excellent problem-solving and troubleshooting skills, especially under pressure.
  • Strong understanding of networking fundamentals, including TCP/IP, DNS, and HTTP/HTTPS.
  • Knowledge of database performance and reliability across PostgreSQL, MySQL, or MongoDB.
  • A systematic approach to automation with a strong drive to reduce manual processes.
  • Clear communication skills for effective collaboration with technical and non-technical stakeholders.
  • Understanding of infrastructure security best practices.

How to Apply

Interested and qualified candidates should send their CV and portfolio by email, using "Reliability Engineer" as the subject line.