Job Description
A reputable technology-driven organisation is seeking a skilled Reliability Engineer to strengthen platform stability, scalability, and performance. This role focuses on building reliable systems, improving observability, automating operations, and ensuring high availability across cloud-based infrastructure.
Key Responsibilities
- Monitor system health, performance, and availability using tools such as Grafana, Prometheus, Datadog, or New Relic, and respond to incidents promptly.
- Lead and document post-incident reviews, identify root causes, and implement preventive actions to avoid recurrence.
- Develop scripts (Python, Bash) and use configuration management tools to automate deployments, operational tasks, and recovery procedures.
- Build and maintain internal platforms and tools that enable self-healing systems, automated canary analysis, and large-scale performance tracing.
- Collaborate with software teams to define Service Level Objectives (SLOs) and Error Budgets, and implement improvements to reduce manual toil and improve resilience.
- Manage and optimise cloud resources on AWS, Google Cloud, or Azure using Infrastructure as Code (IaC), with a focus on performance, scalability, and cost efficiency.
- Design and implement chaos engineering practices, disaster recovery automation, and capacity planning initiatives.
Requirements
- 3–5 years of experience in DevOps, SRE, Linux System Administration, or Backend Engineering roles.
- Proficiency in at least one scripting or programming language such as Python or Go.
- Hands-on experience with cloud platforms including AWS, Google Cloud, or Azure.
- Practical experience with containerisation and orchestration tools such as Docker and Kubernetes.
- Working knowledge of monitoring and observability tools.
- Experience building and maintaining CI/CD pipelines with tools such as GitLab CI, Jenkins, or GitHub Actions.
Core Skills
- Excellent problem-solving and troubleshooting skills, especially under pressure.
- Strong understanding of network fundamentals including TCP/IP, DNS, and HTTP/S.
- Knowledge of database performance and reliability for systems such as PostgreSQL, MySQL, or MongoDB.
- A systematic approach to automation with a strong drive to reduce manual processes.
- Clear communication skills for effective collaboration with technical and non-technical stakeholders.
- Understanding of infrastructure security best practices.
How to Apply
Interested and qualified candidates should send their CV and portfolio by email, using "Reliability Engineer" as the subject line.

