Job Description
Moniepoint Incorporated is hiring a Senior Site Reliability Engineer to support and improve the reliability, scalability, and performance of its enterprise systems. This role is ideal for experienced engineers who are passionate about automation, observability, cloud infrastructure, and incident management. The successful candidate will play a key role in maintaining system stability while driving long-term reliability improvements across applications and infrastructure.
Responsibilities
- Participate in on-call rotations to detect, troubleshoot, and resolve outages, service degradation, and reliability issues across multiple environments.
- Lead major incident response processes, including coordinating cross-functional teams, managing communication updates, and documenting Root Cause Analyses (RCAs).
- Build and implement automation solutions that reduce repetitive operational tasks and improve system efficiency and resilience.
- Create, maintain, and optimize monitoring dashboards, alerting systems, and observability tools for infrastructure and applications.
- Collaborate with Product and Engineering teams to define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Contribute to feature development discussions to ensure observability and reliability are integrated from the early stages of development.
- Investigate and resolve escalated customer issues involving performance bottlenecks, system reliability, and complex infrastructure behavior.
Requirements
- Minimum of 4 years of experience in Site Reliability Engineering or a similar infrastructure-focused role.
- Strong understanding of distributed systems, microservices architecture, and software design principles.
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure.
- Solid experience working with Kubernetes and other container orchestration technologies.
- Knowledge of observability and monitoring tools such as Grafana, Prometheus, New Relic, Datadog, ELK Stack, SigNoz, and OpenTelemetry.
- Excellent troubleshooting and problem-solving skills, particularly in high-pressure on-call environments.
- Ability to build and maintain reliable monitoring dashboards, alerts, and operational workflows.
