Site Reliability Engineer (SRE)
Site Reliability Engineer needed to keep production systems stable, observable, and scalable. You will collaborate across teams on Linux, Kubernetes, CI/CD, monitoring, and incident response.
About the Role:
We are looking for a Site Reliability Engineer (SRE) with solid experience running production systems and working closely with development teams. The ideal candidate is comfortable with Linux, containers, Kubernetes, and CI/CD pipelines, and has a strong focus on reliability, monitoring, and incident handling. You will help keep our services stable, observable, and scalable while collaborating with engineers across the stack.
Responsibilities:
• Operate and maintain production systems with a focus on reliability, availability, and performance.
• Work with Docker and Kubernetes to deploy, update, and troubleshoot services.
• Configure and optimize Kubernetes resources (pods, deployments, services, ingress, config maps, secrets, etc.).
• Implement and maintain monitoring, logging, and alerting for applications and infrastructure.
• Build and improve CI/CD pipelines in collaboration with development and DevOps teams.
• Create and maintain dashboards for key service metrics (latency, error rate, throughput, resource usage).
• Participate in incident response: investigate issues, identify root cause, and propose fixes and improvements.
• Work closely with backend developers to improve service reliability, resilience, and observability.
• Contribute to capacity planning and performance tuning of services and infrastructure.
• Automate repetitive operational tasks using scripts or small tools.
• Document runbooks, procedures, and best practices for operating services in production.
Must-Have Qualifications:
• 3–5 years of professional experience in an SRE, DevOps, or infrastructure-focused engineering role.
• Strong understanding of Linux systems (shell, processes, networking, permissions, logs).
• Hands-on experience with Docker and Kubernetes in real environments.
• Practical experience with:
o Kubernetes deployments, services, ingress, config maps, and secrets
o Basic troubleshooting inside a cluster (pods failing, crashes, restarts, resource issues)
• Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK/EFK, Application Insights, or similar).
• Experience with CI/CD pipelines (Azure DevOps, GitHub Actions, GitLab CI, Jenkins, or similar).
• Ability to read and modify pipeline definitions and understand build → test → deploy flows.
• Basic programming/scripting skills in at least one language (e.g., Python, Bash, PowerShell, or Go).
• Understanding of core reliability concepts such as SLIs, SLOs, uptime, latency, and availability.
• Experience troubleshooting production issues using logs, metrics, and dashboards.
• Good communication skills and ability to collaborate with developers, QA, and product teams.
Nice-to-Have:
• Experience with at least one major cloud platform (Azure, AWS, Alibaba Cloud, or GCP).
• Experience with infrastructure as code (Terraform, Bicep, Pulumi, Helm, etc.).
• Experience with ingress controllers, API gateways, or service mesh.
• Familiarity with security best practices (secrets management, TLS/certificates, RBAC on Kubernetes or cloud).
• Experience participating in on-call rotations and using incident management tools (PagerDuty, Opsgenie, etc.).
• Experience contributing to post-incident reviews and implementing follow-up improvements.
Experience:
3–5 years
Department:
Prime Digital
About Prime Gate
At Prime Gate, we are leaders in Infrastructure Technology System Integration with over two decades of expertise. Our mission is to provide innovative and reliable ICT solutions across industries, including telecommunications, IT, physical security, and digital services.
Committed to excellence, we partner with clients to transform their businesses, ensuring their systems are robust, secure, and future-ready.