About the Role:

We are looking for a Site Reliability Engineer (SRE) with solid experience running production systems and working closely with development teams. The ideal candidate is comfortable with Linux, containers, Kubernetes, and CI/CD pipelines, and has a strong focus on reliability, monitoring, and incident handling. You will help keep our services stable, observable, and scalable while collaborating with engineers across the stack.

Responsibilities:

• Operate and maintain production systems with a focus on reliability, availability, and performance.

• Work with Docker and Kubernetes to deploy, update, and troubleshoot services.

• Configure and optimize Kubernetes resources (pods, deployments, services, ingress, config maps, secrets, etc.).

• Implement and maintain monitoring, logging, and alerting for applications and infrastructure.

• Build and improve CI/CD pipelines in collaboration with development and DevOps teams.

• Create and maintain dashboards for key service metrics (latency, error rate, throughput, resource usage).

• Participate in incident response: investigate issues, identify root causes, and propose fixes and improvements.

• Work closely with backend developers to improve service reliability, resilience, and observability.

• Contribute to capacity planning and performance tuning of services and infrastructure.

• Automate repetitive operational tasks using scripts or small tools.

• Document runbooks, procedures, and best practices for operating services in production.

Must-Have Qualifications:

• 3–5 years of professional experience in an SRE, DevOps, or infrastructure-focused engineering role.

• Strong understanding of Linux systems (shell, processes, networking, permissions, logs).

• Hands-on experience with Docker and Kubernetes in real environments.

• Practical experience with:

o Kubernetes deployments, services, ingress, config maps, and secrets

o Basic troubleshooting inside a cluster (pods failing, crashes, restarts, resource issues)

• Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK/EFK, Application Insights, or similar).

• Experience with CI/CD pipelines (Azure DevOps, GitHub Actions, GitLab CI, Jenkins, or similar).

• Ability to read and modify pipeline definitions and understand build → test → deploy flows.

• Basic programming/scripting skills in at least one language (e.g., Python, Bash, PowerShell, or Go).

• Understanding of core reliability concepts such as SLIs, SLOs, uptime, latency, and availability.

• Experience troubleshooting production issues using logs, metrics, and dashboards.

• Good communication skills and ability to collaborate with developers, QA, and product teams.

Nice-to-Have:

• Experience with at least one major cloud platform (Azure, AWS, Alibaba Cloud, or GCP).

• Experience with infrastructure as code (Terraform, Bicep, Pulumi, Helm, etc.).

• Experience with ingress controllers, API gateways, or service mesh.

• Familiarity with security best practices (secrets management, TLS/certificates, RBAC on Kubernetes or cloud).

• Experience participating in on-call rotations and using incident management tools (PagerDuty, Opsgenie, etc.).

• Experience contributing to post-incident reviews and implementing follow-up improvements.

Experience:

3–5 years

Site Reliability Engineer (SRE)

Site Reliability Engineer needed to keep production stable, observable, and scalable. Collaborate across teams on Linux, Kubernetes, CI/CD, monitoring, and incident response.

About Prime Gate
