Site Reliability Engineer 3

Location IN-Remote
ID 2026-10217
Position Type
Full-Time
Employee Type
Regular
Location Type
Hybrid

The Company

Serving the People Who Serve the People

 

Granicus is driven by the excitement of building, implementing, and maintaining technology that is transforming the Govtech industry by bringing governments and their constituents together. We are on a mission to support our customers in meeting the needs of their communities and implementing our technology in ways that are equitable and inclusive. Granicus has consistently appeared on the GovTech 100 list over the past 5 years and has been recognized on BuiltIn as one of the best companies to work for.

 

Over the last 25 years, we have served 5,500 federal, state, and local government agencies, and more than 300 million citizen subscribers power an unmatched Subscriber Network that uses our digital solutions to make the world a better place. With comprehensive cloud-based solutions for communications, government website design, meeting and agenda management software, records management, and digital services, Granicus empowers stronger relationships between government and residents across the U.S., U.K., Australia, New Zealand, and Canada. By simplifying interactions with residents while disseminating critical information, Granicus brings governments closer to the people they serve, driving meaningful change for communities around the globe.


Want to know more? See more of what we do here.

Job Summary

Granicus is the leading provider of citizen engagement technologies and services for the public sector, bringing governments closer to the people they serve with the first-and-only Civic Engagement Platform. Granicus works with more than 5,500 government organizations and connects more than 280 million people in the largest Citizen Subscriber Network of its kind.

What Your Impact Will Look Like

Granicus is seeking a Site Reliability Engineer (SRE3) with strong AIOps capabilities to modernize reliability engineering through observability, automation, and AI-assisted operations. In this role, you will improve service reliability, reduce operational toil, accelerate incident response, and help build scalable, resilient platforms supporting both traditional and AI/ML-powered workloads. 

AI, MCP & AIOps 

  • Lead adoption of AI-first SRE practices across monitoring, incident response, and automation 
  • Design and implement MCP-based integrations connecting systems like Elastic, Jira, and cloud platforms 
  • Build and operationalize AI agents for SRE workflows (incident triage, RCA, alert summarization, runbooks) 
  • Drive AIOps maturity: alert correlation, anomaly detection, assisted RCA 
  • Develop predictive models for capacity, failures, and incidents 

Day-to-day Operations 

On-call Production Support: 

  • Provide production support in shifts according to the team on-call roster.
  • When not on call for production support, work on SRE projects and on tickets escalated by Tech Support or raised by internal engineering and implementation teams.
  • Work on SRE backlog items.
  • Leverage AI-assisted triage tools & MCP frameworks to prioritize alerts, detect anomaly patterns, and reduce noise during on-call 
  • Continuously improve AI-driven incident routing and recommendation systems to optimize response efficiency. 

 

Monitor and Maintain Systems: 

  • Proactively monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability. 
  • Identify and address monitoring and observability gaps.
  • Implement effective alerting and notifications, minimizing false alerts.
  • Create and manage effective SRE dashboards to report key business metrics, SLAs, SLOs, SLIs, and error budgets.
  • Ensure the SRE team is meeting or improving on established SLOs.
  • Proactively evaluate capacity needs to handle growth in scale and traffic load.
  • Contribute to innovative solutions, such as an AI assistant, for proactive issue detection and response.
  • Design and implement AI/ML-based anomaly detection for proactive identification of system degradation 
  • Utilize predictive analytics models for capacity forecasting, incident prediction, and failure prevention. 
  • Integrate AIOps platforms (like Elastic AI assistant) for intelligent alert correlation and root cause suggestions. 
  • Develop self-learning monitoring systems that evolve with application behavior. 

 

System Reliability Improvements:

  • Actively participate in and track the execution of SRE projects aimed at improving system reliability.
  • Collaborate effectively across teams to prevent reliability issues.
  • Review change management tickets to identify and mitigate potential risks to system reliability.
  • Actively participate in change activities and verify that the SRE and Engineering teams perform accurate validations after implementation.
  • Participate in architecture reviews and assess the impact of architectural decisions on system reliability.
  • Drive initiatives to run chaos experiments to continuously learn about and improve the performance and stability of our systems.
  • Contribute to innovative solutions that enhance system reliability and scalability.
  • Drive initiatives to build self-healing systems using ML-based decision engines 
  • Use AI models to simulate failure scenarios and predict system behavior under stress (intelligent chaos engineering) 
  • Identify reliability risks using pattern detection across logs, metrics, and traces. 
  • Contribute to building adaptive scaling systems using ML-based workload prediction. 

 

Incident Management: 

  • Actively participate in troubleshooting and resolving incidents, performing root cause analysis, conducting incident postmortems, and implementing long-term fixes to prevent recurrence.
  • Acknowledge incidents promptly and recover from them quickly.
  • Maintain the quality of root cause analysis (RCA) and corrective action plans.
  • Proactively monitor, measure, and adhere to MTTR and MTTA requirements.
  • Improve the quality of SOPs and adopt AI tools to reduce MTTR.
  • Leverage AI-driven root cause analysis tools to accelerate incident diagnosis. 
  • Implement automated incident summarization and timeline reconstruction using NLP models 
  • Use AI copilots to recommend remediation steps during live incidents 
  • Analyze historical incidents using ML to identify recurring patterns and prevent recurrence.
  • Adopt AI-driven runbooks for faster response execution and decision-making. 

 

Automated Processes: 

  • Develop and maintain automation scripts and tools to streamline operations, reduce manual intervention, and reduce toil.
  • Build intelligent automation pipelines that adapt based on system conditions. 
  • Develop AI-powered bots/assistants for routine operational tasks. 
  • Implement reinforcement learning-based optimization for operational workflows. 
  • Continuously identify automation opportunities using operational data insights. 

 

Documentation: 

  • Create and maintain accurate documentation for technology, processes, and troubleshooting. 
  • Ensure completeness and knowledge sharing across teams. 
  • Contribute to innovative solutions for building an AI-based knowledge base.
  • Build and maintain an AI-powered knowledge base with semantic search capabilities. 
  • Implement NLP-based knowledge retrieval systems for faster troubleshooting 
  • Auto-generate documentation from system events, incidents, and architecture changes. 

 

Security: 

  • Implement and adhere to security best practices to protect our systems and data. 
  • Use AI-driven anomaly detection for security threats and unusual system behaviour. 
  • Collaborate with security teams to integrate ML-based threat detection and risk scoring systems.

 

Collaboration: 

  • Partner closely with Engineering teams to enhance reliability. 
  • Provide feedback on architecture and design. 
  • Participate in release reviews, risk assessments, and Go/No-Go decisions. 
  • Present monitoring and observability status to stakeholders 
  • Advocate for AI-first observability and reliability practices across engineering teams. 
  • Collaborate with data/ML teams to operationalize AI models in production environments (MLOps). 
  • Drive adoption of AI-assisted development and operational tooling across teams. 

 

You Will Love This Job If You Have

Tools and Technologies

  • 6+ years of experience in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems
  • Strong expertise in Linux/Unix, networking, distributed systems, and cloud platforms such as AWS, Azure, or Google Cloud 
  • Experience with scripting languages such as Python, Bash, or Ruby and programming languages such as Go, Java, or C++.
  • Advanced knowledge of cloud, monitoring, and observability tools (Elastic, Prometheus, Grafana, Pingdom)
  • Experience with infrastructure automation, CI/CD pipelines and configuration tools such as Terraform, Ansible, Chef, or Puppet. 
  • Experience integrating AIOps capabilities into observability stacks (metrics, logs, traces) for intelligent alerting, noise reduction, and root cause analysis. 
  • Experience working with AI-assisted coding tools such as Cursor, GitHub Copilot, or similar developer copilots 
  • Familiarity with Model Context Protocol (MCP) for integrating AI agents with enterprise systems (e.g., Jira, Elastic, cloud platforms) 
  • Ability to design or leverage AI agents for SRE workflows (incident triage, RCA generation, alert summarization, runbook execution) 
  • Experience building or integrating context-aware automation systems using MCP or similar frameworks 

 

Certifications:

Certifications such as AWS Certified Solutions Architect, AWS Certified Machine Learning – Specialty, or Google Cloud Professional DevOps Engineer are a plus. 

 

Job Info  

  • Shift work: rotating shifts
  • Remote: India

About Us

Don’t have all the skills/experience mentioned above? At Granicus, we are trying to build diverse, inclusive teams. We do not have degree requirements for most of our roles. If you don’t meet every requirement above but are excited to learn more, we encourage you to apply. We might just be able to find another role that could be a perfect fit!

 

Security and Privacy Requirements

  • Responsible for Granicus information security by appropriately preserving the Confidentiality, Integrity, and Availability (CIA) of Granicus information assets in accordance with the company's information security program.
  • Responsible for ensuring the data privacy of our employees and customers, their data, as well as taking all required privacy training in a timely manner, in accordance with company policies.

 

The Team

  • We are a remote-first company with a globally distributed workforce across the United States, Canada, United Kingdom, India, Armenia, Australia, and New Zealand.

 

The Culture

  • At Granicus, we are building a transparent, inclusive, and safe space for everyone who wants to be
    a part of our journey.
  • A few culture highlights include – Employee Resource Groups to encourage diverse voices
  • Coffee with Mark sessions – Our employees get to interact with our CEO on very important and
    sometimes difficult issues ranging from mental health to work-life balance and current affairs.
  • Microsoft Teams communities focused on wellness, art, furbabies, family, parenting, and more.
  • We bring in special guests from time to time to discuss issues that impact our employee
    population

The Impact

  • We are proud to serve dynamic organizations around the globe that use our digital solutions to make the world a better place — quite literally. We have so many powerful success stories that illustrate how our solutions are impacting the world. See more of our impact here.
