Senior Site Reliability Engineer

Position Type: Full-Time
Employee Type: Regular
Location Type: Remote

The Company

Serving the People Who Serve the People

Granicus is driven by the excitement of building, implementing, and maintaining technology that is transforming the GovTech industry by bringing governments and their constituents together. We are on a mission to support our customers in meeting the needs of their communities and implementing our technology in ways that are equitable and inclusive. Granicus has consistently appeared on the GovTech 100 list over the past 5 years and has been recognized on BuiltIn as one of the best companies to work for.

Over the last 25 years, we have served 5,500 federal, state, and local government agencies, and more than 300 million citizen subscribers power an unmatched Subscriber Network that uses our digital solutions to make the world a better place. With comprehensive cloud-based solutions for communications, government website design, meeting and agenda management software, records management, and digital services, Granicus empowers stronger relationships between government and residents across the U.S., U.K., Australia, New Zealand, and Canada. By simplifying interactions with residents while disseminating critical information, Granicus brings governments closer to the people they serve, driving meaningful change for communities around the globe.

Want to know more? See more of what we do here.

Job Summary

Granicus is seeking a Senior Site Reliability Engineer (SRE) with strong AIOps, automation, and AI proficiency to modernize reliability engineering through observability, intelligent incident response, and responsible AI-assisted operations. In this role, you will improve service reliability, reduce operational toil, accelerate incident response, and help build scalable, resilient platforms supporting traditional, cloud-native, and AI/ML-powered workloads. The role will also help operationalize AI-enabled SRE practices such as alert intelligence, assisted root-cause analysis, runbook automation, telemetry summarization, and governed self-healing workflows with appropriate human approval and audit controls.

What Your Impact Will Look Like

- Provide on-call production support using data-driven triage and AI-assisted insights to improve response speed and quality.
- Investigate customer and internal issues, support high-priority escalations, and drive rapid service restoration.
- Build and improve AI-assisted workflows for alert correlation, anomaly detection, telemetry summarization, noise reduction, incident enrichment, and controlled automated remediation.
- Design and maintain observability across logs, metrics, traces, and events using platforms such as ELK/OpenSearch and cloud-native monitoring tools.
- Lead incident troubleshooting by using telemetry, event correlation, deployment context, AI-generated summaries, and historical incident patterns to accelerate root cause identification and preventive fixes.
- Develop automation, runbooks, ChatOps workflows, and self-healing capabilities with human-in-the-loop approval, confidence thresholds, rollback plans, and audit trails.
- Drive system improvements that strengthen reliability, scalability, performance, and operational resilience.
- Partner with engineering, platform, and product teams to improve deployment safety, operational readiness, and service reliability.
- Maintain runbooks, troubleshooting guides, knowledge bases, and post-incident documentation that can be consumed by AI assistants and on-call engineers to improve readiness and knowledge sharing.
- Support capacity planning, performance tuning, SLO-based reliability practices, and proactive risk reduction.
- Apply security, privacy, access control, data protection, prompt safety, and operational guardrails across systems, automation, and AI-enabled services.

You Will Love This Job If You Have

- 5+ years in SRE, DevOps, system administration, or a similar role supporting large-scale, high-availability cloud environments.
- Strong expertise in Linux/Unix, networking, distributed systems, and cloud platforms such as AWS, Azure, or Google Cloud.
- Hands-on experience with observability platforms such as ELK, OpenSearch, Prometheus, Grafana, or similar tools.
- Strong understanding of AIOps concepts including anomaly detection, alert correlation, event deduplication, intelligent alerting, AI-assisted RCA, incident summarization, predictive signals, and automated or guided remediation.
- Strong scripting and automation skills using Python, Bash, Go, Java, or similar languages.
- Experience with infrastructure automation and configuration tools such as Terraform, Ansible, Chef, or Puppet.
- Experience supporting AI/ML or GenAI-enabled platforms, including model deployment, inference reliability, observability, latency, capacity, cost controls, and services such as AWS Bedrock, SageMaker, Azure AI, Google Vertex AI, or equivalent platforms.
- Familiarity with LLMOps or MLOps practices, including prompt/version management, evaluation, monitoring, drift detection, retrieval or knowledge-base integration, cost awareness, and model lifecycle support.
- Hands-on exposure to AI-enabled observability, incident management, or automation tools such as PagerDuty AIOps, Datadog Bits AI, Dynatrace Davis, New Relic AI, Splunk ITSI, Amazon Q, Azure AI/Copilot capabilities, OpenTelemetry, Prometheus, Grafana, OpenSearch, or similar platforms.
- Good understanding of incident management, RCA, runbook automation, SLOs, and continuous improvement practices.
- Certifications preferred: AWS Certified DevOps Engineer, AWS Certified Machine Learning – Specialty, Google Cloud Professional DevOps Engineer, Azure DevOps Engineer Expert, Kubernetes/CKA, or relevant AI/ML, AIOps, observability, or cloud automation certifications are a plus.

About Us

Don’t have all the skills/experience mentioned above? At Granicus, we are trying to build diverse, inclusive teams. We do not have degree requirements for most of our roles. If you don’t meet every requirement above but are excited to learn more, we encourage you to apply. We might just be able to find another role that could be a perfect fit!

Security and Privacy Requirements

- Responsible for Granicus information security by appropriately preserving the Confidentiality, Integrity, and Availability (CIA) of Granicus information assets in accordance with the company's information security program.
- Responsible for ensuring the privacy of our employees, our customers, and their data, and for completing all required privacy training in a timely manner, in accordance with company policies.

The Team

We are a remote-first company with a globally distributed workforce across the United States, Canada, United Kingdom, India, Armenia, Australia, and New Zealand.

The Culture

At Granicus, we are building a transparent, inclusive, and safe space for everyone who wants to be a part of our journey. A few culture highlights include:

- Employee Resource Groups to encourage diverse voices.
- Coffee with Mark sessions, where our employees get to interact with our CEO on very important and sometimes difficult issues ranging from mental health to work-life balance and current affairs.
- Microsoft Teams communities focused on wellness, art, furbabies, family, parenting, and more.
- We bring in special guests from time to time to discuss issues that impact our employee population.

The Impact

We are proud to serve dynamic organizations around the globe that use our digital solutions to make the world a better place — quite literally. We have so many powerful success stories that illustrate how our solutions are impacting the world. See more of our impact here.
