Site Reliability Engineer 3

Location: IN-KA-Bangalore
ID: 2025-9621
Position Type: Full-Time
Employee Type: Regular
Location Type: Hybrid

The Company

Serving the People Who Serve the People

 

Granicus is driven by the excitement of building, implementing, and maintaining technology that is transforming the GovTech industry by bringing governments and their constituents together. We are on a mission to support our customers in meeting the needs of their communities and implementing our technology in ways that are equitable and inclusive. Granicus has consistently appeared on the GovTech 100 list over the past 5 years and has been recognized as one of the best companies to work for on BuiltIn.

 

Over the last 25 years, we have served 5,500 federal, state, and local government agencies, and more than 300 million citizen subscribers power an unmatched Subscriber Network that uses our digital solutions to make the world a better place. With comprehensive cloud-based solutions for communications, government website design, meeting and agenda management software, records management, and digital services, Granicus empowers stronger relationships between government and residents across the U.S., U.K., Australia, New Zealand, and Canada. By simplifying interactions with residents while disseminating critical information, Granicus brings governments closer to the people they serve, driving meaningful change for communities around the globe.


Want to know more? See more of what we do here.

Job Summary

Granicus is seeking an experienced and highly skilled Senior Site Reliability Engineer (SRE) to join our SRE team. As a Senior SRE, you will play a pivotal role in ensuring the reliability, scalability, and performance of our services. You will lead efforts to build and maintain a robust infrastructure for our business applications, automate processes, and guide the team in implementing site reliability best practices and adopting emerging technologies, including AI-based tools, to streamline operations and deliver measurable productivity improvements.

 

What Your Impact Will Look Like

On-call Production Support:

  • Provide production support in shifts according to the team on-call roster.
  • When not on call for production support, work on SRE projects and on tickets escalated by Tech Support or raised by internal engineering and implementation teams.
  • Work on SRE backlog items.

Monitor and Maintain Systems:

  • Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability.
  • Proactively monitor the overall uptime and availability of critical services.
  • Identify and address monitoring and observability gaps.
  • Implement effective alerting and notifications, minimizing false alerts.
  • Create and manage effective SRE dashboards to report key business metrics, SLAs, SLOs, SLIs, and error budgets.
  • Ensure the SRE team meets or improves on established SLOs.
  • Proactively plan capacity to handle growth in scale and traffic load.
  • Contribute to innovative solutions, such as an AI assistant for proactive issue detection and response.

System Reliability Improvements:

  • Actively participate in and track the execution of SRE projects aimed at improving system reliability.
  • Collaborate effectively across teams to prevent reliability issues.
  • Review change management tickets to identify and mitigate potential risks to system reliability.
  • Participate actively in change activities and verify that the SRE and Engineering teams perform accurate validations post-implementation.
  • Participate in architecture reviews and assess the impact of architectural decisions on system reliability.
  • Initiate chaos experiments to continuously learn about and improve the performance and stability of our systems.
  • Contribute to innovative solutions that enhance system reliability and scalability.

Incident Management:

  • Actively participate in troubleshooting and resolving incidents, performing root cause analysis, conducting incident postmortems, and implementing long-term fixes to prevent recurrence.
  • Acknowledge incidents promptly and recover from them quickly.
  • Maintain the quality of root cause analyses (RCAs) and corrective action plans.
  • Proactively monitor, measure, and adhere to MTTR and MTTA requirements.
  • Improve the quality of SOPs and adopt AI tools to reduce MTTR.

 

Automate Processes:

Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.

 

Collaboration:

Partner closely with DevOps and Software Engineering teams to enhance system reliability. Provide constructive feedback on design and architecture, actively support and monitor change and release processes, participate in risk assessments, PI planning, change reviews, and Go/No-Go decision calls. Actively present monitoring and observability status both pre- and post-release to all stakeholders involved in the release or change process.

 

 

Documentation:

  • Create and maintain documentation for technology, architecture, processes, procedures, and troubleshooting guides, and share knowledge within the team.
  • Ensure the completeness and accuracy of information.
  • Contribute to innovative solutions for building an AI-based knowledge base.

Security: Implement and adhere to security best practices to protect our systems and data.

You Will Love This Job If You Have

  • Technical Skills:
    • Expertise in monitoring and observability: Elastic and CloudWatch/Azure Monitor
    • Expertise in Linux/Windows OS and networking
    • Advanced knowledge of cloud services (AWS and Azure)
    • Advanced knowledge of container technologies: Docker and Kubernetes (K8s)
    • Proficiency in databases and queries: MSSQL, PostgreSQL, MongoDB, MySQL
    • Proficiency in scripting: Python/PowerShell/Bash
    • Working experience with CI/CD tools: GitLab/Azure DevOps or similar tools
    • Working experience with IaC tools: Terraform/Ansible
    • Working experience with configuration management: Chef
    • Working experience with incident response tools: PagerDuty, Jira
    • AI tools: Copilot, VS Code AI agents, or similar
  • Education: Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
  • Experience: At least 8 years of relevant experience in site reliability engineering, with a proven track record of managing complex, medium- to large-scale high-availability systems.
  • Problem-Solving: Strong analytical and problem-solving skills with the ability to diagnose and resolve complex issues efficiently.
  • Communication: Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders.
  • Leadership: Demonstrated ability to lead and mentor a team, drive projects to completion, and manage cross-functional initiatives.
  • Certifications: Relevant certifications such as Elastic Certified Observability Engineer, AWS Certified Solutions Architect, or Certified Kubernetes Administrator, or equivalent hands-on experience, are highly valued.

 

These statements are intended to describe the general nature and level of work being performed by employees assigned to this job.  This is not intended to be an exhaustive list of all responsibilities, duties, and skills required of employees assigned to this job. 

  • The position requires flexibility in working hours to cover for any overlap and attend team meetings as needed. 
  • Shift Time: 24/7 on-call, including weekends (typically one week every month) 

About Us

Don’t have all the skills/experience mentioned above? At Granicus, we are trying to build diverse, inclusive teams. We do not have degree requirements for most of our roles. If you don’t meet every requirement above but are excited to learn more, we encourage you to apply. We might just be able to find another role that could be a perfect fit!

 

Security and Privacy Requirements

  • Responsible for Granicus information security by appropriately preserving the Confidentiality, Integrity, and Availability (CIA) of Granicus information assets in accordance with the company's information security program.
  • Responsible for protecting the privacy of our employees, our customers, and their data, and for completing all required privacy training in a timely manner, in accordance with company policies.

 

The Team

  • We are a remote-first company with a globally distributed workforce across the United States, Canada, United Kingdom, India, Armenia, Australia, and New Zealand.

 

The Culture

  • At Granicus, we are building a transparent, inclusive, and safe space for everyone who wants to be a part of our journey.
  • A few culture highlights include:
  • Employee Resource Groups to encourage diverse voices.
  • Coffee with Mark sessions: our employees get to interact with our CEO on very important and sometimes difficult issues ranging from mental health to work-life balance and current affairs.
  • Microsoft Teams communities focused on wellness, art, furbabies, family, parenting, and more.
  • We bring in special guests from time to time to discuss issues that impact our employee population.

The Impact

  • We are proud to serve dynamic organizations around the globe that use our digital solutions to make the world a better place — quite literally. We have so many powerful success stories that illustrate how our solutions are impacting the world. See more of our impact here.
